Add __slots__ entries. #4637

Open
knotapun wants to merge 3 commits into pymupdf:main from knotapun:much-improve

@knotapun
Contributor

Hello, all this does is add a __slots__ entry to a few classes. This small change makes an outsized impact, reducing the size of instances dramatically, and leads the way to efficient and fully typed page extraction.

from pympler import asizeof


class Rect:
    def __init__(self, a, b, c, d):
        self.x0: float = float(a)
        self.y0: float = float(b)
        self.x1: float = float(c)
        self.y1: float = float(d)


class RectWSlot:
    __slots__ = ("x0", "y0", "x1", "y1")

    def __init__(self, a, b, c, d):
        self.x0: float = float(a)
        self.y0: float = float(b)
        self.x1: float = float(c)
        self.y1: float = float(d)


print(asizeof.asizeof(Rect(1, 2, 3, 4)))       # 624
print(asizeof.asizeof(RectWSlot(1, 2, 3, 4)))  # 160
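
The mechanism behind the size difference can be checked without pympler: a class that declares __slots__ has no per-instance __dict__, so it also rejects attributes outside the declared names. A minimal sketch, using the hypothetical names PlainRect and SlottedRect rather than PyMuPDF's own classes:

```python
class PlainRect:
    """Ordinary class: attributes live in a per-instance __dict__."""

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = map(float, (x0, y0, x1, y1))


class SlottedRect:
    """Same data, stored in fixed slots instead of a __dict__."""

    __slots__ = ("x0", "y0", "x1", "y1")

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = map(float, (x0, y0, x1, y1))


plain = PlainRect(1, 2, 3, 4)
slotted = SlottedRect(1, 2, 3, 4)

has_dict_plain = hasattr(plain, "__dict__")
has_dict_slotted = hasattr(slotted, "__dict__")

# An ordinary instance accepts arbitrary new attributes.
plain.label = "ok"

# A slotted instance only accepts the names listed in __slots__.
try:
    slotted.label = "nope"
    rejected = False
except AttributeError:
    rejected = True

print(has_dict_plain, has_dict_slotted, rejected)
```

One consequence worth keeping in mind is that any caller who currently attaches extra attributes to such instances would break once __slots__ is added.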

@knotapun
Contributor Author

I noticed the reasoning for non-typed values in the "Structure of Dictionary Outputs" section.

If this pull gets merged, would you accept a pull that types the dictionaries?

@JorjMcKie
Collaborator

Thanks for the idea - and you are quite right: we should have considered doing this.
Before we accept: could you please extend this PR to the Matrix class? The slots here would carry the names "a" through "f".

@JorjMcKie JorjMcKie self-requested a review August 15, 2025 21:31
Collaborator

@JorjMcKie JorjMcKie left a comment

Please extend this idea to the remaining "geometry" class Matrix.

@knotapun
Contributor Author

please extend this idea to the remaining "geometry" class Matrix.

Gladly! Note that this might be limited in its effectiveness because of some upstream code, namely the SWIG-generated code. It seems easy enough to add, though.

@knotapun knotapun requested a review from JorjMcKie August 16, 2025 22:36
@julian-smith-artifex-com
Collaborator

I think the test failures can be fixed by deleting IdentityMatrix's __setattr__() method. Could you try doing this in your PR?

[Presumably this was somehow forcing the creation of a __dict__ and so messing up lookups in __slots__.]
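
For context, one general Python behaviour that reintroduces a __dict__ is a subclass that does not declare its own __slots__. Whether this is what caused the failures here is not established in this thread; the sketch below only illustrates the language rule, with hypothetical class names.

```python
class SlottedBase:
    """Hypothetical slotted base class (stand-in, not PyMuPDF's Matrix)."""

    __slots__ = ("a", "b")

    def __init__(self):
        self.a = 1.0
        self.b = 0.0


class PlainChild(SlottedBase):
    """No __slots__ declared, so instances get a __dict__ again."""


class SlottedChild(SlottedBase):
    """An empty __slots__ keeps instances free of a __dict__."""

    __slots__ = ()


plain_child = PlainChild()
slotted_child = SlottedChild()

child_has_dict = hasattr(plain_child, "__dict__")
slotted_child_has_dict = hasattr(slotted_child, "__dict__")

# The inherited slots still work in both subclasses.
plain_child.a = 2.0
slotted_child.a = 2.0

print(child_has_dict, slotted_child_has_dict)
```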

@knotapun
Contributor Author

Apologies, I've been trying to get a job, and I'm helping someone move. I also just found out how to run your tests locally via act, so it's a bit easier to make commits/changes.

@knotapun
Contributor Author

I think the test failures can be fixed by deleting IdentityMatrix's __setattr__() method.

You have to be careful with that, as you don't want someone to change the identity matrix to be anything else.

@JorjMcKie
Collaborator

JorjMcKie commented Aug 29, 2025

I haven't tested it, but the only reason for the existence of pymupdf.IdentityMatrix is its immutability. This is ensured by __setattr__().
So deleting this method also deletes the reason for having IdentityMatrix at all.
I could think of a way out:
Define this object as a @property
Sorry, that will not work either!

@JorjMcKie
Collaborator

I think it is best to give up on supplying Matrix with slots.

@julian-smith-artifex-com
Collaborator

This passes tests for me:

class IdentityMatrix(Matrix):
    def __init__(self):
        # Deliberately do not call Matrix.__init__(self).
        pass
    
    @property
    def a(self):
        return 1
    
    @property
    def b(self):
        return 0
    
    @property
    def c(self):
        return 0
    
    @property
    def d(self):
        return 1
    
    @property
    def e(self):
        return 0
    
    @property
    def f(self):
        return 0

Identity = IdentityMatrix()

assert Identity.a == 1
assert Identity.b == 0

try:
    Identity.a = 23
except Exception:
    pass
else:
    assert 0
  • The error from Identity.a = 23 is AttributeError: property 'a' of 'IdentityMatrix' object has no setter.
  • Instances of IdentityMatrix() will not benefit from the __slots__ optimisation, but this probably doesn't matter.
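
If keeping the identity object free of a __dict__ ever did matter, one possible variant is to give the subclass an empty __slots__ and block writes in __setattr__(), filling the inherited slots through object.__setattr__(). This is only a sketch under assumptions: Matrix below is a hypothetical slotted stand-in, FrozenIdentity is an invented name, and none of this has been run against PyMuPDF's test suite.

```python
class Matrix:
    """Hypothetical stand-in for a slotted Matrix (not PyMuPDF's class)."""

    __slots__ = ("a", "b", "c", "d", "e", "f")

    def __init__(self, a=1.0, b=0.0, c=0.0, d=1.0, e=0.0, f=0.0):
        self.a, self.b, self.c = float(a), float(b), float(c)
        self.d, self.e, self.f = float(d), float(e), float(f)


class FrozenIdentity(Matrix):
    """Read-only identity matrix that stays free of a __dict__."""

    __slots__ = ()

    def __init__(self):
        # Bypass our own __setattr__ to fill the inherited slots once.
        values = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
        for name, value in zip(Matrix.__slots__, values):
            object.__setattr__(self, name, value)

    def __setattr__(self, name, value):
        raise AttributeError("the identity matrix is read-only")

    def __delattr__(self, name):
        raise AttributeError("the identity matrix is read-only")


identity = FrozenIdentity()

try:
    identity.a = 23
    write_blocked = False
except AttributeError:
    write_blocked = True

print(identity.a, identity.b, write_blocked, hasattr(identity, "__dict__"))
```

As with any pure-Python guard, a determined caller can still bypass it with object.__setattr__(), so this protects against accidents rather than deliberate modification.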

@julian-smith-artifex-com
Collaborator

I've run some simple tests to see whether using __slots__ makes any difference.

Unfortunately it looks like space and speed do not significantly improve.

It can be difficult to figure out a way to measure things like this in a useful way, and I would be very happy if better evidence showed my conclusion to be incorrect.

In particular, it would be fantastic if there were evidence to back up the original claim that:

this small change makes an outsized impact, reducing the size of instances
dramatically, and leads the way to efficient and fully typed page extraction.

Here are the three functions I've used:

import os
import sys
import time

import pymupdf


def test_4637():
    '''
    Shows memory size of different objects.
    '''
    print('', flush=1)
    import subprocess
    subprocess.run(f'pip install pympler', shell=1, check=1)
    import pympler
    import pympler.asizeof
    p = pymupdf.Point()
    r = pymupdf.Rect()
    m = pymupdf.Matrix()

    print(f'{sys.getsizeof(p)=}')
    print(f'{sys.getsizeof(r)=}')
    print(f'{sys.getsizeof(m)=}')
    print(f'{sys.getsizeof(pymupdf.Identity)=}')

    print(f'{pympler.asizeof.asizeof(p)=}')
    print(f'{pympler.asizeof.asizeof(r)=}')
    print(f'{pympler.asizeof.asizeof(m)=}')
    print(f'{pympler.asizeof.asizeof(pymupdf.Identity)=}')

    points1000 = 1000 * (pymupdf.Point(),)
    print(f'{sys.getsizeof(points1000)=}')
    print(f'{pympler.asizeof.asizeof(points1000)=}')

    class Foo:
        def __init__(self):
            self.a = pymupdf.Rect()
            self.b = pymupdf.Rect()
            self.c = pymupdf.Rect()
            self.d = pymupdf.Rect()
            self.e = pymupdf.Rect()
            self.f = pymupdf.Rect()
            self.g = pymupdf.Rect()
            self.h = pymupdf.Rect()
            self.i = pymupdf.Rect()
            self.j = pymupdf.Rect()
    f = Foo()
    print(f'{sys.getsizeof(f)=}')
    print(f'{pympler.asizeof.asizeof(f)=}')

def test_4637b():
    '''
    Shows memory use when extracting text from large pdf.
    '''
    print()
    path = os.path.normpath(f'{__file__}/../../tests/resources/test_3594.pdf')
    texts = list()
    import psutil
    process = psutil.Process()
    t = time.time()
    a = process.memory_info().rss
    with pymupdf.open(path) as document:
        for i, page in enumerate(document):
            #print(f'{i+1}/{len(document)}')
            text = page.get_text()
            texts.append(text)
        t = time.time() - t
        b = process.memory_info().rss
        print(f'{a=:,}')
        print(f'{b=:,}')
        print(f'{t=}')

    # Use <texts>.
    print(sum([len(i) for i in texts]))

    wt = pymupdf.TOOLS.mupdf_warnings()


def test_4637c():
    '''
    Shows time taken to do many operations on a list of pymupdf.Rect objects.
    '''
    print()
    import random
    random.seed()
    items = list()
    for i in range(1000):
        a = random.random()
        b = random.random()
        c = random.random()
        d = random.random()
        items.append(pymupdf.Rect(a, b, c, d))
    operations = list()
    for i in range(1000*1000):
        a = random.randint(0, len(items)-1)
        b = random.randint(0, len(items)-1)
        operations.append((a, b))
    t = time.time()
    for a, b in operations:
        items[a] += items[b]
    t = time.time() - t
    print(f'{t=}')

Results:

test_4637

    pymupdf-1.27.1:
        PyMuPDF/tests/test_geometry.py::test_4637 
        Requirement already satisfied: pympler in ./venv-aptest-3.13.5-64/lib/python3.13/site-packages (1.1)
        sys.getsizeof(p)=48
        sys.getsizeof(r)=48
        sys.getsizeof(m)=48
        sys.getsizeof(pymupdf.Identity)=48
        pympler.asizeof.asizeof(p)=456
        pympler.asizeof.asizeof(r)=592
        pympler.asizeof.asizeof(m)=656
        pympler.asizeof.asizeof(pymupdf.Identity)=680
        sys.getsizeof(points1000)=8040
        pympler.asizeof.asizeof(points1000)=8488
        sys.getsizeof(f)=48
        pympler.asizeof.asizeof(f)=4216
        FAILED
        PyMuPDF/tests/test_geometry.py::test_4637b 
        a=102,068,224
        b=117,882,880

    with __slots__:
        PyMuPDF/tests/test_geometry.py::test_4637 
        Requirement already satisfied: pympler in ./venv-aptest-3.13.5-64/lib/python3.13/site-packages (1.1)
        sys.getsizeof(p)=48
        sys.getsizeof(r)=64
        sys.getsizeof(m)=80
        sys.getsizeof(pymupdf.Identity)=96
        pympler.asizeof.asizeof(p)=72
        pympler.asizeof.asizeof(r)=160
        pympler.asizeof.asizeof(m)=104
        pympler.asizeof.asizeof(pymupdf.Identity)=464
        sys.getsizeof(points1000)=8040
        pympler.asizeof.asizeof(points1000)=8112
        sys.getsizeof(f)=48
        pympler.asizeof.asizeof(f)=2424
        PASSED
        PyMuPDF/tests/test_geometry.py::test_4637b 
        a=101,851,136
        b=117,121,024

test_4637b (t, in seconds)

    pymupdf-1.27.1:
        0.223, 0.2246

    with __slots__:
        0.2308, 0.2213

test_4637c (t, in seconds)

    pymupdf-1.27.1:
        1.0739, 1.0209, 1.07179

    with __slots__:
        0.9868, 1.0865, 1.03134

The only significant difference seems to be test_4637()'s pympler.asizeof.asizeof(f), where f is an object containing ten pymupdf.Rect instances.
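
A complementary way to look at memory is to build many distinct instances and compare the bytes Python actually allocated, using the standard-library tracemalloc module. The sketch below uses hypothetical stand-in classes (PlainRect, SlottedRect) and an invented helper, allocated_bytes(); it does not measure PyMuPDF itself, and absolute numbers will vary by Python version.

```python
import tracemalloc


class PlainRect:
    """Hypothetical stand-in with a per-instance __dict__."""

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = map(float, (x0, y0, x1, y1))


class SlottedRect:
    """Hypothetical stand-in using __slots__."""

    __slots__ = ("x0", "y0", "x1", "y1")

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = map(float, (x0, y0, x1, y1))


def allocated_bytes(cls, count=20_000):
    """Return the bytes allocated while building `count` instances of cls."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    instances = [cls(i, i, i + 1, i + 1) for i in range(count)]
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del instances
    return after - before


plain_bytes = allocated_bytes(PlainRect)
slotted_bytes = allocated_bytes(SlottedRect)
print(f"{plain_bytes=:,}")
print(f"{slotted_bytes=:,}")
```

Unlike 1000 * (pymupdf.Point(),), which repeats a single object 1000 times, the list comprehension here creates a separate instance per element, so per-instance overhead shows up in the total.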

Even with __slots__, Python does not constrain the values in Point, Rect and Matrix to be floats. Maybe this is the main source of slow-down.

@JorjMcKie has suggested looking at using Python's array, which will fix the types in advance.
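
As a rough illustration of that idea, the standard-library array module with typecode "d" stores C doubles, so non-numeric values are rejected at assignment time and integers are converted to float. ArrayRect below is a hypothetical name and a minimal sketch, not a proposal for PyMuPDF's actual layout, and it makes no claim about memory use or speed.

```python
from array import array


class ArrayRect:
    """Hypothetical rectangle backed by an array of C doubles."""

    __slots__ = ("_values",)

    def __init__(self, x0=0.0, y0=0.0, x1=0.0, y1=0.0):
        # Typecode "d" fixes every element to a C double.
        self._values = array("d", (x0, y0, x1, y1))

    def __getitem__(self, index):
        return self._values[index]

    def __setitem__(self, index, value):
        # array raises TypeError for values that are not real numbers.
        self._values[index] = value

    @property
    def x0(self):
        return self._values[0]

    @x0.setter
    def x0(self, value):
        self._values[0] = value


rect = ArrayRect(1, 2, 3, 4)
rect.x0 = 5  # stored as the float 5.0

try:
    rect.x0 = "wide"
    type_enforced = False
except TypeError:
    type_enforced = True

print(rect.x0, type(rect[1]).__name__, type_enforced)
```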
