Image scale blobs not cleaned by zeopack after removing ScalesDict

Newer versions of Plone seem to store image scales as blobs (NamedBlobImage), while older versions of Plone stored scales in annotations directly within the ZODB. That's OK. As part of a migration I wrote some code to purge all image scales by deleting the plone.scale annotation. After packing the ZODB, the number of blob files remained identical. This is a bit surprising. Shouldn't the ZODB/zeopack remove unused blob files automatically?

Dexterity content types (plone.namedfile) use the type of the field for scales. NamedImageField stores its scales in ZODB, NamedBlobImageField stores its scales as blobs.

The result of the pack IS surprising. Doesn't that mean that something in the ZODB is still referring to those blobs? Or was the deletion simply too recent to be packed?

Nope, I explicitly checked that the plone.scale annotation was removed. This observation is somewhat consistent with former vague observations that "newer" versions of Plone (or the ZODB) would not pack properly in the context of blob files.

I tried this with a clean Plone. Deleting the plone.scale annotation from an OOTB Image content object worked properly. All scale blobs were gone after packing.

Then I added the Lead image behavior to the versioned OOTB Document content type. Now, there seems to be the same bug we fixed earlier for Mosaic tile annotations: versioning creates zero-size blobs in blob storage. Nevertheless, deleting plone.scale and packing removed all the current scales. Only the originals were left (and those zero-size blobs from "versioned scales").

I had the exact same problem... I migrated from NamedImage to NamedBlobImage and purged all scales by removing plone.scale from the annotations. But the database did not really shrink in size after packing.

Before the migration I had 8000 plone.namedfile.file.NamedImage instances stored.
After the migration there were still 3500 left.

After digging a bit I probably found the reason why packing the DB does not help:

>>> named_image = db._storage.load('\x00\x00\x00\x00\x00%\x90\xf8')
>>> referencesf(named_image[0])
['\x00\x00\x00\x00\x00%\x90\xfb']
>>> ref = db._storage.load('\x00\x00\x00\x00\x00%\x90\xfb')
>>> ref[0][:100]
'cplone.namedfile.file\nFileChunk\nq\x01.}q\x02U\x05_dataq\x03T\\\xfc$\x00\xff\xd8\xff\xe1A\xdaExif\x00\x00II*\x00\x08\x00\x00\x00\x0b\x00\x0e\x01\x02\x00 \x00\x00\x00\x92\x00\x00\x00\x0f\x01\x02\x00\x05\x00\x00\x00\xb2\x00\x00\x00\x10\x01'

The NamedImage object still has references to FileChunks. I assume that's the reason why it does not get removed by the GC.
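For anyone who wants to count record classes the way the pickle header above suggests, here is a minimal sketch using only the standard library. It assumes the class is written with a GLOBAL opcode, as in the FileChunk record shown above; records written with other pickle protocols may encode the class differently, in which case it returns None. The helper names `record_class` and `count_classes` are made up for this example.

```python
import pickletools
from collections import Counter


def record_class(data):
    """Return 'module name' from the first GLOBAL opcode of a record, or None."""
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            return arg
        if opcode.name == "STOP":
            # First pickle ended without naming a class via GLOBAL.
            break
    return None


def count_classes(datas):
    """Count records per class for an iterable of raw record pickles."""
    return Counter(record_class(data) for data in datas)


# The start of the FileChunk record from the session above (class pickle only).
sample = b"cplone.namedfile.file\nFileChunk\nq\x01."
counts = count_classes([sample, sample, b"K\x01."])
```

Nothing is unpickled here, so no Plone code has to be importable to run it.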

I'll try to remove those and report again.


So it turns out to be a rabbit hole... what else. I had some RelationValues stored with references to content objects, which in the end referenced NamedImages that were untouched by any migration because they are no longer traversable.

More precisely, the RelationValue's parent pointer was still pointing to deleted content.
Thus those scales remained in the DB as well.

I'm currently not sure why this is happening. I have a ton of custom code, so most likely I introduced this problem on my own.

But it seems to me that objects with z3c.relationfield relations do not get removed while packing, because of the parent pointer on the RelationValue.
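To illustrate the idea, here is a toy model. It assumes that packing keeps every object that is still reachable from the root by following references. The dictionary and the names in it (`relation_catalog`, `deleted_doc`, ...) are invented for the example and are not real ZODB structures.

```python
# Toy model of reachability-based garbage collection during a pack.
# Keys stand in for OIDs; values list the objects each one references.

def reachable(graph, root):
    """Return the set of objects reachable from root by following references."""
    seen = set()
    stack = [root]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(graph.get(oid, ()))
    return seen


graph = {
    "root": ["site"],
    "site": ["relation_catalog", "live_doc"],
    "relation_catalog": ["relation_value"],
    # The document was removed from the site, but the relation's
    # parent pointer still references it.
    "relation_value": ["deleted_doc"],
    "deleted_doc": ["named_image"],
    "named_image": ["file_chunk"],
    "file_chunk": [],
    "live_doc": [],
}

# With the stale parent pointer in place nothing is collectable.
garbage_before = set(graph) - reachable(graph, "root")

# Clearing the parent pointer makes the whole chain unreachable.
graph["relation_value"] = []
garbage_after = set(graph) - reachable(graph, "root")
```

In this model the deleted document, its image and the file chunk only become garbage once the relation stops pointing at the document.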

@zopyx My proposal is that you check your relation catalog for broken relations.
Or follow all references using (or adapting) the ZODB/scripts/analyze.py script.
I ended up changing the analyze_rec method in order to show and inspect refs:

from ZODB.serialize import referencesf
def analyze_rec(report, record):
    oid = record.oid
    report.OIDS += 1
    if record.data is None:
        # No pickle -- aborted version or undo of object creation.
        return
    if '\x00\x00\x00\x00\x00%f\xc3' in referencesf(record.data):
        import pdb; pdb.set_trace()
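Instead of stopping in pdb for one hard-coded OID, the same information can be collected into a reverse index and walked upward. This is a rough sketch under the assumption that you feed it (oid, referenced oids) pairs, which in a real run could come from `referencesf(record.data)` as used above. The helper names `build_referrers` and `path_to_root`, and the short byte strings standing in for OIDs, are made up here.

```python
from collections import defaultdict, deque


def build_referrers(records):
    """Map each OID to the set of OIDs whose records reference it."""
    referrers = defaultdict(set)
    for oid, refs in records:
        for ref in refs:
            referrers[ref].add(oid)
    return referrers


def path_to_root(referrers, oid, root):
    """Breadth-first walk from oid up through its referrers.

    Returns one chain [oid, ..., root] or None if the root is never reached.
    """
    queue = deque([[oid]])
    seen = {oid}
    while queue:
        path = queue.popleft()
        if path[-1] == root:
            return path
        for parent in sorted(referrers.get(path[-1], ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# Hypothetical records: (oid, oids referenced by that record).
records = [
    (b"root", [b"catalog"]),
    (b"catalog", [b"rel"]),
    (b"rel", [b"doc"]),
    (b"doc", [b"img"]),
    (b"img", []),
]
referrers = build_referrers(records)
chain = path_to_root(referrers, b"img", b"root")
```

The returned chain shows which objects keep the image alive, which is what I was looking for by hand in the debugger.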

Currently I'm running a script which iterates over all transaction records -> loads the object -> checks if the UID is NOT in portal_catalog -> iterates over all fields -> checks for relations -> removes the parent pointer from the RelationValue to the object -> removes the relation -> packs the DB -> checks if the object has been collected by the GC.

I'll follow up with an update once I'm done there and report the result.

Update:

I got rid of almost all my binary data in the ZODB by removing all references pointing to deleted objects.
I found references to deleted objects in:

  • RelationValue parent pointer
  • Relation catalog various attributes
  • IntIds of RelationValue's -> KeyReferenceToPersistent object attribute.

The key point here is: I had big structures on the website, like nested Folders with hundreds of objects, which remained in the DB because the top object still had a reference stored. This means I have in the meantime got rid of 100,000s of objects. NamedImage/FileChunk obviously used the most space, but in terms of object count they were only the tip of the iceberg.

Remove/unindex relations which are basically broken:

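# imports needed by the fragment below
from zope.component import getUtility
from zc.relation.interfaces import ICatalog

        portal_catalog = self.portal.portal_catalog
        relation_catalog = getUtility(ICatalog)

        for token in relation_catalog.findRelationTokens():
            rel = relation_catalog.resolveRelationToken(token)
            if not rel.__parent__:
                # relation without a parent pointer
                relation_catalog.unindex(rel)
            elif not portal_catalog.unrestrictedSearchResults(UID=rel.__parent__.UID()):
                # parent is no longer in the portal catalog
                relation_catalog.unindex(rel)
            elif not rel.from_object:
                # source object cannot be resolved anymore
                relation_catalog.unindex(rel)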

Iterate over literally everything and, for each object that is no longer in the UID index, iterate over all its fields and remove the references (RelationValue).
I used the UID index as the single source of truth for objects which should stay in the DB.

from zope.component.hooks import setSite
from plone.uuid.interfaces import IUUID
from plone.dexterity.interfaces import IDexterityContent
from plone.dexterity.utils import iterSchemata
from zope.schema import getFieldsInOrder
from z3c.relationfield.interfaces import IRelation
from z3c.relationfield.interfaces import IRelationChoice
from z3c.relationfield.interfaces import IRelationList
from z3c.relationfield import RelationValue
import transaction
import datetime

plone = app.Plone
setSite(plone)



def remove_parent(obj, field):
    value = field.get(field.interface(obj))
    if not value:
        return
    if isinstance(value, RelationValue):
        value.__parent__ = None
    else:
        for relation in value:
            relation.__parent__ = None


def remove_refs(obj):
    for schemata in iterSchemata(obj):
        for name, field in getFieldsInOrder(schemata):
            if IRelation.providedBy(field) or IRelationChoice.providedBy(field):
                remove_parent(obj, field)
                field.set(field.interface(obj), None)
            if IRelationList.providedBy(field):
                remove_parent(obj, field)
                field.set(field.interface(obj), [])


counter = 0
for tx in plone._p_jar.db()._storage.iterator():
    for record in tx:
        counter += 1
        if counter % 100000 == 0:
            print datetime.datetime.now(), counter
            transaction.commit()
        obj = plone._p_jar[record.oid]
        if IUUID(obj, None) and IDexterityContent.providedBy(obj):
            if not plone.portal_catalog.unrestrictedSearchResults(UID=obj.UID()):
                remove_refs(obj)
                print '%r' % record.oid, obj

transaction.commit()

To clean up the IntIds I borrowed the code from collective.relationhelpers/api.py at 4241db5596dfa2ec5948ea2a2f43396f04a0c53d · collective/collective.relationhelpers · GitHub

I'm now running some analytics to see what's left and what caused that problem.
But it got rid of almost all binary data in the ZODB and shrunk my ZODB size by quite a bit.

Update:
I also had an attribute called event_information on some content, which basically stored the content of a zope lifecycle event?? I'm almost 100% positive this was custom code.

UPDATE 2

I can verify the issue now with a script and a naked Plone installation.

Environment:
Python 3.9.9
Plone 5.2.7

My buildout.cfg:

[buildout]
extends =
    http://dist.plone.org/release/5-latest/versions.cfg

parts =
    instance
    zopepy

[instance]
recipe = plone.recipe.zope2instance
user = admin:admin
http-address = 8081
eggs =
    Plone
    plone.app.mosaic


[zopepy]
recipe = zc.recipe.egg
eggs =
    ${instance:eggs}
interpreter = zopepy
scripts = zopepy

[versions]
zc.buildout = 2.13.6
setuptools = 51.3.3

Install:

$ python3 -m venv .
$ ./bin/pip install zc.buildout==2.13.6 setuptools==51.3.3
$ ./bin/buildout

Make sure you start with an empty Data.fs (delete the existing one if there is one).
Download and run the script from prove_relation_value_gc_issue.py · GitHub

$ ./bin/instance run prove_relation_value_gc_issue.py -s demo

It raises an error since there are relations left in the DB, and thus also an unwanted Document. Since they reference each other, they will never be removed from the DB.

Instead of running the script you can verify the issue manually as well.
Run:

./bin/zopepy /path/to/ZODB-5.6.0-py2.7.egg/ZODB/scripts/analyze.py var/filestorage/Data.fs

After removing the second document with the relation and packing the DB, those objects were still there:

...
z3c.relationfield.relation.RelationValue             1       162   0.0%  162.00
...
plone.app.contenttypes.content.Document              3      4332   0.1% 1444.00
...

The issue is in z3c.relationfield/event.py at 0.9.0 · zopefoundation/z3c.relationfield · GitHub
It sets the __parent__ attribute on relations via zope lifecycle events.

@zopyx I'm pretty sure you had content with relations and images and somewhere in that structure are "leftover" objects, which cannot be garbage collected.
