Any way to track down the content object a blobstorage asset belongs to? (RelStorage)

Hi all

I've got some assets in my blobstorage directory that are quite large, e.g. 120MB image files. I'd like to track these images down in the CMS so I can replace them with better-compressed versions. Is there any way this could be achieved?

I'm also running into an issue where either the blobs are being duplicated by some bug, or someone has uploaded the same asset into the CMS twice, e.g.:

neil:0xfa neil$ ls -lah
total 228888
drwxr-xr-x 4 neil staff 128B 5 Aug 08:59 .
drwxr-xr-x 11 neil staff 352B 5 Aug 08:59 ..
-rwxr-xr-x 1 neil staff 6B 5 Aug 08:59 .lock
-rwxr-xr-x 1 neil staff 112M 5 Aug 08:59 0x03bcb5e9d8a1e288.blob
neil:0xfa neil$ md5 0x03bcb5e9d8a1e288.blob
MD5 (0x03bcb5e9d8a1e288.blob) = eb49f2caf865671f87409e8c72ac7c54

neil:0xfa neil$ cd ..
neil:0x2d neil$ cd 0xbd
neil:0xbd neil$ ls -lah
total 228888
drwxr-xr-x 4 neil staff 128B 5 Aug 08:59 .
drwxr-xr-x 11 neil staff 352B 5 Aug 08:59 ..
-rwxr-xr-x 1 neil staff 6B 5 Aug 08:59 .lock
-rwxr-xr-x 1 neil staff 112M 5 Aug 08:59 0x03bcb5e9d8a1e288.blob
neil:0xbd neil$ md5 0x03bcb5e9d8a1e288.blob
MD5 (0x03bcb5e9d8a1e288.blob) = eb49f2caf865671f87409e8c72ac7c54

Thanks :slight_smile:

There is a size index in the portal catalog. You can make a collection show you the largest items, as long as they are File objects.
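If you prefer a script over a collection, something along these lines should work. This is a rough sketch for bin/instance run; the site id 'Plone' and the Dexterity 'file' field are assumptions, adjust to your setup:

    # Sketch: list the largest File objects, largest first.
    # Assumes `app` from ``bin/instance run``, a site with id 'Plone' and
    # Dexterity File content storing data in a ``file`` field (plone.namedfile).
    from zope.component.hooks import setSite

    site = app.Plone
    setSite(site)
    catalog = site.portal_catalog

    results = []
    for brain in catalog.unrestrictedSearchResults(portal_type='File'):
        obj = site.unrestrictedTraverse(brain.getPath())
        blob_field = getattr(obj, 'file', None)
        if blob_field is not None:
            results.append((blob_field.getSize(), brain.getPath()))

    # Print the 20 largest files, biggest first.
    for size, path in sorted(results, reverse=True)[:20]:
        print('{0:>14} bytes  {1}'.format(size, path))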

@djay That's the quick and simple solution if your Data.fs and blobstorage are in sync. But I've been wondering about the same thing a few times at a much lower ZODB level.

I have seen a few blobstorages over the years where I had the suspicion that some of the blob files present were no longer related to any record in the ZODB, due to whatever corruption or bugs. Is there a straightforward way to open a Data.fs in a Python script, scan through all records, and log which blobstorage files should be present? Afterwards one could scan all blobstorage files actually present and prune the orphans.
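Something like this is what I have in mind, just to illustrate the idea (an untested sketch; the path and the blob-detection heuristic are guesses):

    # Untested sketch: walk every data record in a Data.fs and note which ones
    # look like ZODB.blob.Blob pickles, so the expected blob files could then
    # be compared against what is actually on disk.
    from ZODB.FileStorage import FileStorage
    from ZODB.utils import oid_repr

    storage = FileStorage('var/filestorage/Data.fs', read_only=True)

    blob_oids = set()
    for txn in storage.iterator():
        for record in txn:
            # Crude heuristic: the pickle of a Blob references the ZODB.blob module.
            if record.data and b'ZODB.blob' in record.data:
                blob_oids.add(record.oid)

    print('%d records look like blobs' % len(blob_oids))
    for oid in sorted(blob_oids):
        print(oid_repr(oid))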


This checks (and can also delete) the references between the ZODB and the blobstorage.

A simple and low-tech way to work on this issue could be to start the FTP server and download the whole website to the filesystem. Then you can work with file names and folder names instead of references.
You can also use all the command-line tools on the filesystem.

To find duplicates there is an "fdupes" command. It works well for images and documents that have been uploaded multiple times.
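If you'd rather stay in Python, here is a quick sketch of the same idea, grouping blob files by checksum (the blobstorage path is just an example):

    # Group blob files by MD5 checksum and report groups with more than one member.
    import hashlib
    import os
    from collections import defaultdict

    by_hash = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk('var/blobstorage'):
        for name in filenames:
            if not name.endswith('.blob'):
                continue
            path = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(path, 'rb') as fh:
                # Read in chunks so huge blobs don't end up in memory at once.
                for chunk in iter(lambda: fh.read(1024 * 1024), b''):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)

    for checksum, paths in by_hash.items():
        if len(paths) > 1:
            print(checksum)
            for path in paths:
                print('  ' + path)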

I wrote a script that does that. It scans through all the blob files and builds their "paths" in the ZODB by walking the back references. This way one can identify which content item each blob belongs to. Example:

INFO    [15:21:11] diagnose_blobs: Processing blob 164224 of 164224 (100%)...
Blob path: REDACTED/var/blobstorage/0x00/0x00/0x00/0x00/0x01/0x7c/0x15/0x64/0x03ca32f679680244.blob
Blob hash: (d41d8cd98f00b204e9800998ecf8427e,0)
oid: 0x017c1564
id: None
obj: <ZODB.blob.Blob object at 0x7f5df0e6f140>
path: None
oid_path: 0x00/0x01/0x11/0x0c486e/0x0c488b/0x105ff7/0x0c4887/0x0c4a28/0x0f45da/0x017c141c/0x017c1440/0x017c1530/0x017c1561/0x017c1564
id_path: Root/Zope/Plone/FOLDER0/_tree/None/FOLDER1/_tree/_firstbucket/NEWS_ITEM0/__annotations__/None/None/_blob

It was not straightforward for me, because I had never worked with the ZODB at such a low level. Maybe I made things harder than they needed to be, because I was learning as I went.

I was very impressed because my DB had about 160k blobs, and none of them was there without a reason. Kudos to the ZODB folks!

On the other hand, I learned that image scales had accumulated, because the site makes heavy use of them. At the moment I'm writing a script to run periodically and remove old image scales.

Anyway, if there is enough interest I can polish the script and publish it. There are some ad-hoc things that I must remove first.
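The core of it is not much more than building a back-reference map and walking it, roughly like this (a heavily simplified sketch, not the actual script, and assuming a plain FileStorage with placeholder paths):

    # Build a map of which OIDs reference which, then walk it backwards from a
    # blob's OID towards the root to get an "oid path" like the one above.
    from collections import defaultdict

    from ZODB.FileStorage import FileStorage
    from ZODB.serialize import referencesf
    from ZODB.utils import oid_repr, z64

    storage = FileStorage('var/filestorage/Data.fs', read_only=True)

    back_refs = defaultdict(set)
    for txn in storage.iterator():
        for record in txn:
            if not record.data:
                continue
            for ref in referencesf(record.data):
                back_refs[ref].add(record.oid)

    def oid_path(oid):
        """Follow back references from ``oid`` towards the root (OID 0x00)."""
        path = [oid]
        while oid != z64 and oid in back_refs:
            # Arbitrarily pick one referrer; real data can have several parents.
            oid = sorted(back_refs[oid])[0]
            if oid in path:
                break  # guard against reference cycles
            path.append(oid)
        return '/'.join(oid_repr(o) for o in reversed(path))

    print(oid_path(b'\x00\x00\x00\x00\x01\x7c\x15\x64'))  # example blob OID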


That's really neat @rafaelbco - Any chance you could share a copy of that script?

It could be you have lots of blobs because of versioning; do you remember this bug?

I just found another one that could be related:

Coming soon... :wink:

Any update on this?

Something in my application is getting out of control, and my blobstorage directory is about 300GB by now.

I'm sorry, I haven't had time to clean up and publish my script yet, but in the meantime I found this: https://github.com/minddistrict/mdtools.relstorage

I've just come to realize that plone.scale will regenerate its scaled images whenever an object is modified, but the old blobstorage files are never removed.

As per https://github.com/plone/plone.scale/blob/master/plone/scale/storage.py#L182

When we detect that an object has been modified we re-scale the image and store the new scaled object. Nowhere in this mix are the old scale blobs removed, meaning that the size of the ZODB will keep growing over time.

How are people handling this scenario?

A possible solution is to only store the scales in a RAM cache:

Another solution is to clear the generated images periodically.

Unfortunately I don't have working code available for either of those solutions.

You have to pack the ZODB if you don't need the old objects. You cannot just remove the scales, because if you do an undo you have to be able to go back to an object that still has its scales.

You can use a mounted ZODB, where you store the frequently changing images, with a policy to pack it every day.
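For a plain filestorage mount that could be as simple as a small cron script (a sketch only, with placeholder paths; with RelStorage you would use zodbpack and its config file instead):

    # Pack a filestorage mount, keeping one day of history.
    # FileStorage is single-process: run this with the instance stopped,
    # or pack through the running instance/ZMI instead.
    from ZODB.DB import DB
    from ZODB.FileStorage import FileStorage

    storage = FileStorage('var/filestorage/images.fs', blob_dir='var/blobstorage-images')
    db = DB(storage)
    db.pack(days=1)  # drop non-current object revisions older than one day
    db.close()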

Partly because of this thread I started collecting some of our scripts and put them in a separate repo on our public GitHub account: https://github.com/zestsoftware/plonescripts/blob/master/purge_image_scales.py

It's a bit tricky license-wise: I also wanted to put other scripts there that we've been using but that were created by others, and the licensing of those isn't really clear. But at least the purge-image-scales script that @mauritsvanrees created is there now.

One warning with purging image scales: if you have a proxy cache (like Varnish) running in front of your site, you might also have to invalidate it when you throw away all the scales, depending on your caching settings. For example, if your content pages are cached in the proxy for an hour (with 'unique' paths like path/to/image.jpg/@@images/a8e66ec4-e2e7-4cc7-aa65-e785a7ba3dd15.jpeg) but your content images/files are revalidated against the backend, you will get a lot of 404s on images for a while. I've seen this happen 2-3 times 'in the wild', but I still have to recreate it in a controlled setting. :slight_smile:

@rafaelbco Still interested in your blobstorage scanning script. I quickly scanned the mdtools.relstorage documentation, but I didn't see in the docs that it can detect 'stale' blob files.


The script removes every annotation on an object:

    for brain in catalog.unrestrictedSearchResults():
        [...]
        ann = AnnotationStorage(obj)
        [...]
        for key, value in ann.items():
            if value['modified'] < final_date.millis():
                # This may easily give an error, as it tries to remove
                # two keys: del ann[key]
                del ann.storage[key]

No, the AnnotationStorage here is from plone.scale: from plone.scale.storage import AnnotationStorage. So it only iterates over scale annotations.


I've just made public a script which scans the blobs and prints information about each one, along with other useful code:

It was tested on a Plone 4.3 instance using filestorage/blobstorage, i.e. no RelStorage. I think it will work with RelStorage, or at least can be adapted to.

@neilf @fredvd sorry for taking so long. If you need help using it I'll be glad to help.

As I said before:

It was not straightforward for me, because I had never worked with the ZODB at such a low level. Maybe I made things harder than they needed to be, because I was learning as I went.


Update to this:
I've spent some time digging into this further, and it turns out that zodbpack on RelStorage was in fact removing old scales of objects. There was a time when my zodbpack config didn't have the blob directory specified, so it removed old transactions but not their corresponding blob files. As a result, I had about 40GB of blob files that didn't belong to any object in my CMS.

I have migrated about 60GB of my blobstorage files to S3 and updated the loadBlob method of RelStorage's blobhelper to restore the files it actually needs from S3 instead, meaning my blobstorage directory is now free of blob files that don't belong to any object in the database. I've managed to get my blobstorage directory down from 60GB to 14GB, which is a huge relief.

It looks nice!
I am trying to give it a spin but I see:

Version and requirements information containing funcsigs:
  [versions] constraint on funcsigs: 1.0.2
  Requirement of rbco.caseclasses: funcsigs<=0.4.999,>=0.4
While:
  Installing instance.
Error: The requirement ('funcsigs<=0.4.999,>=0.4') is not allowed by your [versions] constraint (1.0.2)

I will make a PR to https://github.com/rafaelbco/rbco.caseclasses/ if I am able to fix this :slight_smile: