Huge (and constantly increasing) DATA.FS

Hello,
We run a Plone 4.3 website used by a municipality to let citizens upload documents (basically PDF, DOC, JPG and plain text files), exchange data with the P.A., and so on.
It was originally developed on Plone 3 and then upgraded; the upgrade itself was done by the previous maintainer, so I don't know exactly what was done.
Average usage is about 1000 new "items" per day.
User files are stored in the filesystem, in each user's directory, while "common" files are stored as BLOBs.

The website itself didn't change much in the last months.
Despite this, DATA.FS is constantly increasing, at an average rate of 10-20 MB per minute!
I can't understand this behaviour, since AFAIK in Plone 3 all files were stored in the DATA.FS file, but starting from Plone 4 they should be stored only on the filesystem!

From the Zope Management Interface I can see that "portal_purgepolicy" is set to "-1" (no expiration), but does this also apply to user-uploaded content?

"Should be stored" or "are stored"? Did you verify that?

First I would look into the logs and check the POST/PUT requests and their size... check there for anomalies.

-aj
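
A minimal sketch of the log check suggested above, assuming the instance writes an access log in the common/combined log format (as Zope's Z2.log normally does). The function name and the sample path are hypothetical. Note that the size field in that format is the response size, not the upload size, so this sketch only counts how often each path receives a write request.

```python
import re
from collections import Counter

# Matches the quoted request part of a common/combined log line,
# e.g. "POST /Plone/some/path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)')


def summarize_writes(lines, methods=("POST", "PUT")):
    """Count write requests per (method, path) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("method") in methods:
            counts[(match.group("method"), match.group("path"))] += 1
    return counts


def top_writers(log_path, limit=20):
    """Return the most frequently written paths in one log file."""
    with open(log_path) as handle:
        return summarize_writes(handle).most_common(limit)
```

Calling `top_writers("var/log/instance-Z2.log")` (the path is an assumption, adjust it to your buildout) lists the paths that receive the most POST/PUT requests; a path that is hit every few seconds is a good candidate for whatever keeps writing to the database.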

"Plone 4 they should be stored only into the filesystem!"

Note: only when you're using plone.namedfile's NamedBlobImage and NamedBlobFile instead of NamedImage and NamedFile.

Just guessing. Unfortunately, several people worked on this project before and it lacks full documentation. :frowning:
Now I'm facing this problem, and I don't have much experience with Plone/Zope, so I'm trying to figure out where to investigate.
I still can't understand how Zope's modification history works: if I attach a file (PDF, JPG and so on) to my content and then modify the content itself 10 times, do I get 10 copies of the same file too? Where are these copies stored?

In recent Plone, there is a single copy of the file stored in the blob storage. The object attributes (description, dates, etc.) are duplicated with each version, but each version holds a pointer to the unchanged blob data.
Old object versions can be reclaimed by packing the ZODB.

In your Plone version? I have no idea, being a latecomer to Plone. My guess from reading old posts on the Net is that this changed with the migration from Archetypes to Dexterity. As your site comes from Plone 3, it was certainly Archetypes based. But was it migrated to Dexterity too when upgraded to Plone 4? You should be able to look at recently created objects in the ZMI: if the object types are ATxxxx, you are still using Archetypes.
If so, it's possible that you could reclaim a lot of space by migrating to Dexterity (see for example https://blog.niteo.co/dexterity-vs-archetypes/)

Unfortunately, for a big and constantly growing site it's easier said than done: migrating to Dexterity takes a very long time (it copies every object in the db to create a Dexterity version and deletes the Archetypes object afterwards) and can fail if invalid objects exist (if your db is old and has suffered many indignities, there are probably some). So if you have this opportunity take it, but stage and test it first!

I checked it out and found that in my Plone installation the objects are definitely of the AT type.
So I understand the benefits of converting them to Dexterity, but it could be a big issue since we have a huge database... and we can't afford to take the website down for a long time.

By the way, we would be happy just to stop the "uncontrolled" growth of DATA.FS... so, do you think that setting "portal_purgepolicy" to a fixed value could help?

It will do no harm... You can also take a look at portal_historiesstorage to get some statistics on CMF versions.

You should also check whether the ZODB is packed from time to time (that's a separate concern).

I have fired up my reference VM of Plone 4.3 and there is blob support for Archetypes. It seems it was added to Plone 3 as an option and became standard with Plone 4. Since yours was a migration, something may have gone wrong; however, I can't say whether it's even possible to still store your files in the ZODB in Plone 4. From what I have seen in the readme, it seems that if a plone.app.blob directory exists in your buildout-cache, blob support should be automatic, but I don't speak from experience. If any doubt remains, a simple check is to block user access, add a 'big' file and look at the size of the blobstorage directory: if it grows, it works :slight_smile:
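
A small sketch of that before/after check, using only the standard library. The paths `var/filestorage/Data.fs` and `var/blobstorage` in the usage note are assumptions based on a typical buildout layout, and the helper names are made up for illustration.

```python
import os


def tree_size(path):
    """Size in bytes of a file, or of all regular files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):  # skips broken symlinks
                total += os.path.getsize(full)
    return total


def snapshot(paths):
    """Map each path to its current size in bytes."""
    return dict((path, tree_size(path)) for path in paths)


def growth(before, after):
    """Bytes gained per path between two snapshots."""
    return dict((path, after[path] - before[path]) for path in before)
```

Take `before = snapshot(["var/filestorage/Data.fs", "var/blobstorage"])`, upload the big file, take a second snapshot and compare them with `growth(before, after)`. If blobs work, nearly all of the growth should show up under the blobstorage path; if Data.fs grows by roughly the file size instead, the file went into the ZODB.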

There is an online migration available; please take a look at https://github.com/plone/plone.app.blob#migrating-existing-content for further instructions.

We decreased the size of the Data.fs by adjusting the value of portal_purgepolicy and deleting the contents of portal_historiesstorage. It looks like Plone does not delete the old contents of the portal_historiesstorage tool when you set a different number in portal_purgepolicy.

Remember that when executing these steps you will delete all old versions of the content (those shown in the @@history view of each object), and you will not be able to revert to an older version.

Make a backup of your site before running this.

First of all, you need to start zope in debug mode:

$ ./bin/instance debug

Then execute the following statements. Here "Plone" is the id of the Plone site; replace it with the proper id.

>>> from zope.site.hooks import setSite
>>> setSite(app.Plone)
>>> app.Plone.portal_historiesstorage._shadowStorage._storage.clear()
>>> app.Plone.portal_historiesstorage.zvc_repo._histories.clear()
>>> import transaction
>>> transaction.commit()

Then exit with Ctrl-D.

Then you will need to pack the Data.fs to actually remove the objects and decrease its size:

$ ./bin/zeopack -D 0

We took this idea from here: https://github.com/plone/Products.CMFEditions/issues/28#issuecomment-113591175

To investigate the copied histories further in a more webmaster-friendly way, check out and install the add-on collective.revisionmanager. Install it in a copy of your site (don't test this in production), clean up the versions, do a zeopack and see if your database decreases (a lot) in size.

Just changing the maximum number of copies to keep in portal_purgepolicy doesn't help. Maybe an individual content item's history is truncated when that item is saved, but you'd want to do a full cleanup. Another important point is that content deleted from the actual site is 'orphaned' in the historiesstorage until you remove it there (with c.revisionmanager; you can also see the statistics of deleted items in the ZMI in the portal_historiesstorage overview).

That said, I doubt that the original poster's problem of a database growing 10-20 MB a minute is primarily caused by versioning; as already suggested, I'd check logs, requests etc. first.
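
A back-of-the-envelope check of why that rate looks suspicious, using only the figures from the original post (10 to 20 MB per minute, about 1000 new items per day) and assuming the growth is spread evenly over the day. The function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def implied_mb_per_item(mb_per_minute, items_per_day):
    """Data.fs growth in MB per new item, if new items caused all of it."""
    return mb_per_minute * MINUTES_PER_DAY / float(items_per_day)


for rate in (10, 20):
    daily_gb = rate * MINUTES_PER_DAY / 1000.0
    per_item = implied_mb_per_item(rate, 1000)
    print("%d MB/min is about %.1f GB/day, or %.1f MB per new item"
          % (rate, daily_gb, per_item))
```

That is roughly 14 to 29 MB of Data.fs growth per new item, which is hard to explain with version metadata alone if the files themselves really end up in blob storage.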

What could also help with finding the cause is taking a few Python traceback dumps of all the threads in the running Zope process at short intervals. Sending the USR1 signal dumps the stack traces to standard output in the latest Zopes. That might give you a clue about where in the code the activity is occurring. See Martijn Pieters' answer at: https://stackoverflow.com/questions/1032813/dump-stacktraces-of-all-active-threads

This technique once helped me figure out why a popup directory listing to select content took more than 10 seconds for some folders: the server-side view was reverse-pulling all words for every item from the SearchableText index in portal_catalog. Cool with a folder full of indexed PDFs. The only thing in the tracebacks every 1-2 seconds during the run would be certain catalog functions.

It's a bit crude as a means of profiling, but it works without setting up something like Products.ZopeProfiler.
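
For reference, the general idea behind such a dump can be sketched with the standard library alone. This is not Zope's own handler, just an illustration of the technique; the function names are hypothetical, and SIGUSR1 is only available on Unix-like systems.

```python
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return the current stack trace of every thread as one string."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    chunks = []
    # sys._current_frames() maps thread id -> topmost frame of that thread
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (ident, names.get(ident, "unknown")))
        chunks.append("".join(traceback.format_stack(frame)))
        chunks.append("\n")
    return "".join(chunks)


def install_dump_handler(signum=None):
    """Write all thread stacks to stderr whenever the signal arrives."""
    if signum is None:
        signum = signal.SIGUSR1  # Unix only
    signal.signal(
        signum,
        lambda sig, frm: sys.stderr.write(format_thread_stacks()),
    )
```

After `install_dump_handler()` has run in the process, sending `kill -USR1 <pid>` a few times in a row and comparing the dumps shows which functions keep turning up.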
