System Architecture: Handle a big system with lots of old data

Background: I have a Plone 4.3 system with many content types (~30, a mix of Archetypes and Dexterity) and complex references between them. To make searching and working with these objects easier, we index many fields in the catalog. Most objects have a lifetime of around one month; after that they are kept only as an archive for reference and statistics. The problem is that old objects keep accumulating (thousands more every week), which makes the system heavy: the catalog grows too large (reindexing portal_catalog can take hours) and overall performance suffers. We are looking for a way to keep the many old, archived objects from affecting catalog performance while still being able to access the data when needed. I have come up with some ideas:

  1. Export old data to a portable format (XML, JSON, CSV) and access it with old-school tools like Excel: it is hard to preserve data integrity and the references between different object types this way. There is a high chance of missing or losing data during export, and the data is difficult to access and manipulate later.
  2. Export old data to a portable format that can be imported back into the system later: this is even harder.
  3. Keep these objects in the system but somehow "unindex" them in the catalog. But how? I don't know how this could be done in Plone.
  4. [Preferred solution] Have an "archive system" running the same code as the production system, but connected to a huge database holding all data ever created. This archive system doesn't need to perform well; all it does is store ALL data in an easily accessible way. The production system would then work with a recent database (the last 2 years, for example). The question is how to partially sync the latest data from the production system to the archive system. At the moment we use ZODB with RelStorage on PostgreSQL.

Have you faced a similar problem? How did you deal with it? Do any of the above solutions make sense? I appreciate all comments and help.

You may move your old objects to a folder on a different mountpoint and perhaps create a dedicated catalog for this archive.

-aj

Is it possible to create a dedicated catalog for a folder? Could you point me to some documentation about this?

In theory you can have extra catalogs, but everything in the Plone UI uses its one catalog. There isn't a way, for example, to get Plone search to combine the results of two catalogs. You'd have to create some customisations for dealing with your single archive folder. We solved a large catalog problem like this in a Plomino app by splitting our archive into a separate Plomino app with a separate catalog; however, that is not regular Plone content, and we control the UI in that case, not Plone.
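As a rough idea of what such a customisation could look like, here is a minimal sketch that runs one query against several catalogs and merges the brains, dropping duplicates by path. It assumes each catalog offers the usual ZCatalog `searchResults(**query)` call and that brains offer `getPath()`; the function name `search_catalogs` is made up for illustration, and any sort order across catalogs is lost.

```python
def search_catalogs(catalogs, **query):
    """Run the same query against several catalogs and merge the brains.

    `catalogs` is any iterable of ZCatalog-like tools, for example
    portal_catalog plus a (hypothetical) archive catalog.
    """
    seen = set()
    merged = []
    for catalog in catalogs:
        for brain in catalog.searchResults(**query):
            path = brain.getPath()
            if path in seen:
                # Already returned by an earlier catalog; keep the first hit.
                continue
            seen.add(path)
            merged.append(brain)
    return merged
```

A custom search view could then call something like `search_catalogs([portal.portal_catalog, portal.archive_catalog], portal_type='Document')`, where `archive_catalog` is a hypothetical id. The stock Plone search would still know nothing about the second catalog.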

It would be an interesting use case to see if Plone can be made to use per-folder catalogs.

You might gain a lot more by using Elasticsearch or Solr, as that will move all the RAM and indexing CPU outside of each Plone instance, at least for the text index, which is likely your biggest one. I'm not sure about the real-world speedup and RAM reduction, however.

Only a little entry: in the Plone Dev Docs and Products.ZCatalog, of course.

I never had such issues, but I think @djay's proposal to use Elasticsearch or Solr makes a lot of sense for your use case.


We already have 2 extra catalogs that handle indexing for our 3 most complex content types, and that helps a lot already. All the complex searches are performed on these catalogs; portal_catalog only handles the Plone default stuff. The thing is, while old objects remain in the system, all 3 catalogs keep growing. So I think the idea of a dedicated catalog for a folder is excellent: I could move old objects to a folder and have them indexed by portal_catalog, while the 2 extra catalogs mentioned above would index only the recent, active objects. However, I have not found a way to achieve this yet.

About Elasticsearch: from my limited knowledge, it is very good at full-text search, but there is not much full-text search in our system. Most of the time we use faceted search, so I'm not sure whether Elasticsearch would help here.

You could conceivably override all indexers with new ones that first check whether the object is contained in your archive path, raise a DontIndex if so, and only delegate to the "normal" indexers for non-archived content.
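A minimal sketch of that idea, written as a plain wrapper around an existing indexer function. The archive location `ARCHIVE_PATH` and the helper names are hypothetical. Instead of a `DontIndex` exception, this sketch raises `AttributeError`, on the assumption that the catalog's indexes treat it as "no value for this object"; please verify that against your plone.indexer and ZCatalog versions. Note that it only skips the wrapped index values and does not by itself keep the object out of the catalog.

```python
import functools

# Hypothetical physical path of the archive folder.
ARCHIVE_PATH = ("", "Plone", "archive")


def is_archived(obj, archive_path=ARCHIVE_PATH):
    """Return True if obj is the archive folder or lives somewhere below it."""
    path = tuple(obj.getPhysicalPath())
    return path[:len(archive_path)] == tuple(archive_path)


def skip_archived(indexer_func, archive_path=ARCHIVE_PATH):
    """Wrap an indexer function so it yields no value for archived content."""
    @functools.wraps(indexer_func)
    def wrapper(obj):
        if is_archived(obj, archive_path):
            # Assumption: the index treats AttributeError as "nothing to index".
            raise AttributeError("archived content is not indexed")
        # Not archived: delegate to the "normal" indexer.
        return indexer_func(obj)
    return wrapper
```

Each existing indexer would be wrapped once, for example `my_title = skip_archived(original_title_indexer)`, and the wrapped version registered in its place.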


I assume that your "archive" objects are no longer modified. In this case, I would unindex them from the normal catalog[s] and index them instead in corresponding archive catalog[s]. In this way, your "normal" catalogs would remain rather small.

Your archive catalogs will grow, but they are likely used rarely. When used, they may contend with normal operations for ZODB cache space. If this becomes a problem, you could place your archive catalogs into a separate ZODB, thus giving them their own ZODB cache, which could be controlled independently of the "normal" one.

@cuongnda while you might not use the text index often, you will have one, as Plone requires it, and it's likely to be large considering all your content. During indexing it will take up CPU, as all indexes need to be updated. Any text search will bring that index into memory, where it competes with other content in the cache. Having a cache too small for your indexes is normally the reason behind slow operations. So either you reduce your indexes, perhaps by offloading some to Elasticsearch, or you increase your cache size. Having an additional catalog will likely only help if you separate it into its own mount so it has its own cache, so that when it's used it doesn't push other often-used objects out of your main cache (as @dieter mentioned).

@dieter What technique could be used to "unindex" an object in a specific folder from a catalog?

Following the current discussion, I assume that you put the objects to be archived explicitly into the special folder, likely via some script. In this script, you can call the object's "unindexObject" method in order to unindex the object from the "normal" catalog[s].

The base implementation of "unindexObject" comes from "Products.CMFCore.CMFCatalogAware.CatalogAware". It determines the catalog to be used (via a call to the method "_getCatalogTool") and calls its "unindexObject". In a Plone context, "unindexObject" comes from a different place and (likely) deals with several catalogs. However, the basics are likely comparable: determine the relevant catalogs and call their "unindexObject". If your archived objects should be indexed in archive catalogs, you can use this technique to index them there.
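A minimal sketch of that "determine the catalogs and call them" step, for use inside an archiving script. It assumes the catalogs are CMFCore-style catalog tools exposing `unindexObject(obj)` and `indexObject(obj)`; a plain ZCatalog would need `uncatalog_object` / `catalog_object` instead. The function name is made up for illustration.

```python
def move_to_archive_catalogs(obj, normal_catalogs, archive_catalogs):
    """Unindex obj from the normal catalogs and index it in the archive ones."""
    for catalog in normal_catalogs:
        # Drop the object's record from each "normal" catalog.
        catalog.unindexObject(obj)
    for catalog in archive_catalogs:
        # Add the object to each archive catalog.
        catalog.indexObject(obj)
```

In the script this might be called as `move_to_archive_catalogs(obj, [portal.portal_catalog], [portal.archive_catalog])`, with `archive_catalog` being a hypothetical id. As far as I know the catalog tool keys its records by the object's current physical path, so it is probably safest to call this once the object is at its final location in the archive folder.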
