Best practices on reindexing the catalog

We are working on some upgrade steps for an add-on that make changes to things like static resources and the interfaces used and provided by a Dexterity-based content type.

As part of the upgrade process we need to reindex all catalog items for this specific content type, and that can be tricky, as sites may have tens of thousands of object instances.

My main concerns here are memory usage, speed, and resiliency.

This is the code we have right now:

import logging

from plone import api

logger = logging.getLogger(__name__)


def get_valid_objects(brains):
    """Generate a list of objects associated with valid brains."""
    for b in brains:
        try:
            obj = b.getObject()
        except KeyError:
            obj = None

        if obj is None:  # warn on broken entries in the catalog
            logger.warning(u'Invalid reference: {0}'.format(b.getPath()))
            continue
        yield obj


def reindex_news_articles(setup_tool):
    """Reindex News Articles to fix interfaces."""
    logger.info(u'Reindexing the catalog.')
    results = api.content.find(portal_type='collective.nitf.content')
    logger.info(u'Found {0} news articles'.format(len(results)))
    for obj in get_valid_objects(results):
        obj.reindexObject()
    logger.info('Done.')

I think we have already minimized any issues associated with possible catalog inconsistencies, and this upgrade step also seems to be memory efficient, as we are using a generator to access the objects (we haven't tested this yet).

Now my concern is speed: is there any way to make this run faster using some transaction trick?

Do you have different approaches to dealing with this kind of thing?

Calling transaction.savepoint(optimistic=True) after you reindex a batch of 1000 objects should reduce memory usage.

There are some examples in p.a.upgrade, e.g:


So-called subtransactions are a standard way to control memory use in large transactions (internally, a subtransaction is nowadays implemented as a savepoint), especially large transactions involving catalog operations. You call transaction.commit(1) to end a subtransaction (and start a new one).

Usually, modified objects cannot be freed from the (ZODB) cache until the transaction ends, and cache flushes are only performed at transaction boundaries. A subtransaction causes modifications to be saved to a temporary file, thus allowing the modified objects to be removed from the cache. At a subtransaction boundary, it is also checked whether the cache has grown too full and some objects should be flushed.

For catalog operations, you usually count the operations and insert a subtransaction commit every n operations.


Thank you for your comments, that was what I was looking for. I ended up with this code:

import transaction


def reindex_news_articles(setup_tool):
    """Reindex news articles to fix interfaces."""
    logger.info(u'Reindexing the catalog.')
    results = api.content.find(portal_type='collective.nitf.content')
    logger.info(u'Found {0} news articles'.format(len(results)))
    n = 0
    for obj in get_valid_objects(results):
        obj.reindexObject()
        n += 1
        if n % 1000 == 0:
            transaction.savepoint(optimistic=True)
            logger.info('{0} items processed.'.format(n))

    transaction.savepoint(optimistic=True)
    logger.info('Done.')

From your responses I also deduced there's no way to speed up the process (transaction.savepoint() will actually make it a little slower).

Do you really need to reindex all the indexes? If not you can speed up quite a bit by being more selective: catalog.catalog_object(obj, idxs=['some_index'])

If it is only an index that needs to be updated and not catalog metadata you can speed up even more like this: catalog.catalog_object(obj, idxs=['some_index'], update_metadata=False)

When reindexing lots of items I usually do a full commit every 20-100 items rather than savepoints. This reduces the chance that a single long-running transaction will conflict at the end. (So this tip is more about robustness than speed or resource use.) Keep in mind that this leaves the catalog with some items reindexed and some items not so you'll have to decide on a case-by-case basis whether that is a problem. I've occasionally had a site that was so busy with edits that I had to catch ConflictErrors and do retries for each batch.
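The batch-commit-with-retry approach described above can be sketched in plain Python. The helper name process_in_batches and all of its parameters are illustrative, not part of any Plone API; in a real upgrade step you would pass transaction.commit as commit, transaction.abort as abort, and ZODB.POSException.ConflictError as conflict_exc:

```python
def process_in_batches(items, handle, commit, abort=lambda: None,
                       batch_size=100, retries=3, conflict_exc=Exception):
    """Process items in batches, committing after each batch.

    On a conflict the batch is aborted and re-run from the start;
    this assumes ``handle`` is idempotent (reindexing the same
    object twice is harmless).
    """
    def flush(batch):
        for attempt in range(retries):
            try:
                for item in batch:
                    handle(item)
                commit()
                return
            except conflict_exc:
                abort()  # roll back the failed batch before retrying
                if attempt == retries - 1:
                    raise  # give up after the last attempt

    processed = 0
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            flush(batch)
            processed += len(batch)
            batch = []
    if batch:  # commit the final, possibly short, batch
        flush(batch)
        processed += len(batch)
    return processed
```

With a small batch size (20-100, as suggested above) each batch lands independently, so a conflict late in the run no longer throws away all the earlier work; the trade-off is the partially reindexed catalog state mentioned above.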


Thank you very much, that's pretty interesting. I modified my method again and ended up with this:

import transaction


def reindex_news_articles(setup_tool):
    """Reindex news articles to fix interfaces."""
    test = 'test' in setup_tool.REQUEST  # used to skip commits during tests
    logger.info(u'Reindexing the catalog.')
    catalog = api.portal.get_tool('portal_catalog')
    results = api.content.find(portal_type='collective.nitf.content')
    logger.info(u'Found {0} news articles'.format(len(results)))
    n = 0
    for obj in get_valid_objects(results):
        catalog.catalog_object(obj, idxs=['object_provides'], update_metadata=False)
        n += 1
        if n % 1000 == 0 and not test:
            transaction.commit()
            logger.info('{0} items processed.'.format(n))

    if not test:
        transaction.commit()
    logger.info('Done.')

One more question: as part of another upgrade step I also have to update the layout of some of those same objects. I was looking at the catalog, but it seems this information is not stored anywhere, right?

I'm doing that with a simple obj.setLayout('foo') and I want to double-check whether that will affect the catalog in any way (I think it doesn't).

It's of course possible that someone has added a custom index of getLayout to their site, but you're right that it's not typically indexed or included in catalog metadata. I would probably mention the change in the changelog and let integrators deal with it themselves if they added that custom index.

This is great stuff. Someone cough @hvelarde cough could add it to the docs...


Come on @hvelarde where's the pull request? or at least the new issue? :slight_smile:

I'll try to update the docs tomorrow :wink:

Done, now it's up to you:

UPDATE: it's merged now on 3 branches.


Thanks @hvelarde – you make me proud, even when I want to strangle you :wink:


Ah, I did not know about this catalog_object – am always learning from you, David :slight_smile:

Wraps the object with workflow and accessibility information just before cataloging.

But catalog_object is not specifically needed to be selective about the index you want to update. In what cases is it necessary to wrap the object before updating the index(es)?

It happens all the time: if you check the code of reindexObject() you'll see it internally calls catalog_object(); the main advantage is that you don't have to look up the portal_catalog tool for every single object.

OK I see that calling catalog_object directly skips a bit of logic at https://github.com/zopefoundation/Products.CMFCore/blob/master/Products/CMFCore/CatalogTool.py#L312

Good to know that I am not the only one to die that way

:v:

-aj
