we are working on some upgrade steps for an add-on that make changes such as updating static resources and the interfaces used and provided by a Dexterity-based content type.
as part of the upgrade process we need to reindex all catalog items of this specific content type, and that can be tricky, as sites may have tens of thousands of object instances.
my main concerns here are memory usage, speed and resiliency.
this is the code we have right now:
def get_valid_objects(brains):
    """Generate a list of objects associated with valid brains."""
    for b in brains:
        try:
            obj = b.getObject()
        except KeyError:
            obj = None

        if obj is None:  # warn on broken entries in the catalog
            logger.warn(u'Invalid reference: {0}'.format(b.getPath()))
            continue
        yield obj
def reindex_news_articles(setup_tool):
    """Reindex News Articles to fix interfaces."""
    logger.info(u'Reindexing the catalog.')
    results = api.content.find(portal_type='collective.nitf.content')
    logger.info(u'Found {0} news articles'.format(len(results)))
    for obj in get_valid_objects(results):
        obj.reindexObject()
    logger.info('Done.')
I think we have already minimized any issues associated with possible catalog inconsistencies, and this upgrade step also seems to be very memory efficient, as we are using a generator to access the objects (we haven't tested this yet).
now, my concern is speed: is there any way to make this run faster using some transaction trick?
do you have different approaches to dealing with this kind of stuff?
So-called subtransactions are a standard way to control memory use in large transactions (internally, a subtransaction is nowadays implemented by a savepoint), especially in large transactions involving catalog operations. You call transaction.commit(1) to end a subtransaction (and start a new one).
Usually, modified objects cannot be freed from the (ZODB) cache until the transaction ends, and cache flushes are only performed at transaction boundaries. A subtransaction causes modifications to be saved to a temporary file, thus allowing the modified objects to be removed from the cache. At a subtransaction boundary, it is also checked whether the cache has grown too full and some objects should be flushed.
For catalog operations, you usually count the operations and commit a subtransaction every n operations, as in the sketch below.
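Here is a minimal sketch of that counting pattern, assuming a Zope/Plone environment; transaction.savepoint(optimistic=True) is the modern spelling of the old transaction.commit(1) subtransaction commit, and the batch size is illustrative:

import transaction

BATCH_SIZE = 1000  # illustrative; tune to your site's memory budget

def reindex_in_batches(catalog, brains):
    """Reindex objects, ending a subtransaction every BATCH_SIZE items."""
    for n, brain in enumerate(brains, start=1):
        catalog.catalog_object(brain.getObject())
        if n % BATCH_SIZE == 0:
            # end the current subtransaction so the ZODB cache can
            # evict the objects modified so far
            transaction.savepoint(optimistic=True)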
Do you really need to reindex all the indexes? If not, you can speed things up quite a bit by being more selective: catalog.catalog_object(obj, idxs=['some_index'])
If only an index needs to be updated, and not the catalog metadata, you can speed it up even more: catalog.catalog_object(obj, idxs=['some_index'], update_metadata=False)
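To make the difference concrete, a quick sketch of the three variants; the index name is just an example, and plone.api is assumed, as elsewhere in the thread:

from plone import api

catalog = api.portal.get_tool('portal_catalog')

# given some content object `obj`:

# full reindex: recomputes every index plus the metadata record
obj.reindexObject()

# selective: only the named index, but metadata is still updated
catalog.catalog_object(obj, idxs=['some_index'])

# selective and no metadata update: the cheapest option
catalog.catalog_object(obj, idxs=['some_index'], update_metadata=False)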
When reindexing lots of items I usually do a full commit every 20-100 items rather than savepoints. This reduces the chance that a single long-running transaction will conflict at the end. (So this tip is more about robustness than speed or resource use.) Keep in mind that this leaves the catalog with some items reindexed and some items not so you'll have to decide on a case-by-case basis whether that is a problem. I've occasionally had a site that was so busy with edits that I had to catch ConflictErrors and do retries for each batch.
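A sketch of that batching-with-retries approach; ConflictError comes from ZODB.POSException, and the batch size and retry count here are illustrative values, not taken from the thread:

import transaction
from ZODB.POSException import ConflictError

BATCH_SIZE = 50   # the advice above suggests 20-100
MAX_RETRIES = 3   # illustrative

def commit_batch(catalog, batch):
    """Reindex a batch and commit, redoing the whole batch on conflict."""
    for attempt in range(MAX_RETRIES):
        try:
            for obj in batch:
                catalog.catalog_object(
                    obj, idxs=['object_provides'], update_metadata=False)
            transaction.commit()
            return
        except ConflictError:
            # a concurrent write touched the same objects; undo our work
            # and redo the batch from scratch
            transaction.abort()
    raise RuntimeError(
        'batch still conflicting after {0} retries'.format(MAX_RETRIES))

Since each batch commits on its own, a failure partway through leaves the earlier batches committed; that is exactly the trade-off mentioned above.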
thank you very much, that's pretty interesting; I modified my method again and ended up with this:
def reindex_news_articles(setup_tool):
    """Reindex news articles to fix interfaces."""
    test = 'test' in setup_tool.REQUEST  # used to ignore transactions on tests
    logger.info(u'Reindexing the catalog.')
    catalog = api.portal.get_tool('portal_catalog')
    results = api.content.find(portal_type='collective.nitf.content')
    logger.info(u'Found {0} news articles'.format(len(results)))
    n = 0
    for obj in get_valid_objects(results):
        catalog.catalog_object(obj, idxs=['object_provides'], update_metadata=False)
        n += 1
        if n % 1000 == 0 and not test:
            transaction.commit()
            logger.info('{0} items processed.'.format(n))
    if not test:
        transaction.commit()
    logger.info('Done.')
one more question: as part of another upgrade step I also have to update the layout of some of those same objects. I was looking at the catalog, but it seems this information is not stored anywhere, right?
I'm doing that with a simple obj.setLayout('foo') and I want to double-check that it won't affect the catalog in any way (I think it doesn't).
It's of course possible that someone has added a custom index for getLayout to their site, but you're right that it's not typically indexed or included in the catalog metadata. I would probably mention the change in the changelog and let integrators deal with it themselves if they added such a custom index.
Ah, I did not know about this catalog_object; I'm always learning from you, David.
Its documentation says: "Wraps the object with workflow and accessibility information just before cataloging."
But catalog_object is not specifically needed just to be selective about which index you want to update. In what cases is it necessary to wrap the object before updating the index(es)?
it happens all the time: if you check the code of reindexObject() you'll see it calls catalog_object() internally; the main advantage of calling catalog_object() directly is that you don't have to look up the portal_catalog tool for every single object.
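For illustration, here is a paraphrase from memory of what the CMFCatalogAware mixin in Products.CMFCore roughly does; the real implementation has more details (uid computation, etag handling), so treat this as a sketch only:

from Products.CMFCore.utils import getToolByName

# simplified paraphrase of CMFCatalogAware.reindexObject; not the exact code
def reindexObject(self, idxs=[]):
    catalog = getToolByName(self, 'portal_catalog', None)  # tool lookup on every call
    if catalog is not None:
        # CatalogTool.reindexObject wraps the object in an indexable
        # wrapper and then delegates to catalog_object(obj, uid, idxs=idxs)
        catalog.reindexObject(self, idxs=idxs)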