We are trying to reduce the number of keywords on a site that has a lot of content. Currently the site has 181,313 objects and 71,193 different keywords, which makes some operations on the site really slow.
We want to clean up the keywords like this:
normalize all keywords removing spare spaces
remove all keywords used less than 5 times
transform all keywords to lowercase
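The three clean-up steps above can be sketched on plain Python data, independent of the catalog. This is only an illustration: `normalize`, `clean_subjects`, and the `subjects_by_object` mapping are hypothetical names, and the sketch lowercases before counting so that variants such as "Plone" and "plone" are counted together.

```python
from collections import Counter


def normalize(keyword):
    """Collapse runs of whitespace, strip both ends, and lowercase."""
    return " ".join(keyword.split()).lower()


def clean_subjects(subjects_by_object, min_uses=5):
    """Return cleaned keyword tuples, keyed like the input mapping.

    subjects_by_object is a hypothetical stand-in for the site content:
    a dict mapping an object id to its tuple of keywords.
    """
    # First pass: normalize every keyword and drop per-object duplicates,
    # preserving the original order.
    normalized = {}
    for oid, subjects in subjects_by_object.items():
        seen = []
        for keyword in subjects:
            cleaned = normalize(keyword)
            if cleaned and cleaned not in seen:
                seen.append(cleaned)
        normalized[oid] = seen

    # Count how many objects use each normalized keyword.
    usage = Counter(k for subjects in normalized.values() for k in subjects)

    # Second pass: keep only keywords used at least min_uses times.
    return {
        oid: tuple(k for k in subjects if usage[k] >= min_uses)
        for oid, subjects in normalized.items()
    }
```

On a real site the resulting tuples would still have to be written back with `setSubject` and reindexed, object by object.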
To reduce the loss of useful keywords, I want to lowercase everything before removing the keywords used fewer than 5 times. But this will take a long time, because the only way I know to do it is to edit every single object and transform its keywords, which is going to be very inefficient:
    keywords = catalog.uniqueValuesFor('Subject')
    for k in keywords:
        results = catalog(Subject=k)
        for b in results:
            obj = b.getObject()
            subject = list(obj.Subject())
            subject.remove(k)
            subject.append(k.lower())
            obj.setSubject(tuple(subject))
            catalog.catalog_object(obj, idxs=['Subject'], update_metadata=False)
Yes, now that you mention it, I think it is better to just iterate over all objects and convert all their keywords to lowercase, like this:
    results = catalog()
    for b in results:
        obj = b.getObject()
        keywords = list(obj.Subject())
        lowercase = [k.lower() for k in keywords]
        if keywords == lowercase:
            continue
        obj.setSubject(tuple(lowercase))
        catalog.catalog_object(obj, idxs=['Subject'], update_metadata=False)
That has to be faster because I'm doing fewer catalog reindexes.
So, is there no other way? No hidden catalog methods for doing this?
If you want to update all the keywords you're going to have to edit all the objects; no more easy shortcuts that I see.
It doesn't sound like there would be huge negative consequences if this gets committed in batches, so I would just commit every 100 items or so and let it take the time it takes.
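A minimal sketch of that batching, assuming the same loop as in the question. The function name `lowercase_in_batches` and its `reindex` and `commit` parameters are hypothetical. They are passed in as callables so the sketch does not depend on a particular catalog setup.

```python
def lowercase_in_batches(brains, reindex, commit, batch_size=100):
    """Lowercase keywords on each object, committing every batch_size changes.

    brains  -- iterable of catalog brains (anything with getObject())
    reindex -- callable taking an object; assumed to reindex 'Subject'
    commit  -- callable with no arguments, e.g. transaction.commit
    Returns the number of objects that were changed.
    """
    changed = 0
    for brain in brains:
        obj = brain.getObject()
        keywords = list(obj.Subject())
        lowercase = [k.lower() for k in keywords]
        if keywords == lowercase:
            # Nothing to change, so skip the write and the reindex.
            continue
        obj.setSubject(tuple(lowercase))
        reindex(obj)
        changed += 1
        if changed % batch_size == 0:
            commit()
    # Commit whatever is left from the last partial batch.
    if changed % batch_size:
        commit()
    return changed
```

On a real site, `commit` would presumably be `transaction.commit`, and `reindex` a small wrapper around the `catalog.catalog_object(obj, idxs=['Subject'], update_metadata=False)` call from the question.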
Slightly related: If you want to postpone/disable all catalog reindexing operations for a while (and do a clear and rebuild catalog at the end), there's collective.noindexing. I think its use case at the time was migrations from Plone 3 to 4.X with multiple full reindexes running.
But you have to know the details of what is happening: if you depend on subsequent catalog queries against those same indexes, which you're temporarily not updating, to find the content to process, it's not much use.