Renaming catalog keywords

We are trying to reduce the number of keywords on a site that has a lot of content. Currently the site has 181,313 objects and 71,193 different keywords, which makes some operations on the site really slow.

We want to clean up the keywords like this:

  • normalize all keywords by removing extra spaces
  • remove all keywords used less than 5 times
  • transform all keywords to lowercase
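
For illustration, here is a minimal sketch of those three steps as plain Python, with no catalog involved. The names `normalize`, `plan_cleanup` and `subjects_by_object` are made up for this example; the input is assumed to be a mapping from some object id to its keyword tuple. It counts uses after normalizing, so variants like "Plone " and "PLONE" are pooled before the threshold is applied.

```python
from collections import Counter


def normalize(keyword):
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return " ".join(keyword.split()).lower()


def plan_cleanup(subjects_by_object, min_uses=5):
    """Return {object_id: cleaned_keywords} for objects that need an edit.

    subjects_by_object is a hypothetical mapping of object id -> keywords.
    """
    # First pass: normalize, drop empty strings and duplicates, keep order.
    normalized = {
        oid: tuple(dict.fromkeys(n for n in map(normalize, subjects) if n))
        for oid, subjects in subjects_by_object.items()
    }
    # Count uses after normalizing, so spelling variants are pooled.
    counts = Counter(k for subjects in normalized.values() for k in subjects)
    # Second pass: keep only keywords that reach the threshold.
    changes = {}
    for oid, subjects in normalized.items():
        kept = tuple(k for k in subjects if counts[k] >= min_uses)
        if kept != tuple(subjects_by_object[oid]):
            changes[oid] = kept
    return changes
```

Only the objects returned by `plan_cleanup` would need to be edited and reindexed; everything else can be left alone.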

To reduce the loss of useful keywords, I want to transform them to lowercase before removing the ones used less than 5 times. This will take a lot of time, though, because the only way I know to do it is to edit every single object and transform its keywords, which is going to be very inefficient:

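# `catalog` is assumed to be the site's portal_catalog tool.
keywords = catalog.uniqueValuesFor('Subject')
for k in keywords:
    # One catalog query per keyword.
    results = catalog(Subject=k)
    for b in results:
        obj = b.getObject()
        subject = list(obj.Subject())
        subject.remove(k)
        if k.lower() not in subject:
            # Avoid a duplicate when the lowercase form is already present.
            subject.append(k.lower())
        obj.setSubject(tuple(subject))
        catalog.catalog_object(obj, idxs=['Subject'], update_metadata=False)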

Is there an alternative for renaming keywords?

Well one thing I would do is check first if the keyword (k) is already lowercase.

Also, once you have an object you need to modify, turn all of its keywords into lowercase and remember the object's UID so you can skip it in a later iteration.
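
A rough sketch of both suggestions (skip keywords that are already lowercase, and visit each object only once), written against plain data so it can be tried outside a site. `objects_to_fix` and `uids_by_keyword` are hypothetical names; the mapping stands in for the per-keyword catalog lookups.

```python
def objects_to_fix(uids_by_keyword):
    """Yield each object uid once, only for keywords that are not lowercase.

    uids_by_keyword is a hypothetical stand-in for catalog lookups:
    it maps a keyword to the uids of the objects tagged with it.
    """
    seen = set()
    for keyword, uids in uids_by_keyword.items():
        if keyword == keyword.lower():
            continue  # already lowercase, nothing to fix for this keyword
        for uid in uids:
            if uid in seen:
                continue  # already handled in an earlier iteration
            seen.add(uid)
            yield uid
```

With `{"plone": ["u1", "u2"], "Plone": ["u3"], "News": ["u3", "u4"]}` this yields `u3` and `u4` once each and never touches `u1` or `u2`.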

Yes, now that you mention it, I think it is better to just iterate over all objects instead and convert all keywords to lowercase, like this:

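results = catalog()
for b in results:
    obj = b.getObject()
    keywords = list(obj.Subject())
    # Lowercase, dropping duplicates such as 'Foo' and 'foo' on the same object.
    lowercase = []
    for k in keywords:
        if k.lower() not in lowercase:
            lowercase.append(k.lower())
    if keywords == lowercase:
        continue  # nothing to change, so skip the reindex
    obj.setSubject(tuple(lowercase))
    catalog.catalog_object(obj, idxs=['Subject'], update_metadata=False)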

That has to be faster because I'm doing fewer catalog reindexes.

So, is there no other way? No hidden catalog methods for doing this?

There used to be Products.PloneKeywordManager. I don't know if it still works. It adds a control panel to do all the renaming stuff.

yes, I know, but it becomes unusable with such a high number of keywords.

we need to give some love to it by adding these functions there.

If you want to update all the keywords you're going to have to edit all the objects; no more easy shortcuts that I see.

It doesn't sound like there would be huge negative consequences if this gets committed in batches, so I would just commit every 100 items or so and let it take the time it takes.
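
Something along these lines, as a minimal sketch: `process_in_batches` is a hypothetical helper, and `handle` and `commit` are callables you pass in. In a Zope/Plone script `commit` would presumably be `transaction.commit`, but that wiring is an assumption and is left out so the helper runs on its own.

```python
def process_in_batches(items, handle, commit, batch_size=100):
    """Call handle(item) for each item, committing after every full batch.

    Returns the number of commits performed.
    """
    commits = 0
    pending = 0
    for item in items:
        handle(item)
        pending += 1
        if pending == batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        commit()  # flush the final partial batch
        commits += 1
    return commits
```

Committing per batch keeps each transaction small, and if the script dies halfway the batches already committed do not have to be redone.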

this gist shows how to normalize keywords:

this gist shows how to clean up keywords:

in my case it took around 15 minutes to run the first one and 20 minutes to run the second.

share and enjoy!

Slightly related: If you want to postpone/disable all catalog reindexing operations for a while (and do a clear and rebuild catalog at the end), there's collective.noindexing. I think its use case at the time was migrations from Plone 3 to 4.X with multiple full reindexes running.

But you have to know the details of what is happening: if you depend on subsequent catalog queries against the same indexes you're temporarily not updating in order to find content to process, it's not much use.