Potential memory corruption during migration Plone 4.2 --> 5.2

I am currently analyzing problems observed in a huge migration from Plone 4.2 to 5.2. The problems look as follows: a search is performed (successfully), but the later iteration over the resulting brain sequence fails with a KeyError: the search has apparently returned a catalog id that the catalog does not know about. That could be explained by an index inconsistency; however, in that case the problem should occur deterministically - which it does not: the same search against the same catalog, followed by the same iteration, usually succeeds and fails only occasionally.

For the moment, I assume a memory corruption causing a non-deterministic change of the catalog id.

The migrated system is huge - with dozens of external extensions implemented in C. Any of them might be responsible for a potential memory corruption. However, I could not find symptoms in the Plone 4.2 installation; the problem is apparently new in the Plone 5.2 setup. In principle, the new Plone 5.2 setup should use the same external extension versions as the Plone 4.2 setup (not yet verified); if so, some of the C extensions coming with Plone 5.2/Zope 4 might be responsible.

I will keep you informed.

This is not the case: the versions of some important C extensions were not pinned and changed between the two setups. Maybe those new versions have introduced problems.

After migrating from Zope 2 to Zope 4 I encountered the following errors:

Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 156, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 338, in publish_module
  Module ZPublisher.WSGIPublisher, line 256, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module ZPublisher.WSGIPublisher, line 62, in call_object
  Module Products.StatisticsCenter.StatisticsCenter, line 146, in runStatsGatherJobsMethod
  Module ZTUtils.Lazy, line 201, in __getitem__
  Module Products.ZCatalog.Catalog, line 128, in __getitem__
KeyError: -1474289243

I cleared the index and reindexed - then it worked... for maybe a day without errors.

Then I saw the same errors again.

Then I cleared/reindexed it again, and from then on I have not encountered any more errors.

My app is a low traffic intranet site.

I do not know which libraries use a C extension... but I could have a look if that is of interest to you.

Thank you for the feedback. The symptom in your description is the same as in my problem. However, it would be important to know whether in your case (as in mine) the problem occurs non-deterministically, i.e. the same request sometimes fails and sometimes succeeds without intervening reindexing.

The symptom is normal for (some types of) index inconsistencies, but it is grave when it happens non-deterministically (because then the index is not simply inconsistent).

I updated my post, so maybe it is clearer.

The request is triggered by a cronjob via XML-RPC.

After updating from Zope 2 to 4 I saw KeyErrors in my log, so I thought maybe there was bad timing between the cron jobs and the Zope 4 update, so let's clear the index and reindex.

This seemingly helped... the request could be triggered manually without an error.

Then, a day or two later, I saw the very same error again in my logfile.

I decided to clear/reindex again, executed the request manually, it worked, and then I wanted to take some time to debug the problem.

But after the second round of "clear/reindex" the problem never happened again. So I stopped debugging/thinking about it.

I just had a look in my logfile... The error is back again :frowning:

jugmac00 via Plone Community wrote at 2020-2-13 10:13 +0000:

I just had a look in my logfile... The error is back again :frowning:

Then you could try whether the error is persistent (i.e. starting
from one point in time all following requests fail)
or non-deterministic (requests fail sometimes and sometimes succeed).

Should you see non-deterministic behaviour, we likely have the
same problem.

The C extensions with different versions in my Plone 4.2 and 5.2
setups are "JPype1", "M2Crypto" and "cryptography".

Thank you for the explanation.

My error is persistent/deterministic. So it is not the same problem.

Could you give me a hint about possible reasons and how to debug or how to fix my problem?

As this only affects one index, and as there are only < 100 instances indexed, I could skip the catalog and get the instances directly from the db. But actually I'd like to know what is going wrong. There were no problems on Zope 2; they only appeared with Zope 4.

I am grateful for any hints.

I have reproducible ZCatalog corruptions as part of an ongoing huge migration in which a Plone site is rebuilt from scratch. Still no idea why... the workaround is to reindex the whole site as part of the last migration phase.

In union.cms I get the following exception in a FieldIndex; it appears in a ZODB that was newly created with Zope 4 on Python 3. It is persistent. It goes away when reindexing the whole catalog, but new errors appear for newly added content. Thus my exception seems to have nothing to do with Dieter's.

Traceback (most recent call last):
  Module ZPublisher.WSGIPublisher, line 156, in transaction_pubevents
    yield
  Module ZPublisher.WSGIPublisher, line 338, in publish_module
    response = _publish(request, new_mod_info)
  Module uc.wsgi.wsgi, line 37, in ucms_publish
    return ZPublisher.WSGIPublisher.publish(request, module_name)
  Module ZPublisher.WSGIPublisher, line 256, in publish
    bind=1)
  Module ZPublisher.mapply, line 85, in mapply
    return debug(object, args, context)
  Module ZPublisher.WSGIPublisher, line 62, in call_object
    return obj(*args)
  Module uc.cms.defaultsite.skin.shortlinks, line 34, in redirect
    url = self._lookUpShortId(shortid)
  Module ...browser.shortlink, line 14, in _lookUpShortId
    for r in result:
  Module ZTUtils.Lazy, line 201, in __getitem__
    value = data[index] = self._func(self._seq[index])
  Module Products.ZCatalog.Catalog, line 128, in __getitem__
    return self.instantiate((key, self.data[key]),
KeyError: -1099272463

My problem is understood - and fortunately has nothing to do with memory corruption. Its real cause is request caching by Products.PluginIndexes.unindex.UnIndex, which uses a request key of the form '_catalogcache_{0}_{1}'.format(catalog.getId(), id(catalog)). Our migration script is started with bin/client1 run, and this implicitly arranges for the creation of a request object. The script then iterates over the portals (more than 1000) and migrates each portal in turn. Apparently, for some of those portals p, id(p.portal_catalog._catalog) is identical, and the index returns cached results from a formerly migrated catalog - results which, naturally, are not understood by the current one.
Workaround: clear the request before each portal migration.
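The failure mode above can be sketched in plain Python. This is not the real ZCatalog API; FakeCatalog, cached_search and fake_id are illustrative stand-ins, with fake_id simulating CPython reusing the same memory address (id()) for two distinct catalog objects:

```python
REQUEST = {}  # stands in for the Zope request object


class FakeCatalog:
    """Stand-in for a portal's catalog; maps query -> result ids."""
    def __init__(self, results):
        self.results = results


def cached_search(catalog, query, request, fake_id):
    # The real key is '_catalogcache_{0}_{1}'.format(catalog.getId(),
    # id(catalog)); fake_id simulates an id() reused across objects.
    key = '_catalogcache_{0}_{1}'.format('portal_catalog', fake_id)
    cache = request.setdefault(key, {})
    if query not in cache:
        cache[query] = catalog.results[query]
    return cache[query]


old_portal = FakeCatalog({'shortid': [111]})
new_portal = FakeCatalog({'shortid': [222]})
ADDR = 0x7F00DEAD  # the "reused" id() value

cached_search(old_portal, 'shortid', REQUEST, ADDR)
stale = cached_search(new_portal, 'shortid', REQUEST, ADDR)  # wrong portal's result

REQUEST.clear()  # the workaround: clear the request between portals
fresh = cached_search(new_portal, 'shortid', REQUEST, ADDR)
```

After the stale hit, `stale` holds the old portal's result ids; clearing the request makes the next search hit the current catalog again.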

I never understood why UnIndex must use request caching: it should be quite rare that the same request issues the same query against the same index more than once. This "feature" is dangerous, especially for scripts, for which there is no explicit request. It is aggravated by UnIndex forgetting to invalidate the cache on modifications: thus the second search in a "search, modify, search" sequence can return outdated results.
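The "search, modify, search" hazard can be shown with a minimal sketch (illustrative, not the real UnIndex code): the cached result is returned even after the index itself has changed:

```python
request_cache = {}          # stands in for the per-request query cache
index = {'foo': {1}}        # index value -> set of doc ids


def search(value):
    # Cache the result on first use; never invalidated on modification.
    if value not in request_cache:
        request_cache[value] = set(index.get(value, set()))
    return request_cache[value]


first = search('foo')       # {1}
index['foo'].add(2)         # modify the index; the cache is NOT invalidated
second = search('foo')      # still {1}: outdated
```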

@dieter Can you explain your statement, please.

Workaround: clear the request before each portal migration.

Jan via Plone Community wrote at 2020-2-14 09:56 +0000:

@dieter Can you explain your statement, please.

Workaround: clear the request before each portal migration.

We start the migration script with bin/client1 run.
The script has the structure:

for p in <portals>:
  ... migrate portal ...

The workaround is to give it instead the following structure:

for p in <portals>:
  app.REQUEST.clear()
  ... migrate portal ...

Possibly collective.catalogcleanup can help in detecting and optionally fixing problems. But it does not check indexes, except, I think, the UID index.

My problem returns every two days or so. As a workaround I directly get all (only 100 or so) objects from the database instead of using the catalog.

I'd really prefer to see a different solution, but I still don't know where to look.

I just found the following error in my log - which may be the cause for my problems:

2020-02-18 11:00:24,682 ERROR   [Zope.UnIndex:208][waitress] FieldIndex: unindex_object tried to retrieve set DateTime('2020/01/08 06:04:00 GMT+2') from index 'nextCheck' but couldn't.  This should not happen.

I have not checked the code, but the message suggests an inconsistency
between the "_unindex" ("doc_id -> index value") and "_index"
("index value -> doc_id set") data structures of the index -- and, as
the message indicates, it should not happen (i.e. we likely see
a bug).

Modern ZCatalog versions have a special index type for datetimes,
where the time resolution is limited and the result represented compactly
(as an "int", if I remember right). Maybe there is some
conversion problem (similar to rounding). What index type does
your "nextCheck" have?

"nextCheck" is a FieldIndex - probably for historical reasons - it was that way as long as I remember.

So, do you suggest to use a DateIndex instead?

I wonder why this always worked nicely, just with the update to Zope 4 the problems have started.

(Note: I replied already to the corresponding email from the forum - but this reply seems to have been lost somewhere. Should it resurface, there will be redundancy.)

The typical date indexes are very broad (lots of "terms", each
with few associated documents); thus, typical queries cause large
intermediate "or"s. In addition, Zope's DateTime has large pickles
and therefore requires a lot of space in the ZODB.
These potential (!) problems have motivated the design of
a specialized DateIndex with the following features:

  • configurable resolution (you rarely need subsecond resolution)
    to make the index narrower (smaller intermediate "or"s)
  • uses an integer to represent the datetime (lower storage requirements
    in the ZODB)
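The resolution reduction can be illustrated with a small sketch. The formula below is modeled on the idea behind DateIndex's minute-resolution integer conversion; the exact formula in Products.PluginIndexes may differ, so treat `to_minute_int` as illustrative:

```python
from datetime import datetime


def to_minute_int(dt):
    # Truncate to minute resolution and pack into a single int:
    # seconds (and anything finer) are simply dropped.
    return ((((dt.year * 12 + dt.month) * 31 + dt.day) * 24
             + dt.hour) * 60 + dt.minute)


a = to_minute_int(datetime(2020, 1, 8, 6, 4, 0))
b = to_minute_int(datetime(2020, 1, 8, 6, 4, 59))  # seconds are dropped
c = to_minute_int(datetime(2020, 1, 8, 6, 5, 0))   # next minute
```

Two datetimes differing only in seconds collapse onto the same index key, which is exactly the kind of "rounding" where edge cases could lurk.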

I do not think that switching to a DateIndex will fix your (current)
problems. I mentioned DateIndex because its resolution reduction
is similar to "rounding" and there might be edge cases
causing potential index corruption.

You wrote earlier that your site is not very busy. Maybe you
do not need to pack its (ZODB) storage frequently.
In this case, you could use dm.historical (it allows you to
restore historical state back to the most recent packing)
to find the transaction which introduced the inconsistency,
together with the state before (still good) and after (now bad)
that transaction.
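The search over history can be sketched generically. The states below are synthetic stand-ins (with dm.historical you would fetch real historical copies of the index); `first_bad_state` and `is_consistent` are hypothetical names, and the bisection assumes the inconsistency, once introduced, persists in all later states:

```python
def first_bad_state(states, is_consistent):
    """Binary search over states ordered oldest -> newest.

    Returns the position of the first inconsistent state, or None
    if every state passes the consistency check.
    """
    lo, hi = 0, len(states) - 1
    first_bad = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_consistent(states[mid]):
            lo = mid + 1          # breakage, if any, is later
        else:
            first_bad = mid       # remember candidate, look earlier
            hi = mid - 1
    return first_bad


# Synthetic history: consistent up to transaction 6, broken afterwards.
history = [{'ok': i < 7} for i in range(12)]
bad = first_bad_state(history, lambda s: s['ok'])
```

With O(log n) consistency checks this pinpoints the breaking transaction, so you only need to inspect the one good/bad state pair around it.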