Error after copying ZODB to new server: "SystemError: new style getargs format but argument is not a tuple" (deleting zec files solves the problem)

We keep running into a strange problem when copying the ZODB from one server to another.

Background: our CI system automatically migrates a client intranet with 10k users and 50 GB of data. Once the migration has succeeded and our internal tests pass, we copy the database over to our dev server so that we can always check the latest migrated version.

As mentioned, this process is fully automated and should always lead to the same result. However, every second or third time we end up with a broken ZODB and see varying error messages, for instance:

Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 359, in publish_module
  Module ZPublisher.WSGIPublisher, line 262, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module ZPublisher.WSGIPublisher, line 63, in call_object
  Module zope.browserpage.simpleviewclass, line 41, in __call__
  Module Products.Five.browser.pagetemplatefile, line 126, in __call__
  Module Products.Five.browser.pagetemplatefile, line 61, in __call__
  Module zope.pagetemplate.pagetemplate, line 135, in pt_render
  Module Products.PageTemplates.engine, line 367, in __call__
  Module z3c.pt.pagetemplate, line 176, in render
  Module chameleon.zpt.template, line 307, in render
  Module chameleon.template, line 214, in render
  Module chameleon.utils, line 75, in raise_with_traceback
  Module chameleon.template, line 192, in render
  Module b83dcd251489db587a0223b02eddc82b, line 1982, in render
  Module 41467522d3013025e833d171640eff27, line 370, in render_master
  Module zope.contentprovider.tales, line 76, in __call__
  Module zope.viewlet.manager, line 155, in update
  Module zope.viewlet.manager, line 161, in _updateViewlets
  Module plone.app.layout.viewlets.httpheaders, line 12, in update
  Module plone.app.layout.viewlets.common, line 74, in update
  Module plone.memoize.view, line 59, in memogetter
  Module plone.app.layout.globals.portal, line 68, in navigation_root_url
  Module plone.memoize.view, line 59, in memogetter
  Module plone.app.layout.globals.portal, line 64, in navigation_root_path
  Module plone.app.layout.navigation.root, line 38, in getNavigationRoot
  Module plone.registry.registry, line 47, in get
  Module ZODB.Connection, line 795, in setstate
  Module ZODB.serialize, line 634, in setGhostState
SystemError: new style getargs format but argument is not a tuple

The solution to this problem is to delete all ".zec" and ".zec.lock" files from the /tmp directory and restart ZEO and the instances. After this, the problem is gone in 100% of cases.
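For anyone who wants to script that cleanup, here is a minimal sketch. The function name `remove_zeo_cache_files` and the `dry_run` flag are made up for illustration; the only things taken from the description above are the two file suffixes and the /tmp location. It only looks at files directly inside the given directory and does not restart anything.

```python
from pathlib import Path


def remove_zeo_cache_files(directory, dry_run=True):
    """Find (and optionally delete) *.zec and *.zec.lock files in `directory`.

    Hypothetical helper: returns the sorted list of matching paths.
    With dry_run=True nothing is deleted, so the matches can be reviewed first.
    """
    directory = Path(directory)
    matches = sorted(
        path
        for pattern in ("*.zec", "*.zec.lock")
        for path in directory.glob(pattern)
        if path.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_zeo_cache_files("/tmp")` lists the candidates, and `remove_zeo_cache_files("/tmp", dry_run=False)` removes them. It is probably safest to stop ZEO and the instances before deleting and to start them again afterwards.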

A few things are strange IMHO:

a) Why are the zec files in /tmp instead of in the instance directories? We are using pm2 to run the processes, so that might be the cause. I'm not sure whether it is related to the problem, though.

b) why do we see random failures (every second or third instance is broken) and why do we see random error traces?

c) Why do I see only one zec file in the /tmp directory? We run 20 other instances with different configurations on the same server. Even if they used the same file and overwrote each other, we would see multiple files (because some use "instance", and some ZEO setups use "instance1", "instance2", etc.). From the filenames, it seems pm2 is taking care of this.

I know PM2 is not widely used in the Plone community, so it is easy to suspect it as the cause. However, we have never had a single issue with it in prod, and it is more widely used and has a larger dev community than supervisor or any other process manager from the Python world. Therefore I'd rather suspect a ZODB copy problem, or a combination of both.

We will investigate the issue further, but I wanted to share this in case someone else runs into it. Maybe someone has a hunch about what is causing this.

Maybe it is related to different Python versions? We use Python 3.7.9 on the Jenkins node and 3.7.5 on the dev server.

.zec files are the persistent ZEO cache files, created when you enable the cache with the zeo-client-client option. IIRC there's another parameter that controls how far the zeoserver's object update counter and the last object stored in the zeo-client cache may differ before the local persistent zeo-client cache is thrown away completely.

Maybe there is a binary difference between Python 3.7.5 and 3.7.9 (unlikely), or your local zeo-client cache is somehow still being used when it should have been thrown away because the zeoserver database has been replaced completely.
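To make the "thrown away vs. still used" idea concrete, here is a toy model of the rule as I described it from memory. It is not ZEO's actual code: `cache_is_usable`, `server_counter`, `cache_counter` and `max_lag` are all hypothetical names, and the real check may work quite differently.

```python
def cache_is_usable(server_counter, cache_counter, max_lag):
    """Toy rule: keep the client cache only if it trails the server by at most max_lag updates.

    All three arguments are hypothetical integers used for illustration only.
    """
    lag = server_counter - cache_counter
    # In this toy model a negative lag means the cache claims to be newer
    # than the server, e.g. after the server database was swapped for a copy.
    return 0 <= lag <= max_lag
```

In this model, a cache written against the old database would be rejected as soon as the counters no longer line up. If something like this check is skipped or passes by coincidence after the database is replaced, the client could end up serving cached records that do not belong to the new database.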

Perhaps Python pickle protocol 4 vs 5?
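One way to check this would be to compare what each interpreter reports and which protocol a given pickle declares. The sketch below uses only the standard library; the helper names `pickle_protocol` and `describe_interpreter` are made up for illustration. As far as I know, protocol 5 only arrived with Python 3.8, so both 3.7.5 and 3.7.9 should report 4 as their highest protocol, but running this on both hosts would confirm it.

```python
import pickle
import pickletools
import sys


def pickle_protocol(data):
    """Return the protocol number declared at the start of a pickle.

    Protocols 2 and up begin with a PROTO opcode carrying the number.
    Protocols 0 and 1 have no such opcode, so None is returned for them.
    """
    opcode, arg, _pos = next(pickletools.genops(data))
    return arg if opcode.name == "PROTO" else None


def describe_interpreter():
    """Summarize the Python version and the pickle protocols it supports."""
    return {
        "python": sys.version.split()[0],
        "highest_protocol": pickle.HIGHEST_PROTOCOL,
        "default_protocol": pickle.DEFAULT_PROTOCOL,
    }
```

Running `describe_interpreter()` on the Jenkins node and on the dev server and comparing the two dictionaries would show whether the interpreters differ in this respect at all.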
