We are running into a strange problem when copying the ZODB from one server to another.
Background: our CI system automatically migrates a client intranet with 10k users and 50 GB of data. Once the migration has succeeded and our internal tests pass, we copy the database over to our dev server, so that we can always check the latest migrated version.
As said, this process is fully automated, so it should always lead to the same result. However, every second or third time we end up with a broken ZODB, and the error messages we see vary. For instance:
Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 359, in publish_module
  Module ZPublisher.WSGIPublisher, line 262, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module ZPublisher.WSGIPublisher, line 63, in call_object
  Module zope.browserpage.simpleviewclass, line 41, in __call__
  Module Products.Five.browser.pagetemplatefile, line 126, in __call__
  Module Products.Five.browser.pagetemplatefile, line 61, in __call__
  Module zope.pagetemplate.pagetemplate, line 135, in pt_render
  Module Products.PageTemplates.engine, line 367, in __call__
  Module z3c.pt.pagetemplate, line 176, in render
  Module chameleon.zpt.template, line 307, in render
  Module chameleon.template, line 214, in render
  Module chameleon.utils, line 75, in raise_with_traceback
  Module chameleon.template, line 192, in render
  Module b83dcd251489db587a0223b02eddc82b, line 1982, in render
  Module 41467522d3013025e833d171640eff27, line 370, in render_master
  Module zope.contentprovider.tales, line 76, in __call__
  Module zope.viewlet.manager, line 155, in update
  Module zope.viewlet.manager, line 161, in _updateViewlets
  Module plone.app.layout.viewlets.httpheaders, line 12, in update
  Module plone.app.layout.viewlets.common, line 74, in update
  Module plone.memoize.view, line 59, in memogetter
  Module plone.app.layout.globals.portal, line 68, in navigation_root_url
  Module plone.memoize.view, line 59, in memogetter
  Module plone.app.layout.globals.portal, line 64, in navigation_root_path
  Module plone.app.layout.navigation.root, line 38, in getNavigationRoot
  Module plone.registry.registry, line 47, in get
  Module ZODB.Connection, line 795, in setstate
  Module ZODB.serialize, line 634, in setGhostState
SystemError: new style getargs format but argument is not a tuple
The solution to this problem is to delete all ".zec" and ".zec.lock" files from the /tmp directory and restart ZEO and the instances. After this, the problem is gone in 100% of the cases.
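In case someone wants to script this cleanup, here is a minimal sketch of the file-deletion step. The function name `remove_zeo_cache_files` and its `dry_run` flag are made up for illustration and are not part of ZEO or Plone. It only looks directly inside the given directory (not in subdirectories) and assumes ZEO and the instances are stopped before it runs and restarted afterwards.

```python
from pathlib import Path


def remove_zeo_cache_files(directory="/tmp", dry_run=True):
    """Find (and optionally delete) *.zec and *.zec.lock files in `directory`.

    Hypothetical helper for illustration only. Returns the sorted list of
    matching paths. With dry_run=True nothing is deleted.
    """
    base = Path(directory)
    # "*.zec" does not match "name.zec.lock", so both patterns are needed.
    matches = sorted(
        path
        for pattern in ("*.zec", "*.zec.lock")
        for path in base.glob(pattern)
        if path.is_file()
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_zeo_cache_files("/tmp")` only lists what would be removed; passing `dry_run=False` actually deletes the files.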
A few things are strange IMHO:
a) why are the .zec files in /tmp instead of in the instance directories? We use PM2 to run the processes, so that might be the cause. I'm not sure whether it is related to the problem, though.
b) why do we see random failures (every second or third instance is broken) and why do we see random error traces?
c) why do I see only one .zec file in the /tmp directory? We run 20 other instances with different configurations on the same server. Even if they used the same file and overwrote each other, we would see multiple files (because some use "instance", while some ZEO setups use "instance1", "instance2", etc.). From the filenames, it seems PM2 is taking care of this.
I know PM2 is not widely used in the Plone community, so it is easy to suspect it as the culprit. However, we have never had a single issue with it in prod, and it is more widely used and has a larger dev community than supervisor or any other process manager from the Python world. Therefore I'd rather suspect a ZODB copy problem or a combination of both.
We will investigate the issue further, but I wanted to share this in case someone else runs into it. Maybe someone has a hunch about what is causing this.
Maybe it is something related to different Python versions? We use Python 3.7.9 on the Jenkins node and 3.7.5 on the dev server.