Several POSKeyErrors in different sites

We currently have a bunch of Plone sites published with Volto.
They run different Plone versions (from 5.2.4 to 5.2.7) and have different setups (classic ZEO server + clients, and some with Docker + RelStorage), but they share a common codebase and content types.

In the past few days I've found three different websites with corrupted databases raising some POSKeyErrors.
Two of them have the "classic" setup (no RelStorage), and I found out that something was wrong because they raised POSKeyError when trying to move a folder or during scheduled packs.

I tried to inspect the ZODB following some guides ("PloneChix: The Definitive Guide to POSKeyError" and "POSKeyError Debugging and Fixing Cookbook" from Six Feet Up).

First of all I ran fstest.py on the Data.fs and it reported nothing.
Then I tried fsrefs.py and got a list of broken references like this:

oid 0x897af BTrees.OOBTree.OOBucket
last updated: 2022-02-10 10:39:35.804342, tid=0x3e6187f98c3e399
refers to invalid objects:
	oid 0x089655 missing: '<unknown>'
	oid 0x089656 missing: '<unknown>'
	oid 0x089657 missing: '<unknown>'
	oid 0x089658 missing: '<unknown>'
	oid 0x089659 missing: '<unknown>'
	oid 0x08965a missing: '<unknown>'
	oid 0x082356 missing: '<unknown>'
	oid 0x082357 missing: '<unknown>'
	oid 0x082358 missing: '<unknown>'
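
For reference, both scripts ship with ZODB; I ran them more or less like this (paths are illustrative and depend on the buildout, and I'm assuming a zopepy interpreter part is available):

> ./bin/zopepy ./path/to/ZODB/scripts/fstest.py var/filestorage/Data.fs
> ./bin/zopepy ./path/to/ZODB/scripts/fsrefs.py var/filestorage/Data.fs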

zodbverify also tells me something:

> ./bin/zodbverify -f var/filestorage/Data.fs
INFO:zodbverify:Scanning ZODB...
Traceback (most recent call last):
  File "./bin/zodbverify", line 316, in <module>
    sys.exit(zodbverify.__main__.main())
  File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/__main__.py", line 45, in main
    verify_zodb(storage, debug=options.debug)
  File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/verify.py", line 32, in verify_zodb
    oid, tid, data, next_ = storage.record_iternext(next_)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 1418, in record_iternext
    data, tid = load_current(self, oid)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/utils.py", line 385, in load_current
    r = storage.loadBefore(oid, maxtid)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x08ee8c

Going deeper:

> ./bin/zodbverify -f var/filestorage/Data.fs -o 0x08ee8c
INFO:zodbverify:Could not load object
INFO:zodbverify:Traceback (most recent call last):
  File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/verify_oid.py", line 41, in verify_oid
    obj = connection.get(oid)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 247, in get
    p, _ = self._storage.load(oid)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/mvccadapter.py", line 153, in load
    r = self._storage.loadBefore(oid, self._start)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x08ee8c

Traceback (most recent call last):
  File "./bin/zodbverify", line 316, in <module>
    sys.exit(zodbverify.__main__.main())
  File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/__main__.py", line 43, in main
    verify_oid(storage, options.oid, debug=options.debug)
  File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/verify_oid.py", line 58, in verify_oid
    pickle, state = storage.load(oid)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 545, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x08ee8c

I also started to inspect the logs and found that one day our daily pack script had started to fail because of POSKeyErrors:

2022-02-15T05:00:03 (:8028) pack(time=1644292803.8878546) started...
2022-02-15T05:01:57 (:8028) pack(time=1644292803.8878546) complete
2022-02-15T05:01:57 (:8028) disconnected
2022-02-16T00:09:27 (:8028) disconnected
2022-02-16T00:09:33 Connected server protocol
2022-02-16T00:09:33 received handshake 'Z5'
2022-02-16T00:09:58 (:8028) disconnected
2022-02-16T00:10:03 Connected server protocol
2022-02-16T00:10:03 received handshake 'Z5'
2022-02-16T05:00:02 Connected server protocol
2022-02-16T05:00:02 received handshake 'Z5'
2022-02-16T05:00:03 (:8028) pack(time=1644379203.3475773) started...
2022-02-16T05:00:11 Error raised in delayed method
Traceback (most recent call last):
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 268, in findReachableAtPacktime
    pos = oid2curpos[oid]
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/fsIndex.py", line 109, in __getitem__
    return str2num(self._data[key[:6]][key[6:]])
KeyError: b'\xe2W'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/xxx/eggs/ZEO-5.2.2-py3.8.egg/ZEO/StorageServer.py", line 1051, in run
    result = self._method(*self._args)
  File "/xxx/eggs/ZEO-5.2.2-py3.8.egg/ZEO/StorageServer.py", line 278, in _pack_impl
    self.storage.pack(time, referencesf)
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 1226, in pack
    pack_result = self.packer(self, referencesf, stop, gc)
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 1179, in packer
    opos = p.pack()
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 405, in pack
    self.gc.findReachable()
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 191, in findReachable
    self.findReachableAtPacktime([z64])
  File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 273, in findReachableAtPacktime
    raise KeyError(oid)
KeyError: b'\x00\x00\x00\x00\x00\x08\xe2W'
2022-02-16T05:00:11 (:8028) disconnected
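
As a side note, the oid in that KeyError is in its raw 8-byte form; it can be converted to the hex notation used by the other tools with ZODB.utils (a quick sketch):

from ZODB.utils import u64

# raw oid copied from the KeyError in the pack traceback above
raw_oid = b'\x00\x00\x00\x00\x00\x08\xe2W'
print(hex(u64(raw_oid)))  # 0x8e257, in the form accepted by ./bin/zodbverify -o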

So I looked into the undo form and found that an undo had been performed some days before. The pack probably started breaking only a few days later because it is configured with "days=7".
Anyway, I reverted all the transactions back to that undo (the undo included), and it fixed the database.
After that, the ZODB scripts didn't find any errors and the pack worked like a charm.
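
I did this through the undo form in the ZMI, but for reference the same can be done programmatically with the ZODB API; a minimal sketch (run it with the instance stopped, since FileStorage takes a lock, and note that undoing old transactions can still fail with an UndoError):

import transaction
from ZODB import DB
from ZODB.FileStorage import FileStorage

db = DB(FileStorage('var/filestorage/Data.fs'))

# list the most recent transactions, newest first;
# each entry has 'id', 'time', 'user_name', 'description', ...
entries = db.undoLog(0, 20)
for entry in entries:
    print(entry['time'], entry['description'])

# undo the chosen transaction (id taken from undoLog) and commit
db.undo(entries[0]['id'])
transaction.commit()
db.close()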

Another use case: we have a set of websites that are on PostgreSQL. I tried to export one with zodbconvert and use it locally as a Data.fs; it's a common operation that I do for debugging, roughly with the configuration sketched below.
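
Something along these lines (connection parameters and paths are placeholders), saved for example as export.conf and passed to the zodbconvert script:

<relstorage source>
  keep-history true
  <postgresql>
    dsn dbname='plone' host='localhost' user='plone' password='secret'
  </postgresql>
</relstorage>
<filestorage destination>
  path /path/to/exported/Data.fs
</filestorage>
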
One site appears to be corrupted, because when I try to access it I get a bunch of errors like this:

2022-03-17 16:23:39,849 ERROR   [ZODB.Connection:808][waitress-0] Couldn't load state for BTrees.OOBTree.OOBTree 0x1450
Traceback (most recent call last):
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/mvccadapter.py", line 153, in load
    r = self._storage.loadBefore(oid, self._start)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 567, in loadBefore
    h = self._read_data_header(pos, oid, _file)
  File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/format.py", line 152, in _read_data_header
    raise CorruptedDataError(oid, s, pos)
ZODB.FileStorage.format.CorruptedDataError: Error reading oid 0x1450.  Found b'\x00\x00\x0cYqNh\x00\x86qOQ\x86qP.\x00\x00\x00\x00\x00\x00\x14P\x03\xe6\xef[\xa9\x8b\x97w\x00\x00\x00\x00\x00!\xad\xdb\x00\x00' at 147965412

I can navigate the exported site and it looks good, but after the first transaction, when I restart it, it won't start anymore because of a "ZODB.FileStorage.FileStorage.CorruptedTransactionError".

Digging into this database with the previous tools, fsrefs.py and fstest.py don't find anything, but zodbverify reports some errors like on the other databases.

I mentioned that these are Volto sites because I don't know if there is some issue related to the restapi interaction (but I doubt it).
The common thing between these databases is that we performed some undos (I can't easily revert those undos in all three databases because they are very old transactions).

I tried entering debug mode to get more info about some of these oids, but I didn't get anything.
For some of these POSKeyErrors I can get the objects, and I've found that they are values from relation fields (a list of relations) with some broken RelationValue items.
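
In ./bin/instance debug, the inspection looks roughly like this (a sketch; the oid is one of those reported above, and app._p_jar is the connection of the debug session):

from ZODB.utils import p64

conn = app._p_jar  # 'app' is provided by ./bin/instance debug
obj = conn.get(p64(0x08ee8c))  # raises POSKeyError if the record is missing
print(obj.__class__)
print(getattr(obj, '__dict__', obj))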

I tried to delete these objects/relations in different ways, like resetting the field or rebuilding the relation catalog with collective.relationhelpers, but nothing worked.
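
One of those attempts was to unindex the broken relations directly from the zc.relation catalog, more or less like this (a sketch, run from a debug session; "obj" stands for the content object whose relation field is affected, and it only helps if the broken RelationValue can still be loaded at all):

import transaction
from zc.relation.interfaces import ICatalog
from zope.component import getUtility
from zope.intid.interfaces import IIntIds

catalog = getUtility(ICatalog)
intids = getUtility(IIntIds)

# find the relations registered for this object and drop the broken ones
for rel in list(catalog.findRelations({'from_id': intids.getId(obj)})):
    if rel.isBroken():
        catalog.unindex(rel)
transaction.commit()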

Has anyone had similar problems and found a way to fix them?

Could it really be an undo problem that does not work well with content that has relation fields?

Andrea Cecchi via Plone Community wrote at 2022-3-21 15:30 +0000:

...
ZODB.FileStorage.format.CorruptedDataError: Error reading oid 0x1450. Found b'\x00\x00\x0cYqNh\x00\x86qOQ\x86qP.\x00\x00\x00\x00\x00\x00\x14P\x03\xe6\xef[\xa9\x8b\x97w\x00\x00\x00\x00\x00!\xad\xdb\x00\x00' at 147965412

CorruptedDataError indicates an error related to the physical structure
of the data file. Buggy hardware or (in your case) a bug
in zodbconvert could cause such errors.

Maybe the error above has a different (less dramatic) cause.
Here, the exception is raised because it expects a specific "oid"
at the given position but finds something else.
Besides a physical problem, a bad index file might be the cause.
Usually, FileStorage detects and ignores outdated index files,
but copying might let the heuristics fail.
As a general rule, delete the index file when you copy over the
data file.

...
Could it really be an undo problem that does not work well with content that has relation fields?

"undo" might be buggy - but those bugs would not specifically
target relation fields.

What the "undo" of transaction "T" should do:
reset the state of all objects modified by "T" to that
just before "T".

This operation will restore a consistent state for the set
of objects modified by "T".
Other objects might become inconsistent -- if they have been
changed based on the now reverted state.
Thus, an undo may introduce subtle inconsistencies.