We have currently a bunch of Plone sites published with Volto.
They have different Plone versions (from 5.2.4 to 5.2.7) and have different setups (classic zeoserver+clients and some with docker+relstorage), but share a common codebase and content-types.
In the past days i've found 3 different websites with corrupted databases with some POSKeyErrors.
Two of them have the "classic" setup (no relstorage) and i've found that they had errors because they raised POSKeyError when trying to move a folder or during scheduled packs.
I tried to inspect zodb following some guides (PloneChix: The Definitive Guide to POSKeyError and POSKeyError Debugging and Fixing Cookbook — Six Feet Up).
First of all i tried to launch fstest on the Data.fs and it reported nothing.
Then i tried with fsrefs.py and i had a list of broken refs like this:
oid 0x897af BTrees.OOBTree.OOBucket
last updated: 2022-02-10 10:39:35.804342, tid=0x3e6187f98c3e399
refers to invalid objects:
oid 0x089655 missing: '<unknown>'
oid 0x089656 missing: '<unknown>'
oid 0x089657 missing: '<unknown>'
oid 0x089658 missing: '<unknown>'
oid 0x089659 missing: '<unknown>'
oid 0x08965a missing: '<unknown>'
oid 0x082356 missing: '<unknown>'
oid 0x082357 missing: '<unknown>'
oid 0x082358 missing: '<unknown>'
Also zodbverify tells me something:
> ./bin/zodbverify -f var/filestorage/Data.fs
INFO:zodbverify:Scanning ZODB...
Traceback (most recent call last):
File "./bin/zodbverify", line 316, in <module>
sys.exit(zodbverify.__main__.main())
File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/__main__.py", line 45, in main
verify_zodb(storage, debug=options.debug)
File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/verify.py", line 32, in verify_zodb
oid, tid, data, next_ = storage.record_iternext(next_)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 1418, in record_iternext
data, tid = load_current(self, oid)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/utils.py", line 385, in load_current
r = storage.loadBefore(oid, maxtid)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x08ee8c
going deeper:
> ./bin/zodbverify -f var/filestorage/Data.fs -o 0x08ee8c
INFO:zodbverify:Could not load object
INFO:zodbverify:Traceback (most recent call last):
File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/verify_oid.py", line 41, in verify_oid
obj = connection.get(oid)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 247, in get
p, _ = self._storage.load(oid)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/mvccadapter.py", line 153, in load
r = self._storage.loadBefore(oid, self._start)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x08ee8c
Traceback (most recent call last):
File "./bin/zodbverify", line 316, in <module>
sys.exit(zodbverify.__main__.main())
File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/__main__.py", line 43, in main
verify_oid(storage, options.oid, debug=options.debug)
File "/Users/cekk/.buildout/eggs/zodbverify-1.1.0-py3.8.egg/zodbverify/verify_oid.py", line 58, in verify_oid
pickle, state = storage.load(oid)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 545, in load
raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x08ee8c
I started also to inspect logs and i've found that one day our daily pack script started to fails because of POSKeyErrors:
2022-02-15T05:00:03 (:8028) pack(time=1644292803.8878546) started...
2022-02-15T05:01:57 (:8028) pack(time=1644292803.8878546) complete
2022-02-15T05:01:57 (:8028) disconnected
2022-02-16T00:09:27 (:8028) disconnected
2022-02-16T00:09:33 Connected server protocol
2022-02-16T00:09:33 received handshake 'Z5'
2022-02-16T00:09:58 (:8028) disconnected
2022-02-16T00:10:03 Connected server protocol
2022-02-16T00:10:03 received handshake 'Z5'
2022-02-16T05:00:02 Connected server protocol
2022-02-16T05:00:02 received handshake 'Z5'
2022-02-16T05:00:03 (:8028) pack(time=1644379203.3475773) started...
2022-02-16T05:00:11 Error raised in delayed method
Traceback (most recent call last):
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 268, in findReachableAtPacktime
pos = oid2curpos[oid]
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/fsIndex.py", line 109, in __getitem__
return str2num(self._data[key[:6]][key[6:]])
KeyError: b'\xe2W'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/xxx/eggs/ZEO-5.2.2-py3.8.egg/ZEO/StorageServer.py", line 1051, in run
result = self._method(*self._args)
File "/xxx/eggs/ZEO-5.2.2-py3.8.egg/ZEO/StorageServer.py", line 278, in _pack_impl
self.storage.pack(time, referencesf)
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 1226, in pack
pack_result = self.packer(self, referencesf, stop, gc)
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 1179, in packer
opos = p.pack()
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 405, in pack
self.gc.findReachable()
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 191, in findReachable
self.findReachableAtPacktime([z64])
File "/xxx/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/fspack.py", line 273, in findReachableAtPacktime
raise KeyError(oid)
KeyError: b'\x00\x00\x00\x00\x00\x08\xe2W'
2022-02-16T05:00:11 (:8028) disconnected
So i looked into the undo form and found that it was made an undo some days before. Probably the pack started breaking some days after because it's set with "days=7".
Anyway, i tried to revert all the transactions until that undo (included the undo) and it fixed the database.
After that action, the zodb scripts didn't found any error and the pack worked like a charm.
Another use-case: we have a set of websites that are on PostgreSQL. I tried to export one with zodbconvert and use it locally as Data.fs. It's a common operation that i do for debug.
There is one site that appears as corrupted because when i try to access to it i have a set of
2022-03-17 16:23:39,849 ERROR [ZODB.Connection:808][waitress-0] Couldn't load state for BTrees.OOBTree.OOBTree 0x1450
Traceback (most recent call last):
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
p, serial = self._storage.load(oid)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/mvccadapter.py", line 153, in load
r = self._storage.loadBefore(oid, self._start)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/FileStorage.py", line 567, in loadBefore
h = self._read_data_header(pos, oid, _file)
File "/Users/cekk/.buildout/eggs/ZODB-5.6.0-py3.8.egg/ZODB/FileStorage/format.py", line 152, in _read_data_header
raise CorruptedDataError(oid, s, pos)
ZODB.FileStorage.format.CorruptedDataError: Error reading oid 0x1450. Found b'\x00\x00\x0cYqNh\x00\x86qOQ\x86qP.\x00\x00\x00\x00\x00\x00\x14P\x03\xe6\xef[\xa9\x8b\x97w\x00\x00\x00\x00\x00!\xad\xdb\x00\x00' at 147965412
I can navigate the exported site and it looks good, but after the first transaction, when i restart it, i cannot start it anymore because of a "ZODB.FileStorage.FileStorage.CorruptedTransactionError".
Digging this database with previous tools, i don't find anything with fsrefs.py and fstest.py, but zodbverify found some errors like the other database.
I mentioned that are Volto sites because i don't know if there is some issue related to restapi interaction (but i doubt that).
The common thing with these databases is that we made some undo (i can't easily revert these undos in all three databases because they're very old transactions).
I tried entering into debug mode to try to get more infos about some of these oids, but i didn't get anything.
For some of these POSKeys i can get the objects and i've found that they're values from relation fields (a list of relations) that have some RelationValue items broken.
I tried to delete these objects/relations in different ways like resetting the field or rebuild relation catalog with collective.relationhelpers but nothing worked.
Does someone had similar problems and knows a way to fix them?
Could it really be an undo problem that does not works well with contents with relation fields?