A POSKeyError in my index

I've been through POSKeyErrors before, but this one gets me.

First, I read this: ZODB Database debugging - Starzel.de, and to a lesser extent, this: Best practice documentation on ZODB Debugging.

#1 I'm using relstorage, so the zodbverify script isn't working
#2 I'm happy to blow these oids away, but:

>>> del app._p_jar[p64(0x01fbca10)]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: 'Connection' object does not support item deletion

Is this a relstorage feature also?

and the "coolest" part of my corruption is this:

>>> catalog = app.PloneSite.portal_catalog
>>> catalog._catalog
<Products.ZCatalog.Catalog.Catalog object at 0x7f8e7d458200 oid 0xb15 in <Connection at 7f8e82874b80>>
>>> [rid for rid in catalog._catalog.data]
WARNING:relstorage:POSKeyError on oid 33278480: no tid found; Current transaction is 282214376604304025; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.IOBTree.IOBucket 0x01fbca10
Traceback (most recent call last):
  File "/plone_install/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 72, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/plone_install/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca10
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 1, in <listcomp>
  File "/plone_install/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 72, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/plone_install/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca10

Now, I might have been onto something with @pbauer's path #4 - replacing the oid with a dummy:

>>> from persistent import Persistent
>>> from ZODB.utils import p64
>>> import transaction
>>> broken_oid = 0x01fbca14
>>> dummy = Persistent()
>>> dummy._p_oid = p64(broken_oid)
>>> dummy._p_jar = app._p_jar
>>> app._p_jar._register(dummy)
>>> app._p_jar._added[dummy._p_oid] = dummy
>>> transaction.commit()

But this isn't a content object, it's an index entry, so now I get something even more bizarre - this comes from an actual browser view of @@search:

  Module Products.ZCatalog.ZCatalog, line 625, in searchResults
  Module Products.ZCatalog.Catalog, line 1091, in searchResults
  Module Products.ZCatalog.Catalog, line 634, in search
  Module Products.ZCatalog.Catalog, line 564, in _search_index
  Module Products.ZCTextIndex.ZCTextIndex, line 210, in query_index
  Module Products.ZCTextIndex.ParseTree, line 132, in executeQuery
  Module Products.ZCTextIndex.BaseIndex, line 216, in search_phrase
  Module ZODB.Connection, line 795, in setstate
  Module ZODB.serialize, line 634, in setGhostState
SystemError: new style getargs format but argument is not a tuple

Question: what dummy 'type' do I need to replace the oid with to make the POSKeyError go away? Persistent() is not the correct answer here, it seems.
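
(A hedged guess, going by the tracebacks above: the connection expects a BTrees.IOBTree.IOBucket at that oid, so a dummy of the referenced class - not a plain Persistent - might at least unpickle cleanly. Untested sketch, same registration trick as above:)

>>> from BTrees.IOBTree import IOBucket
>>> from ZODB.utils import p64
>>> import transaction
>>> broken_oid = 0x01fbca10
>>> dummy = IOBucket()   # an empty bucket of the class the parent references
>>> dummy._p_oid = p64(broken_oid)
>>> dummy._p_jar = app._p_jar
>>> app._p_jar._register(dummy)
>>> app._p_jar._added[dummy._p_oid] = dummy
>>> transaction.commit()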

Update: it is certainly in my metadata index, which is really fun.

I would like to document / update the procedure:

zodbpack --prepack with relstorage

Plone 5.2 / Python 3

First, add stuff to buildout...

parts += relstorage-zodbpack-conf
         relstorage-zodbpack

Make sure to match your RelStorage version on production.
I put "RelStorage==2.1.1" in my eggs and buildout didn't give me an error :slight_smile:

[relstorage-zodbpack]
recipe = zc.recipe.egg:scripts
eggs =
    RelStorage==2.1.1
scripts = zodbpack
initialization =
    config = '${buildout:directory}/etc/zodbpack.conf'
arguments = config

You need a config file for zodbpack. It's basically the <relstorage> section straight out of your zope.conf.

This recipe creates the config file for you, if you want. Just make sure that:

  • pack-gc is true
  • create-schema is false
  • it points to your broken database.

[relstorage-zodbpack-conf]
recipe = collective.recipe.template
input = inline:
    <relstorage>
        pack-gc true
        create-schema false
      <mysql>
        host localhost
        port 3306
        db plone_borked
        user plone
        passwd plone
      </mysql>
    </relstorage>
output = ${buildout:directory}/etc/zodbpack.conf

Now, the zodbpack script is broken; you have to fix it. Create a ticket on GitHub while you're at it.

zodbpack: error: argument config_file: can't open 'h': [Errno 2] No such file or directory: 'h'

That 'h' is because the full path to my config file is '/home/flipmcf/path/to/instance'. Yes folks, strings are iterables in Python.
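
(To illustrate, assuming relstorage.zodbpack.main() does the usual argparse pass over argv[1:] - a bare string sliced like argv yields single characters:)

>>> config = '/home/flipmcf/path/to/etc/zodbpack.conf'
>>> config[0]    # consumed as the "program name", argv[0]
'/'
>>> config[1]    # the first "argument" argparse tries to open
'h'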

Edit bin/zodbpack and remove the 'config' argument:

if __name__ == '__main__':
    sys.exit(relstorage.zodbpack.main())

Then run: bin/zodbpack --prepack etc/zodbpack.conf

The 'bug' is that args aren't passed correctly into relstorage.zodbpack.main(), so we will just put everything in argv.

no module named mysql and/or MySQLdb

Unable to use the driver 'None' for the database 'mysql'. Available drivers are: [].

Throw this into your buildout:

eggs +=
    mysqlclient
    PyMySQL

One of those will fix it.
But, to be honest, I really hacked at the ./bin/zodbpack script and just added egg paths I found in ./bin/instance until it 'worked'. :man_shrugging:

Back to my unique POSKeyError. As shown by this little guy in ./bin/instance debug, it's the IOBTree iterator that's blowing up while reading metadata:

generator = catalog.getAllBrains()
while True:
    try:
        brain = next(generator)
        ## brain.getObject()
    except StopIteration:
        break
    except Exception:
        import pdb; pdb.set_trace()

I DO NOT get the object; the commented line is there because that's typically what you do to generate a POSKeyError, but no getObject() is needed to make it blow up.

the POSKeyError is generated during the __next__() call on the generator. At this point, I'm inside the IOBTree "walk" implementation, and I haven't opened the C code there much. I'd like to... I mean, that's cool stuff, but I'm kind of busy right now.

References:
https://zodb.org/en/latest/articles/multi-zodb-gc.html

flipmcf via Plone Community wrote at 2022-8-31 20:27 +0000:

...
the POSKeyError is generated during the next() call on the generator. At this point, I'm inside the IOBTree "walk" implementation, and I haven't opened the C code there much. I'd like to... I mean, that's cool stuff, but I'm kind of busy right now.

I see two options:

  1. find the object[s] with the dangling reference[s] and fix it/them.
    The BTrees package contains a module which allows one to
    (recursively) "split" a tree into its persistent subobjects.

    This may be instructive but will be tedious

  2. clear and rebuild the catalog

    This should be straightforward but will not give you any
    insight into the potential corruption cause.


Sigh.... Not working. At least with this method:
ZMI -> Portal Catalog
Clicking on the 'Catalog' tab gives an error

  Module DocumentTemplate.DT_In, line 482, in renderwb
  Module ZTUtils.Lazy, line 201, in __getitem__
  Module Products.ZCatalog.Catalog, line 122, in __getitem__
SystemError: <built-in function isinstance> returned a result with an error set

The Advanced tab's "Clear and Rebuild" I "hear" that others have tried, but I haven't tried it myself. I'm going to click it right now!

Update: Broken.

Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 371, in publish_module
  Module ZPublisher.WSGIPublisher, line 266, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module Products.PDBDebugMode.wsgi_runcall, line 60, in pdb_runcall
  Module Products.CMFPlone.CatalogTool, line 533, in manage_catalogRebuild
  Module Products.CMFPlone.CatalogTool, line 519, in clearFindAndRebuild
  Module OFS.FindSupport, line 173, in ZopeFindAndApply
  Module OFS.FindSupport, line 173, in ZopeFindAndApply
  Module OFS.FindSupport, line 118, in ZopeFindAndApply
  Module ZTUtils.Lazy, line 201, in __getitem__
  Module plone.folder.ordered, line 102, in <lambda>
  Module plone.folder.ordered, line 67, in _getOb
AttributeError: 0x073afa

It's extremely instructive and very tedious, yes.

>>> obj
<BTrees.OIBTree.OIBTree object at 0x7fc9ec0ff840 oid 0x1fba3d8 in <Connection at 7fc9f19efc70>>
>>> [x for x in obj.items('410','411')]
[('41017-20000917.html', 4793), ('41018-20000917.html', 4794), ('41019-20000917.html', 4795), ('41020-20000917.html', 4796), ('41096-20000918.html', 4797), ('41097-20000918.html', 4798), ('41098-20000918.html', 4799)]
>>> [x for x in obj.items('411','412')]
WARNING:relstorage:POSKeyError on oid 33268864: no tid found; Current transaction is 282221264778342741; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.OIBTree.OIBucket 0x01fba480
Traceback (most recent call last):
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fba480
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fba480

But for some very strange reason, I get some kind of enjoyment out of this.

The economist in me wants the catalog rebuild to work.
The scientist in me wants to slice more b-trees and see if I can prune a single bad element out of it, and commit the transaction.

I've also been following the chain of parent objects, getting hints.


Related: how do you delete an oid using relstorage? I have 2.1.1 specifically.

flipmcf via Plone Community wrote at 2022-8-31 22:52 +0000:

...

It's extremely instructive and very tedious, yes.

>>> obj
<BTrees.OIBTree.OIBTree object at 0x7fc9ec0ff840 oid 0x1fba3d8 in <Connection at 7fc9f19efc70>>
>>> [x for x in obj.items('410','411')]
[('41017-20000917.html', 4793), ('41018-20000917.html', 4794), ('41019-20000917.html', 4795), ('41020-20000917.html', 4796), ('41096-20000918.html', 4797), ('41097-20000918.html', 4798), ('41098-20000918.html', 4799)]
>>> [x for x in obj.items('411','412')]
WARNING:relstorage:POSKeyError on oid 33268864: no tid found; Current transaction is 282221264778342741; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.OIBTree.OIBucket 0x01fba480

While this gives information about where in the tree the corruption
happened, it does not tell you which persistent (sub-) object you
must fix to get rid of the dangling reference.

Keep in mind that a BTree can be complex; more precisely, that
it may consist of several persistent subobjects.
One (or maybe several) of those subobjects contains a dangling reference.
You need tricks to get it fixed.
BTrees.check contains infrastructure to access the subobjects
of a BTree. If the dangling reference is not in a leaf (called "bucket"),
then a fix must operate on the pickle state of the subobject.

While you may be able to remove the dangling reference (along those lines),
it will (almost surely) lose information; i.e. the catalog remains
inconsistent.

I think your best approach is the "clear and rebuild catalog".
This will give you consistency again.

...
WARNING:relstorage:POSKeyError on oid 33268864: no tid found; Current transaction is 282221264778342741; Recent object tids:

This warning indicates an error on a very low level
(relstorage or lower). I would not trust your relstorage
setup.

...
Related: how do you delete an oid using relstorage? I have 2.1.1 specifically.

ZODB storages may have a deleteObject method -- to support
"external garbage collection" (therefore, this method
is defined in the IExternalGC interface).
Unless you implement "external garbage collection",
it is highly unlikely that you should use this method
(even if relstorage may implement it).
Keep in mind that deleting an object still reachable from
the ZODB root will create a dangling reference.
This will result in a POSKeyError when you use the dangling
reference.

With the ZODB, you usually do not delete objects. Instead,
you change their "container" to no longer contain/reference them.
Once an object becomes unreachable (from the ZODB root),
some periodic maintenance
(called packing for a FileStorage; no idea what it is called
for relstorage) will remove it.
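
(A hedged sketch of that "change the container" route, using hypothetical names from this thread; the later pack is what actually reclaims the object:)

folder = app.unrestrictedTraverse('/PloneSite/korean/in-focus')  # hypothetical container path
folder.manage_delObjects(['41543-20000925.html'])  # drops the reference, not the object
import transaction
transaction.commit()
# a later pack (zodbpack for RelStorage) then removes the now-unreachable object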

EDIT: Clear and Rebuild throws a POSKeyError

You write POSKeyError here but below I see AttributeError.

Traceback (innermost last):
...
 Module Products.CMFPlone.CatalogTool, line 519, in clearFindAndRebuild
 Module OFS.FindSupport, line 173, in ZopeFindAndApply
 Module OFS.FindSupport, line 173, in ZopeFindAndApply
 Module OFS.FindSupport, line 118, in ZopeFindAndApply
 Module ZTUtils.Lazy, line 201, in __getitem__
 Module plone.folder.ordered, line 102, in <lambda>
 Module plone.folder.ordered, line 67, in _getOb
AttributeError: 0x073afa

With Plone, some object containers use the catalog to
implement their object* methods, among them
objectItems used by ZopeFindAndApply and therefore clearFindAndRebuild.
Thus, a broken catalog may cause clearFindAndRebuild to fail.

There are 2 ways to rebuild the catalog:

  1. clearFindAndRebuild locates all objects via calls to
    objectItems, then clears the catalog completely and reindexes
    all found objects.

  2. a second approach assumes that the catalog's knowledge about
    the objects to be indexed is mostly okay.
    It keeps this information, then clears the catalog
    and reindexes the remembered objects, ignoring
    no-longer-existing ones.

You tried the first approach -- and failed.
You may try the second approach and see what happens.
Of course, no guarantee that this works --
if your storage is sufficiently broken, only a backup will
save you.


This is helpful @dieter. Thanks.

FWIW - we've got an instance up from backup that's about a week old. I've been given one more day to see if I can repair. My confidence is low, but it's a great learning experience.

First, let me clear up this confusion: the AttributeError on clearFindAndRebuild is a POSKeyError that was likely caught and re-raised as an AttributeError. I'd rather not go down that hole, because I don't think it helps, but I'm clearing up the confusion anyway. I was just lazy and only copied the exception printed to the webpage.
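
(Why the masquerade works: POSKeyError subclasses KeyError, so the generic except-KeyError handler in plone.folder.ordered._getOb - visible as raise AttributeError(e) in the pdb frames below - swallows it and re-raises with the oid as the message:)

>>> from ZODB.POSException import POSKeyError
>>> issubclass(POSKeyError, KeyError)
True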

Here's what I get out of a foreground process when trying "Clear and Rebuild" from the ZMI.

Starting server in PID 3043372.
Serving on http://0.0.0.0:8884
2022-09-01 11:08:32,846 WARNING [relstorage:557][waitress-1] POSKeyError on oid 473850: no tid found; Current transaction is 2
82226532437355741; Recent object tids: []
2022-09-01 11:08:32,847 ERROR   [ZODB.Connection:38][waitress-1] Couldn't load state for BTrees.OOBTree.OOBucket 0x073afa
Traceback (most recent call last):
  File "../cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "../cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x073afa
> ../cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py(587)load()
-> raise POSKeyError(oid)
(Pdb) c
2022-09-01 11:08:38,595 ERROR   [Zope.SiteErrorLog:17][waitress-1] 1662044918.5948790.6028358595821579 https://mysite/portal_catalog/manage_catalogRebuild
Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 371, in publish_module
  Module ZPublisher.WSGIPublisher, line 266, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module Products.PDBDebugMode.wsgi_runcall, line 60, in pdb_runcall
  Module Products.CMFPlone.CatalogTool, line 533, in manage_catalogRebuild
  Module Products.CMFPlone.CatalogTool, line 519, in clearFindAndRebuild
  Module OFS.FindSupport, line 173, in ZopeFindAndApply
  Module OFS.FindSupport, line 173, in ZopeFindAndApply
  Module OFS.FindSupport, line 118, in ZopeFindAndApply
  Module ZTUtils.Lazy, line 201, in __getitem__
  Module plone.folder.ordered, line 102, in <lambda>
  Module plone.folder.ordered, line 67, in _getOb
AttributeError: 0x073afa
> ../cache/eggs/plone.folder-3.1.0-py3.8.egg/plone/folder/ordered.py(67)_getOb()
-> raise AttributeError(e)

Follow me down this rabbit hole. We are on the same page, and that makes me feel pretty confident.

We have only a handful of oids that are really troubling us, and I'll focus on one:

0x01fba480 33268864 (hex / decimal)
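
(Quick sanity check on that hex/decimal pairing:)

>>> 0x01fba480
33268864
>>> hex(33268864)
'0x1fba480'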

First, I run zodbpack --prepack (the pre-pack phase populates RelStorage's object_ref table with the reference graph), then I find the parents:

mysql> SELECT zoid FROM object_ref where to_zoid = 33268864;
+----------+
| zoid     |
+----------+
| 33268696 |
| 33268863 |
+----------+

And I will work these in order, starting with 33268696:

from ZODB.utils import p64
parent = 33268696
obj = app._p_jar.get(p64(parent))

<BTrees.OIBTree.OIBTree object at 0x7f0930838840 oid 0x1fba3d8 in <Connection at 7f0936135700>>

obj is the parent of the broken oid

Path 1:

https://btrees.readthedocs.io/en/latest/_modules/BTrees/check.html

>>> from BTrees.check import check
>>> check(obj)
WARNING:relstorage:POSKeyError on oid 33268864: no tid found; Current transaction is 282226532437355741; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.OIBTree.OIBucket 0x01fba480
Traceback (most recent call last):
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fba480
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/BTrees-4.9.2-py3.8-linux-x86_64.egg/BTrees/check.py", line 462, in check 
    Checker(btree).check()
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/BTrees-4.9.2-py3.8-linux-x86_64.egg/BTrees/check.py", line 371, in check 
    self.walk()
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/BTrees-4.9.2-py3.8-linux-x86_64.egg/BTrees/check.py", line 354, in walk  
    keys, values = crack_bucket(obj, is_mapping)
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/BTrees-4.9.2-py3.8-linux-x86_64.egg/BTrees/check.py", line 231, in crack_bucket
    state = b.__getstate__()
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fba480

yay!

Path 2:

I like this, and it's what I was doing. It's an OIBTree, so I manually bisected and found the one bucket that was broken: "41977-20001002.html".
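
(Here's roughly how that bisection goes, as a sketch; probe() is a hypothetical helper, and iteritems(min, max) iterates a key range:)

from ZODB.POSException import POSKeyError

def probe(tree, lo, hi):
    """True if every bucket covering keys [lo, hi] loads; False on POSKeyError."""
    try:
        for _ in tree.iteritems(lo, hi):
            pass
        return True
    except POSKeyError:
        return False

probe(obj, '410', '411')   # True  - this range iterates fine (see above)
probe(obj, '411', '412')   # False - the broken bucket lives in this range

Halve the failing key range until it spans a single bucket.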

So, check it out: let's create an entirely new parent without that value. I'm so close here....

from BTrees.OIBTree import OIBTree
head = list(obj.iteritems('-','410'))
tail = list(obj.iteritems('41977.'))
new = OIBTree()
[new.insert(*i) for i in head]
[new.insert(*i) for i in tail]

new is now a new OIBTree containing only the working buckets.

>>> check(new)
>>> len(new)
3752

Let's get this repaired BTree called new back into the database:

new._p_oid = p64(parent)
new._p_jar = app._p_jar
app._p_jar._register(new)
app._p_jar._added[new._p_oid] = new
import transaction
transaction.commit()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_manager.py", line 257, in commit
    return self.manager.commit()
  File "../rfasite/rfasite/cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_manager.py", line 134, in commit
    return self.get().commit()
  File "../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_transaction.py", line 282, in commit
    reraise(t, v, tb)
  File "../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_compat.py", line 49, in reraise 
    raise value
  File "../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_transaction.py", line 273, in commit
    self._commitResources()
  File "../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_transaction.py", line 456, in _commitResources
    reraise(t, v, tb)
  File "../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_compat.py", line 49, in reraise 
    raise value
  File "../cache/eggs/transaction-3.0.1-py3.8.egg/transaction/_transaction.py", line 430, in _commitResources
    rm.commit(self)
  File "/../cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 497, in commit
    self._commit(transaction)
  File "/../cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 546, in _commit
    self._store_objects(ObjectWriter(obj), transaction)
  File "/../cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 609, in _store_objects
    self._cache[oid] = obj
ValueError: A different object already has the same oid

And that's where I get stuck.... I'm not sure how to clear the cache or get past this one.

I tried a few things last night; I didn't document them all. I dove into the pickle cache and saw it has a C implementation too. It's hard to hack, and I'm not sure I should.

I know this didn't work:

from ZODB.Connection import resetCaches
resetCaches()
transaction.commit()

(same)

Path 3:

This second approach is interesting... especially if there are ready-made scripts to walk the database.

Lastly, back to 41977-20001002.html: I got the actual object rendering on view/edit from the browser, and I was able to delete it this morning from Plone. Still got some 500's on the containing folder, but maybe this is also progress, and a lot easier than bisecting btrees.

flipmcf via Plone Community wrote at 2022-9-1 16:19 +0000:

...

check(obj)
WARNING:relstorage:POSKeyError on oid 33268864: no tid found; Current transaction is 282226532437355741; Recent object tids:
ERROR:ZODB.Connection:Couldn't load state for BTrees.OIBTree.OIBucket 0x01fba480

Thus, you have lost a bucket (a tree leaf).
That's sad: a tree where only internal nodes are missing can
in many cases be reconstructed from the leafs (which may still be
accessible).

...
so, check it out, let's create an entirely new parent, without that value. I'm so close here....

from BTrees.OIBTree import OIBTree
head = list(obj.iteritems('-','410'))
tail = list(obj.iteritems('41977.'))
new = OIBTree()
[new.insert(*i) for i in head]
[new.insert(*i) for i in tail]

new is now a new OIBTree containing only the working buckets.

>>> check(new)
>>> len(new)
3752

Let's get this repaired BTree called new back into the database:

new._p_oid = p64(parent)
new._p_jar = app._p_jar
app._p_jar._register(new)
app._p_jar._added[new._p_oid] = new
import transaction
transaction.commit()
Traceback (most recent call last):
...
   self._store_objects(ObjectWriter(obj), transaction)
 File "/../cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 609, in _store_objects
   self._cache[oid] = obj
ValueError: A different object already has the same oid

And that's where I get stuck.... I'm not sure how to clear the cache or get past this one.

You can try a minimizeCache/cacheGC (or similarly named) on the connection.
It removes instances from the cache which are not marked as modified.

Alternatively, you can use low level cache API to remove the object from
the cache.
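
(For the record, the actual spellings on a ZODB Connection are cacheMinimize() and cacheGC(); the low-level route goes through the private pickle cache. A sketch, untested here:)

from ZODB.utils import p64
conn = app._p_jar
conn.cacheMinimize()                 # evicts every unmodified object from the pickle cache
# or, lower level, drop just the one cached object by oid:
conn._cache.invalidate(p64(parent))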

A more direct way may be: remember the partially recovered items,
then clear the broken tree and update from the saved items.
This assumes that "clear"ing is optimized (it simply overwrites the
top-level structure).

It is possible that you must perform this on the top level
tree, not the identified parent. Some tree nodes
have (externally) invisible references to other tree nodes
which may not survive the clearing.

from ZODB.utils import p64
parent = 33268696
obj = app._p_jar.get(p64(parent))
head = list(obj.iteritems('-', '410'))   # everything before the broken bucket
tail = list(obj.iteritems('41977.'))     # everything after it
newlist = head + tail
obj.clear()                              # resets the tree's internal structure in place
[obj.insert(*i) for i in newlist]
import transaction
transaction.commit()

Woah... it worked, I think. Here's the low-level test:

exit()

leave the debugger, back to shell.

./bin/instance debug
Starting debugger (the name "app" is bound to the top-level Zope object)
from ZODB.utils import p64
parent = 33268696
obj = app._p_jar.get(p64(parent))
[x for x in obj.values()]
[a, list, of, values, the, tree, is, iterable, without, POSKeyError]

um... wow...

The saga continues..

Yeah, blowing away whole buckets of a btree and recommitting them does get rid of POSKeyErrors. lol.

But key errors and attribute errors then crop up. Hopefully more fixable.

So, as I race the other team manually repairing from a 10-day old backup, my current logic is thus:

  1. to fix this the best way - we need to "Clear and Rebuild"
  2. Clear and Rebuild throws a POSKeyError on 0x073afa
  3. so, let's kill that reference.

(see above... bucket killer stuff)

So, now that I've re-committed the offending btree without the dangling reference, I run "Clear and Rebuild" and voila! No POSKeyError! Yay!

But Boo:

mysite/portal_catalog/manage_catalogRebuild
Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 371, in publish_module
  Module ZPublisher.WSGIPublisher, line 266, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module Products.PDBDebugMode.wsgi_runcall, line 60, in pdb_runcall
  Module Products.CMFPlone.CatalogTool, line 533, in manage_catalogRebuild
  Module Products.CMFPlone.CatalogTool, line 519, in clearFindAndRebuild
  Module OFS.FindSupport, line 173, in ZopeFindAndApply
  Module OFS.FindSupport, line 173, in ZopeFindAndApply
  Module OFS.FindSupport, line 118, in ZopeFindAndApply
  Module ZTUtils.Lazy, line 201, in __getitem__
  Module plone.folder.ordered, line 102, in <lambda>
  Module plone.folder.ordered, line 67, in _getOb
AttributeError: '41543-20000925.html'
> ../cache/eggs/plone.folder-3.1.0-py3.8.egg/plone/folder/ordered.py(67)_getOb()
-> raise AttributeError(e)

But maybe I can now recover from this without oids....

Time for dinner. I'll be back!

Lordy


Whatever your brains have, the both of you, I hope it's catching.

flipmcf via Plone Community wrote at 2022-9-1 22:55 +0000:

...

  1. to fix this the best way - we need to "Clear and Rebuild"
  2. Clear and Rebuild throws a POSKeyError on 0x073afa
  3. so, let's kill that reference.

You are aware that almost surely something should be at
the place where the POSKeyError is raised.
This implies that some objects will not be found when you
remove the dangling reference -- consequence: the rebuilt catalog
will not be consistent.

In your place, I would first try the second (alternative) catalog
rebuild approach.

...
So, now that I've re-committed the offending btree without the dangling reference, I run "Clear and Rebuild" and voila! No POSKeyError! Yay!

But Boo:

mysite/portal_catalog/manage_catalogRebuild
Traceback (innermost last):
...
 Module Products.CMFPlone.CatalogTool, line 519, in clearFindAndRebuild
...
 Module plone.folder.ordered, line 102, in <lambda>
 Module plone.folder.ordered, line 67, in _getOb
AttributeError: '41543-20000925.html'
> ../cache/eggs/plone.folder-3.1.0-py3.8.egg/plone/folder/ordered.py(67)_getOb()
-> raise AttributeError(e)

An exception you have previously associated with a POSKeyError.

Well (if you've read this far), last week we did end up restoring from a week-old backup taken just before the POSKeyErrors began occurring in the logs, then manually replacing about 200 objects with ZEXP export/import. Customer is happy.

But for some reason, I like pain and documentation, so I'm continuing the thread.

Can you define "not consistent"? I am aware that data is being lost.
My goal is to remove the 500 errors, then go back and replace missing data manually.

I'm also aware that I'm killing an entire bucket (a BTree Leaf) which is a list of references. How sure are you that 'check()' is reporting a lost bucket rather than a lost datum inside a bucket? I'm also worried that deleting an entire bucket might actually prune the tree. Do these buckets contain pointers to left and right buckets?

Rebuilding the catalog is the best approach, I would never argue with that. But I think in this very specific case, the catalog itself is corrupt, which is making that impossible.

I think we tried this, but it failed. If you could be a bit more specific on this approach, I would like to try it.

So, this is what I tried for the "second approach" which relies on this assumption. If you know of a different approach, please let me know.

I typically start here:

>>> catalog = app.PloneSite.portal_catalog
>>> catalog
<CatalogTool at /PloneSite/portal_catalog>

I attempted to walk the entire catalog by iterating
catalog.getAllBrains()
and this throws a POSKeyError eventually, in the actual iterator. See above. That's why we are talking about BTrees in the first place.

I also tried a lower level attempt:

>>> catalog._catalog
<Products.ZCatalog.Catalog.Catalog object at 0x7fd1c9ec8900 oid 0xb15 in <Connection at 7fd1cf2f74c0>>
>>> catalog._catalog.data
<BTrees.IOBTree.IOBTree object at 0x7fd1cb94f740 oid 0x17ad6c9 in <Connection at 7fd1cf2f74c0>>

>>> [x for x in catalog._catalog.data.iteritems()]

WARNING:relstorage:POSKeyError on oid 33278480: no tid found; Current transaction is 282270056828755268; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.IOBTree.IOBucket 0x01fbca10
Traceback (most recent call last):
  File "rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca10
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<console>", line 1, in <listcomp>
  File "/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca10

And I think this is the "alternative method" you are referring to.

I'm not trying to be intentionally dense here; I am actually trying all of the suggestions. As I said, I think this is a special kind of corruption I was blessed with.

You could try crystal meth.

But I think this is actually a symptom of ADHD - it's called 'hyperfocus' and it can affect your relationships with family and friends and cause complaints at work.

Actually, it's very similar to crystal meth in many respects, so I stand by my first suggestion.

flipmcf via Plone Community wrote at 2022-9-9 14:39 +0000:

...

Can you define "not consistent"?

"Consistent" means: the catalog content correctly (and completely)
reflects the content objects.

...
I'm also aware that I'm killing an entire bucket (a BTree Leaf) which is a list of references. How sure are you that 'check()' is reporting a lost bucket rather than a lost datum inside a bucket?

You got a POSKeyError for a bucket object;
if a value would have been lost, you would have gotten a POSKeyError
for a different class.

I'm also worried that deleting an entire bucket might actually prune the tree. Do these buckets contain pointers to left and right buckets?

They contain a pointer to the next bucket.

A (well populated; simplifications for small trees) BTree has
two access structures:

  1. a tree structure (used for locating keys)
  2. a leaf/bucket chain (pointer from the root to the first leaf;
    pointer from a leaf to the following leaf)
    (used for iteration)

The loss of a bucket can have two reasons:

  1. an internal node is corrupt - i.e. contains a wrong bucket reference
    in this case, no bucket needs to be actually lost;
    it may be that just the reference is wrong
  2. the references are correct but the bucket has been
    deleted (e.g. by a broken pack implementation or
    a bad deleteObject call from buggy "external garbage collection").

In the second case, both key access as well as iteration can be
broken.
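
(You can see that leaf chain from Python, assuming obj is the tree from above; a sketch using the implementation attribute _firstbucket and the bucket pickle state, whose second element - when present - is the next bucket:)

bucket = obj._firstbucket                # implementation attribute: first leaf in the chain
while bucket is not None:
    state = bucket.__getstate__()        # ((k1, v1, k2, v2, ...), next_bucket)
    bucket = state[1] if len(state) > 1 else None   # the last bucket has no successor

A dangling next pointer surfaces here as a POSKeyError when the missing bucket's state is loaded.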

Rebuilding the catalog is the best approach, I would never argue with that. But I think in this very specific case, the catalog itself is corrupt, which is making that impossible.

I explained that there are 2 different approaches
to rebuild the catalog. You tried one of them (and failed);
the other one might (or might not) succeed.

...
I attempted to walk the entire catalog by iterating
catalog.getAllBrains()
and this throws a POSKeyError eventually, in the actual iterator. See above. That's why we are talking about BTrees in the first place.

The second approach needs to iterate only over the (so called)
uids of the catalogued objects (which are actually object paths);
the getAllBrains likely iterates instead over the tree
representing the metadata.
One may succeed while the other one fails.

catalog._catalog
catalog._catalog.data

That's (almost surely) the metadata tree.

And I think this is the "alternative method" you are referring to.

No.

The catalog ZMI used to have a tab with 2 different
"clear and rebuild" actions. Those 2 actions represent
the 2 approaches.
It has been a long time since I visited this tab - something might have
changed in the meantime.
But the API methods corresponding to those actions should still be there.

Once again, I'm over-complicating things. Thanks for clarifying.

We have been doing the "Clear and Rebuild".
We're here now without a POSKeyError because of the kadosectomy (the surgical removal of a bucket) described above.

The "Update Catalog" action quickly results in a POSKeyError.

2022-09-09 13:16:24,138 WARNING [relstorage:557][waitress-6] POSKeyError on oid 33278481: no tid found; Current transaction is 282270056828755268; Recent object tids: []
2022-09-09 13:16:24,138 ERROR   [ZODB.Connection:38][waitress-6] Couldn't load state for BTrees.IOBTree.IOBucket 0x01fbca11
Traceback (most recent call last):
  File "/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca11
> rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py(587)load()
-> raise POSKeyError(oid)

----
portal_catalog/manage_catalogReindex
Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 371, in publish_module
  Module ZPublisher.WSGIPublisher, line 266, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module Products.PDBDebugMode.wsgi_runcall, line 60, in pdb_runcall
  Module Products.ZCatalog.ZCatalog, line 251, in manage_catalogReindex
  Module Products.PDBDebugMode.zcatalog, line 35, in refreshCatalog
  Module ZODB.Connection, line 791, in setstate
  Module perfmetrics._metric, line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  Module relstorage.storage, line 587, in load
ZODB.POSException.POSKeyError: 0x01fbca11
> rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py(587)load()
-> raise POSKeyError(oid)

Maybe resolving that error with a kadosectomy would help here too, probably resulting in a key or attribute error just as we see above.


Additionally, "Clear The Catalog" works, is very fast, and destroyed the site content. Objects can be found by direct url, but all folder listings are gone. 'Rebuild' now returns quickly and does nothing.

Fortunately, you can do a ZMI -> Database -> Undo, undo the 'portal_catalog/manage_catalogClear' transaction, and get back to a site that occasionally throws 500's.
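
(That rescue is scriptable too; a sketch, assuming the catalog clear is the most recent undoable transaction:)

db = app._p_jar.db()
log = db.undoLog(0, 20)        # recent transactions, as dicts with 'id', 'description', ...
for entry in log:
    print(entry['description'])
db.undo(log[0]['id'])          # queue the undo of the most recent transaction
import transaction
transaction.commit()           # the undo is applied at commit time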

Dear reader, if you're this far in this thread, just restore from backup and manually replace your content. This is a purely academic thread at this point.

flipmcf via Plone Community wrote at 2022-9-9 17:45 +0000:

...
The "Update Catalog" action quickly results in a POSKeyError.

portal_catalog/manage_catalogReindex
Traceback (innermost last):
 Module ZPublisher.WSGIPublisher, line 162, in transaction_pubevents
 Module ZPublisher.WSGIPublisher, line 371, in publish_module
 Module ZPublisher.WSGIPublisher, line 266, in publish
 Module ZPublisher.mapply, line 85, in mapply
 Module Products.PDBDebugMode.wsgi_runcall, line 60, in pdb_runcall
 Module Products.ZCatalog.ZCatalog, line 251, in manage_catalogReindex
 Module Products.PDBDebugMode.zcatalog, line 35, in refreshCatalog
 Module ZODB.Connection, line 791, in setstate
 Module perfmetrics._metric, line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
 Module relstorage.storage, line 587, in load
ZODB.POSException.POSKeyError: 0x01fbca11
> rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py(587)load()
-> raise POSKeyError(oid)

I have a variant of the above (likely, it will not work
but the traceback above could give a bit of hope).
The Catalog instance has 2 symmetric attributes: uids and paths
(or similarly spelled): one maps uids (actually object paths) to record ids
(that are integers), the other one reverses the map.
Each of those attributes allows to determine the objects paths for
the catalogued objects.
You could quickly check whether the approach has a chance
by calling list(attr.items()). If this does not produce a POSKeyError
for one of the attributes, you could proceed as follows:

  • Determine the list of objects paths from one of the attributes
    and save in a list (decoupled from the attribute)

  • Clear the catalog

  • Call catalog_object for each of the formerly catalogued objects (a sketch follows below)
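
(A minimal sketch of that procedure, assuming app from bin/instance debug and that one of the trees iterates cleanly:)

from zope.component.hooks import setSite
import transaction

site = app.PloneSite
setSite(site)                        # wire up the site's component registry
catalog = site.portal_catalog

# decouple the paths from the BTree before clearing anything
paths = list(catalog._catalog.paths.values())

catalog.manage_catalogClear()        # clears indexes and metadata
for path in paths:
    obj = app.unrestrictedTraverse(path, None)
    if obj is not None:              # skip entries whose objects no longer exist
        catalog.catalog_object(obj)
transaction.commit()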

It's a good approach if any of these iterables can be iterated, saved, and restored. Here's the results:

>>> catalog = app.PloneSite.portal_catalog
>>> catalog._catalog
<Products.ZCatalog.Catalog.Catalog object ...
>>> catalog._catalog.uids
<BTrees.OIBTree.OIBTree object at 0x7f83e85e15c0 oid 0x17ad6ca in <Connection at 7f83ed686700>>
>>> catalog._catalog.paths
<BTrees.IOBTree.IOBTree object at 0x7f83e85e17c0 oid 0x17ad6cb in <Connection at 7f83ed686700>>

Is this what you are referring to?

>>> list(catalog._catalog.uids.items())
WARNING:relstorage:POSKeyError on oid 25090606: no tid found; Current transaction is 282300381144058965; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.OIBTree.OIBucket 0x017eda2e
Traceback (most recent call last):
  File "rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 72, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x017eda2e
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 72, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x017eda2e

uids is broken... :frowning_face:

>>> list(catalog._catalog.paths.items())
WARNING:relstorage:POSKeyError on oid 33278481: no tid found; Current transaction is 282300381144058965; Recent object tids: []
ERROR:ZODB.Connection:Couldn't load state for BTrees.IOBTree.IOBucket 0x01fbca11
Traceback (most recent call last):
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca11
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/ZODB-5.6.0-py3.8.egg/ZODB/Connection.py", line 791, in setstate
    p, serial = self._storage.load(oid)
  File "src/perfmetrics/metric.py", line 66, in perfmetrics._metric._AbstractMetricImpl.__call__
  File "/home/rfaweb1/develop/plone24/rfasite/rfasite/cache/eggs/RelStorage-2.1.1-py3.8-linux-x86_64.egg/relstorage/storage.py", line 587, in load
    raise POSKeyError(oid)
ZODB.POSException.POSKeyError: 0x01fbca11

:sob:

And this is after a few btree kadosectomies, I didn't restore to the original broken database.

0x017eda2e is a new oid I hadn't seen until iterating the uids.

flipmcf via Plone Community wrote at 2022-9-14 15:52 +0000:

It's a good approach if any of these iterables can be iterated, saved, and restored. Here's the results:

...
>>> catalog._catalog.uids
<BTrees.OIBTree.OIBTree object at 0x7f83e85e15c0 oid 0x17ad6ca in <Connection at 7f83ed686700>>
>>> catalog._catalog.paths
<BTrees.IOBTree.IOBTree object at 0x7f83e85e17c0 oid 0x17ad6cb in <Connection at 7f83ed686700>>

Is this what you are referring to?

It is.

...
And this is after a few btree kadosectomies, I didn't restore to the original broken database.

0x017eda2e is a new oid I haven't seen until iterating the uids.

Sad!

But it demonstrates: you do not have an isolated POSKeyError problem
but a massive one. It does not affect a single data structure
but several. Something went considerably wrong.

Yes. A very brief analysis suggests this was a Mosaic layout customization by a user, and we don't have details on what actually occurred. The layout customization went badly and they wanted to revert. This was likely at /korean (subsite / folderish type) or /korean/in-focus (section / folderish type). An administrator visited ZMI -> Database -> Undo and did an undo on the layout customization. Then the corruption began. It was first noticed on the search page nearly a week later, when searching for specific keywords in specific indexes (the Author field). Reviewing the logs shows that the POSKeyErrors began shortly after the ZMI undo, but did not create a user-visible error until a week later. There are suspicions that the corruption spread over time, but that's not confirmed.

There are probably more details, like exactly what the Mosaic layout issue was, what the undo covered (the admin was not experienced in the nuances of ZMI undo), and whether the admin and user were simultaneously working to resolve the issue.

It appears, when following some of these oids up the parent relationships, that they are Mosaic tiles/blocks that list content.

Again, this was already restored with a backup of the relstorage database (mysqldump or similar - always keep backups!), but it's an interesting corruption to play with on the bench.

The missing content item with a URL of /korean/in-focus/41543-20000925.html is also quite interesting. After the surgery I did above, I can @@search for '41543-20000925.html' and get a result, probably from the SearchableText index, but the URL is a 404. Additionally, I can tell this is content from 2000-09-25, unless the user intentionally altered that URL, which is unlikely. This is extremely old content that actually pre-dates Plone: it was imported from another CMS (Bricolage), then survived a Plone 2 -> Plone 4 migration and a Plone 4 -> 5 migration, which is quite an achievement considering the original CMS was not UTF-8 but ISO-2022-KR encoded Korean. Thank @aclark for that work!

The content was restored to production OK, though: https://www.rfa.org/korean/in_focus/41543-20000925.html