Deadlock issue with solr and zodb and high volume writing

Hi,

I wonder if over people have also stumbled over this in the past and have some additional insights.
We perform a large weekly data upgrade that gets broken down into many small work packages coordinated with rabbitmq.
Sometimes, this process gets stuck.
Today I noticed that if I have two workers on different servers that are both stuck, I can unstuck one worker by killing the other.
Getting a stacktrace reveals this:

Aug 06 18:08:47 server docker[10250]: End of dump
Aug 06 18:08:47 server docker[10250]: waiter.acquire()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 340, in wait
Aug 06 18:08:47 server docker[10250]: self.replies_cond.wait()
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 791, in wait
Aug 06 18:08:47 server docker[10250]: r_args = self.wait(msgid)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 763, in call
Aug 06 18:08:47 server docker[10250]: return self.rpc.call('vote', trans_id)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZEO/ServerStub.py", line 249, in vote
Aug 06 18:08:47 server docker[10250]: self._server.vote(id(txn))
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZEO/ClientStorage.py", line 1097, in tpc_vote
Aug 06 18:08:47 server docker[10250]: s = vote(transaction)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 781, in tpc_vote
Aug 06 18:08:47 server docker[10250]: rm.tpc_vote(self)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 446, in _commitResources
Aug 06 18:08:47 server docker[10250]: self._commitResources()
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
Aug 06 18:08:47 server docker[10250]: return self.get().commit()
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
Aug 06 18:08:47 server docker[10250]: transaction.commit()
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/Zope2-2.13.24-py2.7.egg/Zope2/App/startup.py", line 301, in commit
Aug 06 18:08:47 server docker[10250]: transactions_manager.commit()
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 146, in publish
Aug 06 18:08:47 server docker[10250]: response = publish(request, module_name, after_list, debug=debug)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
Aug 06 18:08:47 server docker[10250]: environ, debug, request, response)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
Aug 06 18:08:47 server docker[10250]: response=b)
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
Aug 06 18:08:47 server docker[10250]: Thread 140204941870848 (GET /VirtualHostBase/http/c47d375fea8f:80/Plone/VirtualHostRoot/@@zamqp-consumer/conn_structure/(default)/zlag.pdb.structure/5cbe9234577d44c5869de5dffae3b225):
Aug 06 18:08:47 server docker[10250]: r, w, e = select.select(r, w, e, client_timeout())
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZEO/zrpc/client.py", line 52, in client_loop
Aug 06 18:08:47 server docker[10250]: self.__target(*self.__args, **self.__kwargs)
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 757, in run
Aug 06 18:08:47 server docker[10250]: self.run()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
Aug 06 18:08:47 server docker[10250]: self.__bootstrap_inner()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 777, in __bootstrap
Aug 06 18:08:47 server docker[10250]: Thread 140204950263552:
Aug 06 18:08:47 server docker[10250]: r, w, e = select.select(r, w, e, client_timeout())
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZEO/zrpc/client.py", line 52, in client_loop
Aug 06 18:08:47 server docker[10250]: self.__target(*self.__args, **self.__kwargs)
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 757, in run
Aug 06 18:08:47 server docker[10250]: self.run()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
Aug 06 18:08:47 server docker[10250]: self.__bootstrap_inner()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 777, in __bootstrap
Aug 06 18:08:47 server docker[10250]: Thread 140204672300800:
Aug 06 18:08:47 server docker[10250]: waiter.acquire()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 340, in wait
Aug 06 18:08:47 server docker[10250]: self.not_empty.wait()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/Queue.py", line 168, in get
Aug 06 18:08:47 server docker[10250]: record = self._queue.get()
Aug 06 18:08:47 server docker[10250]: File "/opt/app-root/src/eggs/raven-5.11.1-py2.7.egg/raven/transport/threaded.py", line 147, in _target
Aug 06 18:08:47 server docker[10250]: self.__target(*self.__args, **self.__kwargs)
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 757, in run
Aug 06 18:08:47 server docker[10250]: self.run()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 804, in __bootstrap_inner
Aug 06 18:08:47 server docker[10250]: self.__bootstrap_inner()
Aug 06 18:08:47 server docker[10250]: File "/opt/rh/python27/root/usr/lib64/python2.7/threading.py", line 777, in __bootstrap
Aug 06 18:08:47 server docker[10250]: Thread 140204663908096:
Aug 06 18:08:47 server docker[10250]: Threads traceback dump at 2018-08-06 16:08:47

It looks like the request has been handled and during committing the transaction, we get into a deadlock that stays there indefinitely.
We use solr, so I am not sure we really have a problem with the zeo server or with solr in that TPC commit thing.
I can reproduce this sometimes.