Need some help to avoid ConflictError

jugmac00 · May 21, 2019, 8:52am

Setting

a software license management system based on Zope 2.13.29 / Python 2.7.15 / ZEO / ZODB3 = 3.10.5
a one-off script which upgrades roughly 2000 software licenses, creates delivery notes, updates company data, creates build infos (for the software) and packs everything into a zip file (called zil) for further processing
the script gets triggered via the browser

Test run on dev machine
On my dev machine the script ran without a problem.

first test run on staging
On staging the script ran to completion, then hit a ConflictError at the very end and started again right from the beginning (one request = one transaction).

2019-05-17T15:34:34 INFO ZPublisher.Conflict ConflictError at /CompanyCenter/perform_mass_delivery: database conflict error (oid 0x6c8a1d, class BTrees.OOBTree.OOBucket, serial this txn started with 0x03cfc5fb1c25c788 2019-05-17 12:43:06.597088, serial currently committed 0x03cfc6232ad84677 2019-05-17 13:23:10.041756) (1 conflicts (0 unresolved) since startup at Thu May 16 11:24:35 2019)

second test run on staging
For the next test run, I put a transaction.commit() after each license upgrade.

After about half of the licenses, another ConflictError occurred, but this time the same license got retried, worked, and then the rest of the licenses got upgraded as well.

At the same time the ConflictError occurred, a long-running cron job was also triggered via XML-RPC. It does not act on licenses but on calendar data, so I am unsure whether this caused the problem.

2019-05-17T17:03:11 INFO ZPublisher.Conflict ConflictError at /CompanyCenter/perform_mass_delivery: database conflict error (oid 0x6cc634, class BTrees.OOBTree.OOBucket, serial this txn started with 0x03cfc68726d9b288 2019-05-17 15:03:09.105558, serial currently committed 0x03cfc687300f5311 2019-05-17 15:03:11.264030) (5 conflicts (1 unresolved) since startup at Fri May 17 16:35:08 2019)

When I looked up the oid, I got a BTrees.OOBTree.OOBucket. Without deeper knowledge, I had at first expected to get a business object like a license or a company or ... which had been written to twice.

The bucket contains...

[('C2WGKES3NZBI', 'DeliveryNotePDF'), ('C2WGKES4VDVZ', 'BillingPdfPart'), ('C2WGKEWEYHYI', 'Licence'), ('C2WGKEWF6UNI', 'LicenseProfile'), ('C2WGKEWGVXYG', 'DeliveryNotePDF'), ('C2WGKEWIHBW4', 'BillingPdfPart'), ('C2WGKFA26DPJ', 'BillingPdfPart'), ('C2WGKFAY22TM', 'Licence'), ('C2WGKFAZMYZM', 'LicenseProfile'), ('C2WGKFAZVBHH', 'DeliveryNotePDF'), ('C2WGKFEFYQCI', 'Licence'), ('C2WGKFEGMMVU', 'LicenseProfile'), ('C2WGKFEGUFEK', 'DeliveryNotePDF'), ('C2WGKFEIFTFG', 'BillingPdfPart'), ('C2WGKFIS5NHJ', 'Licence')]

The first element (index 0) of each tuple is a custom identifier.
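For readers who want to repeat this kind of lookup, here is a minimal sketch of turning the hex oid from the log line into the 8-byte form that ZODB uses internally. It only uses the standard library. The helper name `oid_from_log` is made up for illustration, and the commented-out lookup assumes an open connection reachable as `app._p_jar`, for example in a Zope debug session.

```python
import struct


def oid_from_log(hex_oid):
    """Convert an oid as printed in the log (e.g. '0x6cc634') into the
    8-byte big-endian string ZODB uses (same packing as ZODB.utils.p64)."""
    return struct.pack(">Q", int(hex_oid, 16))


oid = oid_from_log("0x6cc634")

# Assumption: with an open ZODB connection available as app._p_jar,
# the conflicting object could then be loaded and inspected with:
#   bucket = app._p_jar.get(oid)
#   print(list(bucket.items()))
```

The result of such a lookup is whatever persistent object was written, which for objects stored in a container is often a BTree bucket rather than the business object itself.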

one off script

    def perform_mass_delivery(self):
        """This method gets triggered by the browser."""
        ziller_log.info("beginning zil generation")
        licenses = self.db.Licenses.get_licenses_for_update()
        self._process_all_licenses(licenses)
        return "Success!"

    def _process_all_licenses(self, licenses):
        ziller_log.info("licenses to be updated: %s" % len(licenses))
        for i, license in enumerate(licenses):
            ziller_log.info("about updating license no %s of %s" % (i, len(licenses)))
            self._try_zil_generation(license)
            # best place for transaction.commit?

    @staticmethod
    def _try_zil_generation(license):
        try:
            # a lot is going on in generate_zil_for_initial_delivery
            # including the zip file generation - which is not covered by the transaction
            file_name, successor = license.generate_zil_for_initial_delivery()
        except Exception:
            ziller_log.error("license: " + license.getId(), exc_info=True)
        else:
            ziller_log.info("license: " + license.getId() + " successfully updated")
            ziller_log.info("location of ziller file: %s" % file_name)
            ziller_log.info("successor: %s" % successor.getId())

Many questions.....

Why did the ConflictError in the first test run happen exactly at the end of the run? Coincidence?
Why do I get an OOBucket from an oid and not a business object - like a single license?
What exactly does the ConflictError mean? Problem when writing to the bucket or writing to a single business object?
Is it common that a bucket contains so many objects? I always thought a good "hash table" contains one or zero values for a given key.
Where exactly did the ConflictError occur? Unfortunately there is no line number in the traceback.
When I look at my code again, I also think that "except Exception" is a bad decision. Should I at least catch ConflictError and re-raise?
If the except Exception had caught the above ConflictError, then the log message would have to start with "license ..." - but it does not.

The ConflictError gets logged as "INFO" - which line of my code triggered that message?

When a ConflictError occurs the transaction gets rolled back - but the created zil/zip file is still on disk. What is the best way to clean it up? In the except block?
When I redo the 2nd test run under the same conditions, will the ConflictError occur at the very same license, or is this non-deterministic?
Why does it read "1 unresolved" at the second test run, when the license upgrade indeed got retried and finally resolved?
Why do ConflictErrors even occur when I am the only user?

For the production license upgrade, I plan to:

add a transaction.commit after each license upgrade
deactivate all cron jobs
deactivate nginx/haproxy, so I am the only user of the application (via lynx)
...

... any other hints/tips/improvement suggestions for the above code or how to proceed?

Thank you very much for your help!
Jürgen

rafaelbco · May 21, 2019, 6:51pm

I suggest you try to ask about this on the ZODB mailing list.

What you posted is not off-topic, but I think you'll have more success finding a solution there.

djay · May 22, 2019, 7:39am

@jugmac00 you asked too many questions for me to answer specifically, but the gist is this: the longer a request takes (i.e. the more you do) and the more of the data it touches is the same as what some other request touches at a similar time, the higher the chance of a conflict. But there is lots of Zope documentation on this and you should really read up on it, as ConflictErrors are very well explained in the documentation.

Best way of solving this is to do as little as possible in the request and make it as short as possible. I had a similar situation. Don't do PDF generation, or generation of anything, in the same transaction if the workflow allows you to hand it off to a task queue like c.taskqueue or p.a.async and then email or store the result there. This not only reduces conflicts but also allows your app to scale much further, and it is the only way to handle a huge influx of requests.
You can also use task queues to break your big batch process into lots of little ones with some kind of combining step at the end.

2nd best way/hack is to put a write queue into your load balancer, like haproxy. The easiest way is to identify all POST requests and route them to a backend with far fewer instances, like 1. That means your write requests will happen sequentially and be much slower, but you won't get write conflicts. You can improve this by using clever load balancing so that only similar POST requests/writes are serialised, such as using sticky cookies or a path-based balancing policy.

Another solution is to be careful about what you are writing to and reading from. Avoid the catalog where possible, as it's a central place where writes occur, so it increases the chance of a conflict. Chances are you have counters or such to provide unique ids. Use UUIDs to provide uniqueness instead of some central counter or data structure.
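As a rough illustration of that last point, assuming the identifiers only need to be unique and not sequential: Python's standard `uuid` module generates ids without reading or writing any shared persistent object, so id creation itself cannot be a source of conflicts. The function name `new_identifier` is hypothetical.

```python
import uuid


def new_identifier():
    """Return a random, practically unique id without touching shared state."""
    return uuid.uuid4().hex


first = new_identifier()
second = new_identifier()
```

Compare this with a counter stored on a single persistent object, where every request that needs a new id has to write to that same object.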

jugmac00 · May 23, 2019, 3:07pm

I suggest you try to ask about this on the ZODB mailing list.

Thank you very much - I did not know this mailing list, and I was surprised that it seems to be quite active, as opposed to the many dead Zope mailing lists.

I cross posted my question(s) there and will link it here once my message gets approved.

jugmac00 · May 23, 2019, 3:24pm

you asked too many questions for me to answer specifically

The reason behind the many questions is that I do not want a solution only, but I try to really understand what is going on.

Thanks for taking the time to give me some tips!

The longer a request takes (i.e. the more you do) and the more of the data it touches is the same as what some other request touches at a similar time, the higher the chance of a conflict.

In my setup there should be no other requests - well, except for the cron job which I forgot to deactivate.

But there is lots of Zope documentation on this and you should really read up on it, as ConflictErrors are very well explained in the documentation.

I certainly tried to research the problem and found several entries at StackOverflow or old mailing list archives, but most of them mentioned either those parallel requests or especially problems with sessions, which do not apply here.

I did not find much documentation about conflict errors - could you point me to some relevant documents?

Some documentation proposes implementing _p_resolveConflict, but I am already having a hard time identifying the problematic object, as the log only shows the bucket where the ConflictError occurs.

Best way of solving this is to do as little as possible in the request and make it as short as possible.

Yes but... I have to update like 2000 licenses and create the PDFs for them. I can't tell my colleagues to hit the upgrade button 2000 times

I had a similar situation. Don't do PDF generation, or generation of anything, in the same transaction if the workflow allows you to hand it off to a task queue like c.taskqueue or p.a.async and then email or store the result there.

This sounds quite complicated and I still do not know whether this would help in my case - as there should be no concurrent thread.

2nd best way/hack is to put a write queue into your load balancer, like haproxy.

This most certainly won't help in my special case, as there are no parallel requests coming in.

I will even shut down HAProxy and Nginx to make absolutely sure I am the only one creating a request on the Zope server, via curl or lynx.

Another solution is to be careful about what you are writing to and reading from. Avoid the catalog where possible, as it's a central place where writes occur, so it increases the chance of a conflict. Chances are you have counters or such to provide unique ids. Use UUIDs to provide uniqueness instead of some central counter or data structure.

Thanks - that is a good hint. But my setup consists of dozens of small catalogs (every business object has one) ... hm, I could patch the indexer to not do anything if all other means fail.

Thank you again for your thoughts!

There is still plenty of time until I have to perform the mass upgrade, so I am sure I'll find a solution.

djay · May 24, 2019, 3:42am

There are other requests, otherwise you would not be getting conflict errors. Are you saying you deactivate your whole website when doing this batch job? If something is running, it's possible that it's reading something that your batch job is writing to... like the central catalog. That can cause a conflict if your batch is taking a long time and the batch is doing catalog queries or writes.

Honestly, the best way to resolve that is to use c.taskqueue or p.a.async and divide your batch into small chunks where each transaction lasts less time, like 10 per job.
You can also do the same thing by creating a script you run via bin/instance run or by explicitly using transaction.commit() in your code. (Note: using subtransactions doesn't help in this case, even though it might appear that it could; it only helps reduce memory usage.)

So in summary: either work out how to write to or read from less stuff in your batch, or break your batch up into smaller chunks.
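The chunking idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `chunked` and `process_in_chunks` are hypothetical helper names, and the `process` and `commit` callables are passed in so the sketch runs without Zope. In a real script, `commit` would presumably be `transaction.commit`.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(items, process, commit, size=10):
    """Process items in small batches, committing after each batch so that
    every transaction stays short."""
    for chunk in chunked(items, size):
        for item in chunk:
            process(item)
        commit()  # assumption: transaction.commit in a Zope script
```

With `size=1` this degenerates to one commit per item, which is the variant discussed in this thread.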

jugmac00 · May 29, 2019, 10:14am

Thanks again for taking your time!

There are other requests, otherwise you would not be getting conflict errors.

I forgot about the cronjobs which hit the server via XML-RPC - but I will also deactivate them when I do the batch job.

Are you saying you deactivate your whole website when doing this batch job?

Yes. I will deactivate Nginx, so no external call can interfere with the batch job, and also deactivate the cron jobs. I will hit the Zope worker directly with lynx.

This is a one time batch job, and I just do it on a Sunday, when nobody works (the application is only used by co-workers).

To keep things simple, I think I will try the transaction.commit() route once more. I already did it, but still had this one ConflictError - but maybe the cron job interfered. I will do another test run next week.

jugmac00 · July 8, 2019, 8:27am

Some quick feedback to close this discussion.

I did the mass update on Friday night (I was not allowed to do it on Sunday because of the working hours act), and it worked like a charm - thanks again for your feedback.

I did the following:

added a transaction.commit after each update
added an except ConflictError clause that re-raises the error
shut down Nginx so no concurrent requests could hit the server
temporarily deactivated the cron jobs which also act on the db
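The first two points could look roughly like the sketch below. It is not the actual production code: the function name and the logger name are hypothetical, and `upgrade`, `commit` and `conflict_error` are passed in as parameters so the example runs without Zope. In the real script they would presumably be the license upgrade call, `transaction.commit` and `ZODB.POSException.ConflictError`.

```python
import logging

log = logging.getLogger("ziller")  # hypothetical logger name


def process_all_licenses(licenses, upgrade, commit, conflict_error):
    """Upgrade each license in its own transaction."""
    for lic in licenses:
        try:
            upgrade(lic)
            commit()  # one commit per license keeps each transaction short
        except conflict_error:
            # Do not swallow conflicts: re-raise so the request can be retried.
            raise
        except Exception:
            # Any other failure is logged and the loop moves on.
            log.error("license: %s", lic, exc_info=True)
```

The `except conflict_error` clause has to come before `except Exception`, because otherwise the broad handler would catch the conflict first and only log it.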

tkimnguyen · July 8, 2019, 3:34pm

We (Plone) are happy to host discussion about Zope here (hence the new Zope tag)