Catalog errors in ZRS replica

Hi,

I use ZRS to replicate a 4.3.4 site using zrs=2.4.4 over a VPN. In the replica's catalog, most of the records are correct but some are broken - if I copy the path index from a record in portal_catalog in the ZMI and paste it into the browser (with the necessary domain changes), I get Page Not Found. I ran a Clear and Rebuild on the master but it didn't fix the problem on the replica. Watching the tcpdump of the replication port does show activity so I presume the rebuilt catalog does get replicated. I could maybe restart the replica in R/W mode and rebuild the catalog but that would defeat the purpose of replication. Anyone seen this before?

Cheers
Mike

I've seen replication errors with older versions of ZRS, before it got open sourced.
Unfortunately I don't remember the details. IIRC I had problems with unpickling, so my problem was a different one.

You did not say whether you compared the catalog entries on production and test.
I also suggest you write a script to count the errors, to see whether the number increases over time. This helped us realise that our problem was happening in the synchronisation, because the number of errors increased slowly.
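
A rough sketch of what the tracking part of such a script could look like, assuming you already have a total and a broken count from somewhere. The function names and the log file are made up for illustration, not taken from our actual setup:

```python
import time


def append_error_count(log_path, total, broken):
    """Append one line 'timestamp,total,broken' to a plain text log."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(log_path, "a") as fh:
        fh.write("%s,%d,%d\n" % (stamp, total, broken))


def is_increasing(log_path):
    """Return True if the latest broken count is higher than the first one."""
    counts = []
    with open(log_path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                # Third field is the number of broken records.
                counts.append(int(line.split(",")[2]))
    return len(counts) >= 2 and counts[-1] > counts[0]
```

Run it from cron on both master and replica; if the count on the replica creeps up while the master stays clean, the problem is most likely in the replication itself.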

Thanks for the good idea, Patrick, I'll create that script to track the number of problem records over time. Regarding the test/production comparison, I assume you mean between master and slave: I have found no issues on the master as yet but I'll use that script to check the entire catalog on both instances.

Yes, master/slave not prod/test :slight_smile:

Update: I've created a script that calls getObject on all the brains in the replica and I see just under 10% of the objects are missing. Now I'm looking for a way to 'touch' these objects on the master that will activate ZRS to 'restore' the missing objects. Running convert on a pdf file (I'm using c.documentviewer) worked but not all missing objects are pdfs. Renaming object IDs also worked but I want my IDs to remain as is - I guess I could rename them twice (add a prefix and then remove it again afterwards). Any other ideas?
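
For anyone wanting to run the same kind of check, here is a minimal sketch of the idea, not my exact script. It assumes brains with the usual getObject() and getPath() methods; the function name and the site id "Plone" in the usage comment are placeholders:

```python
def find_broken_brains(brains):
    """Return (total, broken_paths) for an iterable of catalog brains.

    A brain counts as broken if getObject() raises or returns None,
    since the behaviour differs between catalog versions.
    """
    total = 0
    broken = []
    for brain in brains:
        total += 1
        try:
            obj = brain.getObject()
        except Exception:
            obj = None
        if obj is None:
            broken.append(brain.getPath())
    return total, broken


# Example usage from a debug prompt or a "bin/instance run" script,
# where "app" is the Zope root and "Plone" is an assumed site id:
#
#   catalog = app.Plone.portal_catalog
#   total, broken = find_broken_brains(catalog.unrestrictedSearchResults())
#   print("%d of %d brains are broken" % (len(broken), total))
```

Running it on both master and replica gives two lists of paths that can be diffed directly.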

FWIW, I've been using ZRS for 5 years, under heavy load, over VPNs, etc and never have had any problems like this.

Good to know, Nathan. That means it's more likely to be a problem in my code on those instances. BTW, do you have any replicas that are accessible by read only users?

Yes, that's how we use it. Public users only have access to our read-only replicas.

vangheem, how did you make Plone start with read only replicas? Last I tried, it wouldn't even start up.

This worked for me http://stackoverflow.com/questions/18742513/plone-switching-to-zrs-using-plone-recipe-zeoserver-on-plone-4-3-1

a great feature you can add is this one:

zeo-client-read-only-fallback = on

That tells your instances to fall back to read-only mode if they can't write to the database, so if the master ZRS fails they can connect to the slave without failures.

I was checking yesterday why we can't upgrade to zc.zrs 2.5.x and saw that Jim changed the dependency to ZODB instead of ZODB3 (without adding a changelog entry, BTW).

Is Zope 2 strongly tied to ZODB3? Would a migration be possible?

I will try writing a blog post about this some time and make sure it's documented well in docs.plone.org.

You also need to set enable-product-installation = off in your https://pypi.python.org/pypi/plone.recipe.zope2instance config and use something like https://pypi.python.org/pypi/wildcard.readonly/1.0 to handle potential write-on-read problems with your read-only clients.
