RelStorage (with PostgreSQL) slowness, how to debug it

Hi all :wave:t4:

A few days ago we migrated our website from ZEO (Data.fs) to RelStorage (with PostgreSQL). The migration went fine, but now we are noticing that the website is almost 2x slower than it used to be :confused:

The configuration in zope.conf looks like this:

<relstorage>
    keep-history false
    blob-dir var/blobstorage
    blob-cache-size 2gb
    cache-local-mb 200
    cache-local-dir var/relstorage-cache
    <postgresql>
        dsn dbname='datafs' user='relstorage' host='aa.bb.cc.dd' password='xxx'
    </postgresql>
</relstorage>

I have an assorted list of findings but no conclusions at all so far (that's why I'm here :sweat_smile:).

py-spy

I used to inspect the website's performance with py-spy, but when I started doing the same now with RelStorage in the mix, I kept getting warnings like this from py-spy:

py-spy> 36.69s behind in sampling, results may be inaccurate. Try reducing the sampling rate

Interestingly enough, making a request while py-spy is sampling takes ages (minutes); stopping py-spy makes the pages load much faster.

It looks like py-spy is somehow blocking the process? :thinking:

If I have enough patience and wait for a request to finish, the flamegraph, oh surprise, shows that RelStorage code paths are the ones that take the most samples...
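
For the next round of profiling I will try lowering the sampling rate (as the warning itself suggests) and py-spy's non-blocking mode, which samples without pausing the target process. Something along these lines (the PID and output file name are just placeholders for my setup):

$ py-spy record --pid 12345 --rate 10 --nonblocking -o relstorage-profile.svg

--nonblocking trades some sampling accuracy for not pausing the Zope process while samples are collected, so the profiling itself should no longer make the requests crawl.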

Cython?

Then I thought that, given that RelStorage uses Cython to optimize its code, we might not be benefiting from that and instead getting the plain Python version... but it turns out zc.buildout/pip installs a wheel:

Fetching RelStorage 4.0.0a1 from: https://files.pythonhosted.org/packages/9d/26/a49cb7df2be4697ac0b67a7892e06cef8a87b133cf95e4c48e7191faebc7/RelStorage-4.0.0a1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl#sha256=b4f2d4e967c02bfba6370aec38ad4cec86dc89daaa24cc12491a1c9f1a4d03a5

Here I was testing the alpha release just to check that a wheel is indeed downloaded; in fact we are using RelStorage==3.5.0 with psycopg2-binary = 2.9.8, and the PostgreSQL server is at version 15.4.
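
A rough way to double-check this (the exact paths depend on your virtualenv/buildout layout) is to list the compiled extension modules that shipped inside the installed package; if the wheel's Cython speedups are in place, there should be some .so files under the relstorage package directory:

$ find "$(python -c 'import os, relstorage; print(os.path.dirname(relstorage.__file__))')" -name '*.so'

If that comes back empty, the pure-Python fallback is most likely what is running.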

Persistent caching

In the configuration we set cache-local-dir var/relstorage-cache so that the cache is persisted across restarts. Well, it turns out that it is not, at least for us :frowning_face:

$ ls -lh var/relstorage-cache/
-rw-r--r-- 1 s-derfreitag service  16K Nov 10 14:58 relstorage-cache2-datafs.sqlite3

16 KB?!?! Am I missing something needed to get the persistent caching to work? :thinking:

I filed an upstream report for this.
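
For what it's worth, the cache file is just an SQLite database, so it can be inspected with the sqlite3 CLI to see whether anything gets written at all (the schema is an implementation detail of RelStorage, so I'm only listing it here, not relying on specific table names):

$ sqlite3 var/relstorage-cache/relstorage-cache2-datafs.sqlite3 '.tables'
$ sqlite3 var/relstorage-cache/relstorage-cache2-datafs.sqlite3 '.schema'

A file that stays at 16 KB across restarts would suggest that nothing beyond the initial schema ever gets persisted.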

Client Caching Tracing

I had a look at the cache tracing functionality and it is working; a 90%+ hit rate does not sound that bad:

                   loads    hits  inv(h)  writes hitrate
Nov 10 14:58-58       0       0       0       0     n/a
Nov 10 14:58:59 ==================== Restart ====================
Nov 10 14:58-59   14948       6       0   14942    0.0%
Nov 10 15:00-14  587030  341392       0  245641   58.2%
Nov 10 15:15-29  566485  553804       0   12875   97.8%
Nov 10 15:30-44  557521  518122       0   39837   92.9%
Nov 10 15:45-59  514945  506393       0    8772   98.3%
Nov 10 16:00-14  426124  385536       0   40673   90.5%
Nov 10 16:15-29  430301  422432       0    7900   98.2%
Nov 10 16:30-44  554803  523956       0   30850   94.4%
Nov 10 16:45-59  207864  202747       0    5126   97.5%
Nov 10 17:00-14  574693  530792       0   43904   92.4%
Nov 10 17:15-29  946277  924465       0   21922   97.7%
Nov 10 17:30-44  456647  440533       0   16114   96.5%
Nov 10 17:45-48  183321  173959       0    9365   94.9%

Read 6,518,881 trace records (221,640,602 bytes) in 28.7 seconds
Versions:   0 records used a version
First time: Fri Nov 10 14:58:59 2023
Last time:  Fri Nov 10 17:48:47 2023
Duration:   10,188 seconds
Data recs:  6,022,058 (92.4%), average size 442 bytes
Hit rate:   91.7% (load hits / loads)

        Count Code Function (action)
            1  00  _setup_trace (initialization)
      496,822  20  load (miss)
    5,524,137  22  load (hit)
      497,921  52  store (current, non-version)

So the RAM cache does work (even if the persistent one does not), yet the site is still noticeably slower, and we are using the history-free storage because we read that it is faster than the history-preserving one :person_shrugging:t4:

I'm a bit clueless; does any kind soul have any input/ideas to try? :pray:t4: :bowing_man:t4:

We are not in production yet, but here are some findings from my local setup: on restart (after deleting the sqlite cache file) it instantly grows to this:
841472 Nov 12 23:51 relstorage-cache2-easymirror_gastro.sqlite3

So maybe you should investigate further there.

Thanks, I was suspecting as much: we have something in our own add-ons, or in the third-party ones we integrate, that does not play nice with the database...

Fortunately, we found one of the causes of our website's slowness: zodb-cache-size.

We had it set to 70k objects, but our website has 12M+ objects stored, so it was, by any measure, way too small.

Doing some simple A/B testing with that setting on different Plone workers (150k vs. 300k) already shows a noticeable improvement without using too much RAM (still below 3 GB of used + cached data).

We will continue to bump that number and see how much better it gets.
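
For reference, this is roughly what we are experimenting with in the buildout part that generates the instances (assuming plone.recipe.zope2instance; the part name and the value are just our current test setup, not a recommendation):

[instance]
recipe = plone.recipe.zope2instance
zodb-cache-size = 300000

Keep in mind that this cache is per ZODB connection, so the memory cost multiplies with the number of connections/threads each worker keeps open.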

Then we will have to have a serious conversation with our add-ons about database connection closing :sweat_smile:
