How to warm up caches

Current setup:
1 VM with Nginx + HAProxy
1 VM with Zope
1 VM with ZODB

After a deployment I manually click through the data-intensive pages to warm up the caches.

The internal app is very low traffic (50 users), but nevertheless quite important.

I plan to set up a second VM with a second Zope instance - both as a failover and as a possible way to debug production data.

After a restart/deployment, I'd like to avoid manually clicking through my app on two Zope instances.

Can you recommend anything for warming up the caches automatically after a restart/deployment?

Thank you!

I would recommend using two (or more) Zope instances. This way you can restart one instance while the other remains available to handle requests. You already use HAProxy, so the load balancer will notice when an instance is down and stop sending requests to it.

You could automate your deployment and do a rolling restart of the instances with a warm-up. For example:

  • Restart the 1st instance
  • Wait until the instance is up
  • Run a script to warm up the cache
  • Continue with the 2nd instance
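The steps above could be sketched in Python, for instance like this (the systemd unit names, ports, and site path are assumptions; adapt them to your setup):

```python
import subprocess
import time
import urllib.request

# Hypothetical instances: (restart command, health-check URL)
INSTANCES = [
    (["systemctl", "restart", "zope-instance1"], "http://localhost:8081/plone/"),
    (["systemctl", "restart", "zope-instance2"], "http://localhost:8082/plone/"),
]

def is_up(url):
    """Return True when the instance answers its health-check URL."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def wait_until_up(url, timeout=300, interval=5):
    """Poll the instance until it responds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_up(url):
            return True
        time.sleep(interval)
    return False

def warm_up(url):
    """Minimal warm-up: load the homepage (replace with a sitemap crawl)."""
    urllib.request.urlopen(url, timeout=60).read()

def rolling_restart():
    """Restart each instance in turn, waiting and warming before moving on."""
    for restart_cmd, url in INSTANCES:
        subprocess.run(restart_cmd, check=True)
        if not wait_until_up(url):
            raise RuntimeError("instance did not come back: %s" % url)
        warm_up(url)
```

With HAProxy's health checks in front, each instance is only put back into rotation once it responds again, so users never hit a cold instance.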

To warm up the cache, you could use a script that loads sitemap.xml.gz (or a set of predefined URLs) and hits all or several pages. With bash/curl/wget or python/requests it should be easy to crawl the sitemap or a similar file.
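A minimal Python sketch of the sitemap approach (the site URL is an assumption; the XML namespace is the standard sitemap one):

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Standard sitemap protocol namespace
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml):
    """Pull all <loc> entries out of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

def warm_from_sitemap(base_url):
    """Fetch sitemap.xml.gz and request every URL it lists."""
    raw = urllib.request.urlopen(base_url + "/sitemap.xml.gz", timeout=60).read()
    xml_data = gzip.decompress(raw)
    for url in extract_urls(xml_data):
        # Loading each page lets Plone fill its caches on this instance
        urllib.request.urlopen(url, timeout=60).read()

# Example (hypothetical site path):
#   warm_from_sitemap("http://localhost:8080/plone")
```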


The Ansible Playbook's restart_script role contains examples of how to do most of this. Take a look at its Jinja2 template for a script that takes a single instance offline, restarts it, warms its cache, and puts it back online.


Thank you very much!

Steve's script in particular contains some good starting points.

I will have to figure out how this will work across different virtual machines, and I'd also have to think about how to authenticate (possibly via basic auth).

I won't be able to start working on this immediately, but I will report back then.

Thanks again!

I'm looking for the same thing now.
These seem to be the best bets:
  • A standalone Python script: Script that (re)starts a Plone instance, loads its main page and all links within the Plone instance that are on the main page (some filters provided) · GitHub
  • An add-on that doesn't seem to work on newer Plone: GitHub - collective/collective.warmup

"zeo-client-client" could help, too:

Thinking aloud: it should be possible to warm the cache with the REST API.
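A hedged sketch of that idea, assuming plone.restapi is installed (the site URL and credentials are placeholders): the @search endpoint lists content, and each item's @id can then be fetched to load it.

```python
import base64
import json
import urllib.request

def auth_headers(user, password):
    """Basic-auth and JSON Accept headers for plone.restapi requests."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return {"Accept": "application/json",
            "Authorization": "Basic " + token}

def restapi_get(url, user, password):
    """GET a plone.restapi endpoint and decode the JSON response."""
    req = urllib.request.Request(url, headers=auth_headers(user, password))
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)

def warm_via_restapi(base_url, user, password):
    """List Documents via @search, then fetch each one to warm the caches."""
    results = restapi_get(base_url + "/@search?portal_type=Document",
                          user, password)
    for item in results.get("items", []):
        restapi_get(item["@id"], user, password)

# Example (hypothetical values):
#   warm_via_restapi("http://localhost:8080/plone", "admin", "secret")
```

Because the requests are authenticated, this would also touch content that only logged-in users can access, which a plain anonymous crawl misses.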

Have you tried analysing what the cache content is? You could populate the cache directly with a script at startup, inserting the keys and values.

At first thought, it seems simpler to just call the pages and let Plone do the work the first time.


A simple 'poor man's' warm-up one-liner we use in a bash script for rolling restarts is the following: it fetches the homepage with wget, parses out the links, and feeds them into a bash loop that fetches each of those links in turn.

This (or even just fetching the Plone homepage) is enough to avoid the 'first request' timeout that can happen on larger sites when the Zope server starts up but the Plone site hasn't been requested yet, so everything (like the portal_catalog) has to be loaded/initialised only on that first hit.

    wget -O - http://localhost:8080/plone/ | grep -o '<a href="http[^"]*"' | sed 's/<a href="//;s/"$//' | grep "localhost" | while read line; do wget -P /tmp/warmup --delete-after "$line"; done

Beware that if you're caching getURL or URL-related values in a RAMCache, using localhost will fill the cache with localhost URLs instead of the site URL. Alternatively, use the portal_url as part of the cache key.

using localhost will fill the cache with localhost instead of the site url.

I have never had or seen issues with this, even when these URLs are cached. The public URLs of the Plone websites we maintain are all accessed through VHM URLs (VirtualHostMonster), so any localhost URLs are rewritten by Zope to the public domain name when the HTML is served back to the proxy server.

I mean this:

    from time import time
    from plone.memoize import ram

    @ram.cache(lambda *args: str(time() // (60 * 60 * 24)))
    def myfunction(context):
        results = []
        for brain in context.portal_catalog(portal_type='Document'):
            results.append({'title': brain.Title,
                            'url': brain.getURL()})
        return results

If you run it the first time from a script pointing at localhost, the localhost URLs will be cached for 24 hours. So when you visit the portal, the browser view using this cached method will return the localhost URLs instead of the public site address.

The best option in these cases is to cache just the UUIDs and then fetch the values from the catalog in the uncached part of the browser view. Moreover, to handle the case where a Document is deleted/added/modified, the cache key function should collect all the UUIDs and modification dates of the Documents and compute an md5 over them.
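A minimal sketch of such a cache key function for plone.memoize's ram cache (the function and variable names here are my own, not from an existing API):

```python
from hashlib import md5

def documents_cache_key(fun, context):
    """Cache key that changes whenever any Document is added, removed,
    or modified: hash every Document's UID and modification date."""
    digest = md5()
    for brain in context.portal_catalog(portal_type='Document'):
        digest.update(str(brain.UID).encode())
        digest.update(str(brain.ModificationDate).encode())
    return digest.hexdigest()

# Intended usage (requires plone.memoize):
#
#   @ram.cache(documents_cache_key)
#   def document_uids(context):
#       # cache only the UIDs; resolve titles/URLs in the uncached part
#       return [b.UID for b in context.portal_catalog(portal_type='Document')]
```

As soon as any Document changes, the key changes and the stale entry is simply never read again, so there is no need for explicit invalidation.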

I used a simple restart shell script that invoked curl during a rolling restart on the pages we knew were the most popular.

So would API calls warm the cache? (I know I can just test it.) My scenario is a bunch of pages that only logged-in users can access. The "poor man's" cache warming seems more appropriate for public-facing pages. Again, I haven't tested any of this yet, but my reasoning is that if curl/wget can't call it, it can't be cached.

I'm sure API calls warm SOMETHING (the ZEO client cache, the Zope cache, etc.), assuming you've logged in and that user can access the pages in question. Probably not the same cache that would be loaded or read by someone coming in via a browser, though.

:thinking: in my mind, at least it's "warming" authenticated content.