BTreeFolder2 questions

Thanks to both of you.

:grinning:

I will try these things (dm.zope.profile & py-spy) after reaching one million users and one million dexterity persons.

Thanks!

Manuel

A further option is the use of a "request monitor" from the package haufe.requestmonitoring. This is a tool that recognizes long running requests and logs information about them. Usually, you would use the DumpTraceback monitor; it dumps traceback (more precisely, call chain) information to the log file for long running requests. Successive tracebacks often provide valuable information about the cause of the long runtime.
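For reference, here is roughly how the monitor is wired into zope.conf; I am quoting the directives from memory of the haufe.requestmonitoring README, so verify the exact names against the package documentation:

%import haufe.requestmonitoring
<requestmonitor requestmonitor>
    # check for long running requests every 5 seconds
    period 5s
    <monitorhandler dumper>
        # dump a traceback for requests running longer than 10 seconds
        factory haufe.requestmonitoring.DumpTraceback.factory
        time 10s
    </monitorhandler>
</requestmonitor>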

@dieter OT here, but do you think it is possible to combine the py-spy way of getting timing data from a running instance with haufe.requestmonitoring in a way that the dump in the log contains timing data?

Yes, no problem: I'll try everything and post the results here. I believe that by Sunday I will have reached two million (one million users and one million dexterity persons). Then I will try the things you recommend (but remember that I am only a learner, not an expert in Python, Zope and Plone like you; it may take a while ...). If someone wanted to test with more millions, I could create a Plone 5.2 buildout just for that (5 million of each, 10 million of each, 100 million of each, ...), compress it and put the download link here (with half a million of each, it weighs only 800 MB more uncompressed than a fresh install).

With 500,000 users and 500,000 dexterity persons:
1.- First and subsequent user searches on /@@usergroup-userprefs: 77 seconds
While you search for users, the entire site (especially searches and collections) slows down tremendously.
Activating the cache helps a little (I will have to study and try Varnish ...).
2.- User creation (modal window does not close): 4 seconds
3.- User searches:
· First search for a newly created user: 77 seconds
· Second and subsequent searches for the newly created user: 77 seconds
4.- User deletion:
· First attempt canceled after half an hour (not deleted).
· Second attempt: ??? (at some point the user was removed, but I don't know exactly when, as the user screen was frozen).

I continue on to two million ...

Hugs,

Manuel

The request monitor (in haufe.requestmonitoring) has a small (runtime) footprint: it is not necessary to use the py-spy technique (external observation). The log records produced by the DumpTraceback monitor routinely contain (global) timing data; example:

2021-11-24 11:03:49,858 WARNING [RequestMonitor.DumpTrace:89][Dummy-2] Long running request Request 2 "/++bfd_monitor++debug" running in thread 139712287774464 since 3.565307378768921s
Python call stack (innermost first)
  Module Debug.sleep5, line 2, in <module>
  Module importlib._bootstrap, line 219, in _call_with_frames_removed
  Module importlib._bootstrap_external, line 678, in exec_module
  Module importlib._bootstrap, line 618, in _exec
  Module importlib, line 166, in reload
  Module bfd.monitor.debug, line 94, in do_reload
   - __traceback_info__: Debug.sleep5
 ...

A significant extension of py-spy would be necessary to add timing data to the individual frames of such a call chain: it would need to recognize the start and end of request processing. The request monitor uses ZPublisher.interfaces.IPubEvent subscribers for this. To facilitate the task for py-spy, alternative subscribers could be used to increment counters for started and completed requests; py-spy could monitor those counters to determine whether something relevant (request processing started or completed) happened. It would shadow a Python dict containing the request description. With this information, py-spy could add the request as an (artificial) top-level caller and then use its normal profiling facility to add times to the individual call frame entries.
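A minimal sketch of such counter subscribers (my own illustration, not existing code; it assumes Zope's ZPublisher publication events and zope.component for the registration):

from zope.component import adapter, provideHandler
from ZPublisher.interfaces import IPubStart, IPubEnd

# module-level state that an external observer such as py-spy could shadow
request_counters = {"started": 0, "finished": 0}

@adapter(IPubStart)
def count_start(event):
    # a request has entered the publisher
    request_counters["started"] += 1

@adapter(IPubEnd)
def count_end(event):
    # request processing has finished (success or failure)
    request_counters["finished"] += 1

provideHandler(count_start)
provideHandler(count_end)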


Is this when you have checked 'many users' / 'many groups' at /@@usergroup-controlpanel?

Could you move this (together with the other slow cases in 1-3) to a ticket at Issues · plone/Products.CMFPlone · GitHub? This is a bug, even if it is an extreme edge case. A filled database (I think 500k users are enough), together with instructions on what is needed in which versions to get it running (maybe as a Gist), would be extremely helpful.

If you perform the operation via HTTP, you can only cancel it by restarting Plone -- otherwise, the operation continues on the server.

If you have a large number of users, it is likely better to store them in a relational database (not the ZODB): the user management in the ZODB uses only very elementary indexes, and searches are likely not optimized. When you store user information in a relational database, you can define appropriate indexes which can drastically speed up searches.
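As a generic illustration of the point (plain SQLAlchemy; the table and column names are made up), an ordinary database index is all it takes to make name searches cheap even with millions of rows:

import sqlalchemy as sa

metadata = sa.MetaData()
users = sa.Table(
    "users", metadata,
    sa.Column("login", sa.String(64), primary_key=True),
    sa.Column("fullname", sa.String(128)),
    # this index is what makes searches by name fast
    sa.Index("ix_users_fullname", "fullname"),
)

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)
with engine.connect() as conn:
    # a prefix search like this one can use ix_users_fullname
    result = conn.execute(
        sa.select(users).where(users.c.fullname.like("Manuel%")))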

User deletion in portals with many folders is time consuming because the process visits all portal folders to remove the local roles of the user being deleted.

After all, what is your intention? Be aware that having many users authenticated in Plone at the same time makes the portal very slow (there is no caching for authenticated users). It is even worse if they're creating/editing/deleting content. We did some tests, and when more than 150 users were managing content at the same time, the site became unusable.


I do not think that this is true: almost surely, the user information is stored in a BTree; it typically has logarithmic access time and should scale well.
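A toy example (mine, not Plone code) to make the scaling plausible:

from BTrees.OOBTree import OOBTree

users = OOBTree()
for i in range(1000000):
    users["user%07d" % i] = {"fullname": "User %d" % i}

# a point lookup stays fast even with a million entries
print(users["user0500000"])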

The time spent on authentication essentially depends on the number of requests (one authentication per request) and usually not the number of the concurrent users (unless that number is so huge that it thrashes the ZODB cache).

This is the default. But Products.PluggableAuthService (the base of Plone's authentication) uses OFS.Cache.Cacheable, and this can be used to cache the corresponding decisions -- for details, look for Cacheable in the source of Products.PluggableAuthService.PluggableAuthService. Cache configuration happens via the Cache tab (or the ZCacheable_manage view).
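The same association can be made in code; a hedged sketch (it assumes a RAM cache manager with id 'ram_cache' already exists next to acl_users, and 'portal' stands for your site object):

# acl_users is the PluggableAuthService instance; it mixes in
# OFS.Cache.Cacheable, so a cache manager can be attached to it
pas = portal.acl_users
pas.ZCacheable_setManagerId("ram_cache")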

I meant that it is not only slow because of the authentication process itself, but also because of the navigation of many simultaneous users. Without caching (Varnish/nginx), Plone has to render the HTML for every click of every user. If there are too many users navigating at the same time, this overloads Plone.

Are you talking about storing ONLY the users in the relational database? Would that not still be slow, since deleting a user will still require changing permissions on 'folders and files' in the ZODB?

I do.

This seems to be something new. Until this comment, I was sure that such a cleanup after user deletion was missing. But deleting users is fairly rare; thus, I am not too concerned about a higher time for this operation.

Are there any docs (or similar) on the 'recommended approach' to what Manuel describes here (something that might one day extend to hundreds of thousands of users)?

With one million users and one million dexterity persons:

The buildout occupies 1.1 GB more than when freshly created.

In normal mode, it takes 3 minutes for the site to start responding and an additional 30 seconds to finish responding (3.5 minutes in total until the default home page loads).

In foreground mode, client1 startup often fails with the following error:
...
File "/home/plone/52/buildout-cache/eggs/ZEO-5.2.2-py2.7.egg/ZEO/asyncio/client.py", line 825, in wait
self.wait_for_result(self.client.connected, timeout)
File "/home/plone/52/buildout-cache/eggs/ZEO-5.2.2-py2.7.egg/ZEO/asyncio/client.py", line 759, in wait_for_result
raise ClientDisconnected("timed out waiting for connection")
ZEO.Exceptions.ClientDisconnected: timed out waiting for connection

And when it does start up, it takes 30 seconds longer than in normal mode (4 minutes in total before loading the default home page).

Second loading of default home page: 2-3 seconds
First load of an empty collection: 10 seconds
Next loads of empty collections: 2-3 seconds
First load of the /Members folder (one million users): 8 seconds
Next loads of /Members: 2-3 seconds
First search in /Members: 77 seconds
Immediate subsequent loading of an empty collection: 20 seconds
Next load of /Members: 8 seconds
Next search in /Members: 81 seconds
Next load of /Members: 4 seconds
Next search in /Members: 27 seconds
First load of the /persons folder (one million dexterity items): 41 seconds
Next loads of the /persons folder: 8-9 seconds
First load of a normal page: 24 seconds
Next loads of the same normal page: 2-3 seconds

/@@usergroup-userprefs loads fast.
First search here: 127 seconds
Next searches: 170 seconds
Create new user: 4 seconds
Delete a user: ??? (attempt canceled after half an hour; not deleted)

Certainly there is a problem, but:
1.- It is not much worse than with 500,000 users and 500,000 dexterity persons.
2.- The speed in the public part will likely improve a lot if I use Varnish (I will test this).

I'm going to study your answers ...

Thanks

Editing user data is OK (very fast: 1-2 seconds, normally 1 second).

And loading /author/username is fast too (it only takes a little longer the first time after modifying user data).

:grinning:

Modifying and loading /persons/username (a dexterity item configured like the user folder) is fast too.

:grinning:

Where is the code responsible for this?

We will not use the sharing option, and each normal user will only have permissions on their personal folder and its contents: I imagine a lot of time could be saved by avoiding the permission checks on all folders when deleting a user, or by creating a quick delete action alongside the normal delete action.

Thanks

Have you shut down your ZEO properly?

The timed out waiting for connection indicates that the client is waiting for a ZEO connection. ZEO can take very long to start when it has not been properly shut down. This is caused by FileStorage: it uses an index file (mapping object ids to the file position of the most current object state). When ZEO is properly shut down, an updated index file is written; if it is not, ZEO must process the complete storage to rebuild this information.

You have a particular large installation. Such installations require special configuration.

One potential optimization targets the size of the ZODB caches.
In your case, 2 ZODB caches are used: the "connection cache" and the "ZEO client cache". The "connection cache" resides in RAM and maps object ids to the current object. Usually, it is configured via its maximal number of objects. The "ZEO client cache" resides on disk and maps object ids to object pickles; it is configured by disk size. Usually, it resides in a temporary file (and a new cache is created when a ZEO connection is opened), but configuration can make it persistent (usable across connections) - this can speed up startup.
Increasing the size of those caches can speed up your follow-up operations; a persistent "ZEO client cache" can also speed up your first operations.
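A sketch of the relevant part of zope.conf (sizes and paths are placeholders; the client and var options are what make the ZEO client cache persistent):

<zodb_db main>
    # connection cache: maximal number of objects kept in RAM
    cache-size 100000
    <zeoclient>
        server localhost:8100
        # ZEO client cache: its size on disk
        cache-size 2GB
        # naming the cache makes it persistent across restarts
        client zeo1
        # directory for the persistent cache file
        var /home/plone/52/var
    </zeoclient>
    mount-point /
</zodb_db>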

As already mentioned: you should consider storing your user descriptions in a relational database. The ZODB is a general purpose database designed for ease of use, not for speed. In particular, it has no special support for highly structured mass data (such as millions of user descriptions). A properly configured relational database does a better job in such a case than the ZODB. Storing user descriptions in a relational database should reduce the user oriented operation times, apart from user deletion.

Use the tools mentioned in earlier comments to find out what causes the huge time for user deletion. Likely, all content objects are visited to remove references to the deleted user. This would be fairly new: formerly, no such cleanup happened. To speed it up, an index mapping user ids to the content objects referencing the respective user would be necessary, but it likely does not yet exist.
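A rough sketch of the idea (hypothetical; Plone does not ship such an index): a KeywordIndex fed by an indexer recording which user ids hold local roles on each object; user deletion could then query the catalog instead of walking all content.

from plone.indexer import indexer
from zope.interface import Interface

@indexer(Interface)
def local_role_holders(obj):
    # user ids holding local roles on obj, for a KeywordIndex
    # named 'local_role_holders' in portal_catalog
    return [userid for userid, roles in obj.get_local_roles()]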

Searching is a form of filtering. The default Plone search component does the filtering in the client: this means that the unfiltered raw (index) data must be transferred from the server to the client to be processed there. There are Plone addons which offload searching to a different system (e.g. Solr) that can search much more efficiently.

Likely Products.CMFCore.MembershipTool.MembershipTool.deleteMembers. It has a parameter delete_localroles (default True) which controls local role updates.
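Given that parameter, a hedged sketch of the 'quick delete' Manuel asks for above (acceptable only when stale local roles do not matter, e.g. when users hold local roles solely on their own member folder):

from Products.CMFCore.utils import getToolByName

membership = getToolByName(portal, "portal_membership")  # 'portal': the site object
# delete_localroles=False skips visiting every folder to strip the
# user's local roles, which is the slow part of the default deletion
membership.deleteMembers(["some_user_id"], delete_localroles=False)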