Intranets with huge LDAPs

sneridagh · February 25, 2015, 8:16pm

Hi!
I wanted to share with you all the issues I've had to face lately when dealing with intranets connected through Products.PloneLDAP to LDAPs with a large user and group base. I work at the Barcelona Tech University and we have lots of Plone sites connected to our LDAP server, which holds 80,000 users and 3,000 groups. We've been using the de facto "standard", the multilayered P.PloneLDAP and its siblings, since forever.

The small sites and the public sites are mostly unaware of the problems. The issues arise on heavily used intranets. I've been tracing the queries on the LDAP and, I have to admit, the results were very surprising to me. Let me explain.

Initial plugin config

My initial LDAP plugin config was to enable:

IAuthenticationPlugin
IGroupEnumerationPlugin
IGroupIntrospectionPlugin
IGroupsPlugin
IPropertiesPlugin
IRoleEnumerationPlugin
IRolesPlugin
IUserEnumerationPlugin

Sharing tab view

The sharing tab was the first stop in the debugging process. Let's say you search for a user/group: the plugin queries for it and the result is a large set of users/groups matching the search (because I've not narrowed the query enough yet, or because there are lots of users/groups that match it). OK, the first query is normal and sane, but then the plugin tries to verify that each and every returned result is OK and performs a query for each of them. A simple operation that could be done with one query can escalate to hundreds of useless queries.

Then, the sharing view uses AJAX… if you type the user/group name slowly enough, every pause fires another search, and you can provoke a DoS on your own server.
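To make the cost of the search-then-verify pattern concrete, here is a minimal sketch with an in-memory stand-in for the directory. `CountingDirectory`, `search` and `lookup` are hypothetical names for illustration, not the Products.PloneLDAP or LDAPUserFolder API; the only point is how the query count grows with the size of the result set.

```python
class CountingDirectory:
    """In-memory stand-in for an LDAP server that counts queries."""

    def __init__(self, entries):
        self.entries = dict(entries)  # id -> display name
        self.queries = 0

    def search(self, text):
        """One query returning every entry whose id contains `text`."""
        self.queries += 1
        return [(uid, name) for uid, name in sorted(self.entries.items())
                if text in uid]

    def lookup(self, uid):
        """One query fetching a single entry by id."""
        self.queries += 1
        return self.entries.get(uid)


def search_and_verify(directory, text):
    """Pattern described above: one search, then one lookup per result."""
    results = directory.search(text)
    return [uid for uid, _ in results if directory.lookup(uid) is not None]


def search_only(directory, text):
    """Trust the entries the search already returned."""
    return [uid for uid, _ in directory.search(text)]


entries = {"user%03d" % i: "User %d" % i for i in range(200)}

naive = CountingDirectory(entries)
verified = search_and_verify(naive, "user1")

direct = CountingDirectory(entries)
trusted = search_only(direct, "user1")

print(naive.queries, direct.queries)  # 101 1
```

Both functions return the same 100 ids; the verifying variant just pays one extra round trip per match.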

Use of getMemberById

The omnipresent getMemberById is the most interesting part. You have to face it sooner or later (by using plone.api or portal_membership or its friends) if you want to retrieve any user properties, and it's widely used everywhere. However, every time it's invoked it triggers the PAS pipeline, making lots of useless queries only to retrieve a bit of user information like display names or emails.

Groups and recursive groups (both IGroupsPlugin type)

These are by far the heaviest plugins. Recursive groups (the lambda icon) is a plugin activated by default in PAS that allows nested groups in Plone, locally... and in every other active group plugin. The former (the groups plugin) is the one responsible for the feature that lets you grant permissions to an LDAP group.

So let's say we invoke getMemberById while rendering a view. This action triggers the whole PAS pipeline. When it’s the turn of the LDAP groups plugin, the legitimate query is sent to find all the groups the user is a member of. Then, for all the groups in the response, each group is validated (again) against the LDAP. Let’s say I’m assigned to 30 groups…

Then it’s the turn of the recursive groups plugin, which goes through all the groups the user is a member of and queries each and every one of them, searching for more nested groups (if any). Thank God our LDAP group structure is flat… Multiply. I’ve had more than 400 requests to LDAP for one single getMemberById. Insane.

Of course, the RAM cache will do its job and maintain a fragile feeling of “everything is ok”… but it’s temporary, of course. However, the memory consumption of the Zope processes quickly escalates to 1.5 GB per day… forcing us to restart daily.
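A back-of-the-envelope model of the multiplication described above. The per-step counts are simplifying assumptions taken from the description (one membership query, one validation per group, one nested-group search per group, flat group structure); they are not measured from the plugin code, and `ldap_queries_per_call` is a made-up helper.

```python
def ldap_queries_per_call(n_groups, validate_groups=True, recursive=True):
    """Rough count of LDAP round trips for one uncached user lookup.

    Assumes a flat group structure (no nested groups).
    """
    queries = 1  # the legitimate "which groups is this user in?" query
    if validate_groups:
        queries += n_groups  # each returned group is checked again
    if recursive:
        queries += n_groups  # each group is searched for nested groups
    return queries


per_call = ldap_queries_per_call(30)
without_recursive = ldap_queries_per_call(30, recursive=False)
minimal = ldap_queries_per_call(30, validate_groups=False, recursive=False)
print(per_call, without_recursive, minimal)  # 61 31 1
```

Under these assumptions a user in 30 groups costs about 61 round trips per uncached pass through the pipeline, so anything that triggers the pipeline more than once multiplies that figure quickly.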

Workarounds and possible solutions

After my research, I’ve concluded that I had activated more plugins than I really needed, so I deactivated the ones that are useless for my setup and left:

IAuthenticationPlugin
IGroupEnumerationPlugin
IGroupIntrospectionPlugin
IGroupsPlugin
IUserEnumerationPlugin

And disabled the recursive group plugin, as I do not need it at all.
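As a rough sketch of how such a cleanup could be scripted instead of clicked through the ZMI: the helper below assumes the plugin registry at `acl_users.plugins` exposes `listPluginIds(plugin_type)` and `deactivatePlugin(plugin_type, plugin_id)`, as Products.PluginRegistry does. The plugin ids and the `StubRegistry` used for the demonstration are hypothetical; in a real site you would pass the interface objects from `Products.PluggableAuthService.interfaces.plugins` rather than strings.

```python
def deactivate_for(registry, plugin_id, plugin_types):
    """Deactivate `plugin_id` for each given plugin type, if active.

    Returns the plugin types that were actually deactivated.
    """
    changed = []
    for plugin_type in plugin_types:
        if plugin_id in registry.listPluginIds(plugin_type):
            registry.deactivatePlugin(plugin_type, plugin_id)
            changed.append(plugin_type)
    return changed


class StubRegistry:
    """Hypothetical stand-in for acl_users.plugins, for demonstration only."""

    def __init__(self, active):
        self._active = {ptype: list(ids) for ptype, ids in active.items()}

    def listPluginIds(self, plugin_type):
        return tuple(self._active.get(plugin_type, ()))

    def deactivatePlugin(self, plugin_type, plugin_id):
        self._active[plugin_type].remove(plugin_id)


# Illustrative ids only; check the ids used in your own acl_users.
registry = StubRegistry({
    "IPropertiesPlugin": ["ldap-plugin", "mutable_properties"],
    "IRolesPlugin": ["ldap-plugin", "portal_role_manager"],
    "IGroupsPlugin": ["ldap-plugin", "recursive_groups"],
})
changed = deactivate_for(
    registry,
    "ldap-plugin",
    ["IPropertiesPlugin", "IRolesPlugin", "IRoleEnumerationPlugin"],
)
print(changed)  # ['IPropertiesPlugin', 'IRolesPlugin']
```

The three plugin types passed in are the ones dropped between the initial list and the reduced list above; the groups interfaces are left untouched.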

I’ve been talking with some plonistas about their point of view (Asko, Ramon) and they told me that they have been struggling with these same issues before for the same scenarios. Asko shared with me some insights on how to fix them:

Sharing views could be fixed to ask for LDAP details by AJAX in a way that wouldn't block Zope (Medusa/Asyncore/ZServer related magic I've blogged about before). Of course, it'd still block with HAProxy configured to allow only a fixed number of requests per instance.
PAS could be fixed so that it'd pass through lazily evaluable iterators (the greatest blocker for this is that PAS currently sorts the results, and sorting prevents the usage of lazy iterators).
python-ldap should be replaced with a more modern library (which preferably would support the iterator approach).
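The point about sorting defeating lazy iterators can be shown with plain Python. This is only an illustration of the principle; `fetch_entries` is a hypothetical generator standing in for a paged LDAP search, not PAS or python-ldap code.

```python
from itertools import islice


def fetch_entries(total, fetched):
    """Generator standing in for a paged LDAP search.

    `fetched` records how many entries were actually pulled.
    """
    for i in range(total):
        fetched.append(i)
        yield "user%05d" % i


# Lazy: only the first 10 entries are ever pulled from the "server".
lazy_log = []
first_page = list(islice(fetch_entries(80000, lazy_log), 10))

# Sorting: sorted() must consume the whole iterator before returning.
eager_log = []
sorted_page = sorted(fetch_entries(80000, eager_log))[:10]

print(len(lazy_log), len(eager_log))  # 10 80000
```

Both variants show the same ten names here, but the sorted one had to read all 80000 entries first. Keeping laziness would mean asking the directory for ordered results, or not sorting at all.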

I’ve been playing with a workaround for getMemberById myself: a parallel user property catalog based on plone.souper and repoze.catalog, maintained via events bound to user property modifications and user creation/deletion. The default way to deal with searching users:

from itertools import chain
from zope.component import getMultiAdapter
hunter = getMultiAdapter((portal, self.request), name='pas_search')
fulluserinfo = hunter.merge(chain(*[hunter.searchUsers(**{field: query}) for field in ['fullname', 'name']]), 'userid')

using PAS triggers the whole pipeline for each result returned… This should be rethought somehow. I do not know if the approach I’ve used is valid, but it’s a first idea.

All the default views should also be reworked to use such an alternative getMemberById.
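The idea of a side catalog kept up to date by events can be sketched as follows. This is a plain-dict stand-in, not the plone.souper or repoze.catalog API, and every name in it (`UserPropertyIndex`, `on_user_saved`, `on_user_deleted`) is hypothetical; in a real site the handlers would be subscribed to the user created/modified/deleted events.

```python
class UserPropertyIndex:
    """Plain-dict stand-in for a side catalog of user properties."""

    def __init__(self):
        self._records = {}

    # Handlers meant to be wired to user created/modified/deleted events.
    def on_user_saved(self, userid, properties):
        self._records[userid] = dict(properties)

    def on_user_deleted(self, userid):
        self._records.pop(userid, None)

    def get(self, userid):
        """Cheap lookup that avoids the full PAS pipeline."""
        return self._records.get(userid)

    def search(self, text, fields=("fullname", "name")):
        """Return the ids whose listed fields contain `text`."""
        text = text.lower()
        return sorted(
            userid for userid, props in self._records.items()
            if any(text in str(props.get(f, "")).lower() for f in fields)
        )


index = UserPropertyIndex()
index.on_user_saved("jdoe", {"fullname": "Jane Doe", "email": "jane@example.org"})
index.on_user_saved("jsmith", {"fullname": "John Smith", "email": "john@example.org"})
index.on_user_deleted("jsmith")

print(index.get("jdoe")["fullname"], index.search("jane"))
```

The trade-off of any such index is staleness: changes made directly in the LDAP never fire a Plone event, so the index would need a periodic resync to pick them up.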

Lately I’ve been studying pas.plugins.ldap but it seems it has a blocker issue:

that involves performance issues with large LDAPs.

Another option is to write a simple plugin on top of python-ldap that makes strict (and saner) use of LDAP.

What do you think? Have you ever faced these issues? If so, what are your workarounds/approaches?
I thought it was worth pointing this out and starting to plan fixes for some of these issues on the Plone roadmap.

Cheers,
V.

PS: Sorry if this bloggish post was too exhaustive!

Alexander_Loechel · February 25, 2015, 8:32pm

Hi Victor,

that is exactly the same problem we talked about at Plone Conf 2014 in Bristol. datakurre (Asko Soukka) has faced it, as have several other universities. datakurre already tried to commission a new LDAP PAS plugin from chaoflow, https://github.com/chaoflow/ldapalchemy, to solve it.

It might be a good idea to have a small hangout all together before March 17th, when there will be a meeting of the German EDU institutions that use Plone. They might be willing to work together to make a better solution happen.

By the way, for most universities that problem should not exist if they use single sign-on with Shibboleth, as you do not need any LDAP backend for Plone. That's how I use it at the moment, with a user database (LDAP) with more than 100,000 members.

Cheers,
Alexander

pbauer · February 25, 2015, 8:47pm

See "LDAP status quo and where to go from here" for a post by chaoflow on LDAP in Plone.

datakurre · February 26, 2015, 2:17am

And also https://mail.python.org/pipermail/python3-ldap/2014/000095.html about issues chaoflow sees with python3-ldap (besides it possibly having worse performance because of being pure Python instead of openldap).

But chaoflow's ldapalchemy is currently only a proof of concept of using openldap / libldap via CFFI. While it does make ldap library code much more clean and maintainable, it's a moon shot for fixing Plone LDAP issues (and we were not able to fund it further – uni domain is highly regulated and after a certain level of costs there's abundance of bureaucracy).

Something more pragmatic might bring results faster.

Victor wrote a good list of issues with Plone and LDAP and, actually, most of those issues are in the Plone stack.

Basically, we should:

Do fewer LDAP queries by combining queries as much as the syntax allows.
Do fewer LDAP queries by being able to evaluate them more lazily.

Unfortunately, it's hard to do the first without doing the other, and because of all the layers (LDAPUserFolder, PAS, MembershipTool, PloneLDAP, plone.app.ldap, ...), it's hard to know from where to start.
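As an illustration of the first point, several per-entry lookups can be folded into one LDAP OR filter, and the batches can be produced lazily. This is a minimal sketch: the helper names and the batch size of 10 are assumptions made for the example, and the escaping follows the reserved characters of RFC 4515 rather than calling any LDAP library.

```python
def escape_filter_value(value):
    """Escape the characters that RFC 4515 reserves inside filter values."""
    reserved = {"\\": "\\5c", "*": "\\2a", "(": "\\28", ")": "\\29", "\x00": "\\00"}
    return "".join(reserved.get(ch, ch) for ch in value)


def or_filter(attribute, values):
    """Build one filter matching any of `values` for `attribute`."""
    if not values:
        raise ValueError("at least one value is required")
    parts = ["(%s=%s)" % (attribute, escape_filter_value(v)) for v in values]
    return parts[0] if len(parts) == 1 else "(|%s)" % "".join(parts)


def batched_filters(attribute, values, batch_size=10):
    """Lazily yield one combined filter per batch of values."""
    values = list(values)
    for start in range(0, len(values), batch_size):
        yield or_filter(attribute, values[start:start + batch_size])


groups = ["group%02d" % i for i in range(30)]
filters = list(batched_filters("cn", groups))
example = or_filter("cn", ["staff", "it(ops)"])

print(len(filters))  # 3
print(example)       # (|(cn=staff)(cn=it\28ops\29))
```

With this shape, validating 30 groups takes 3 searches instead of 30, and because `batched_filters` is a generator, later batches are only built if the caller actually asks for them.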

We are not currently working on the Plone LDAP stack, but we are learning more about optimizing LDAP queries in a non-Plone project (there we are mapping LDAP objects/attributes to Colander schemas and only combining + executing the required queries when those schema nodes are really used - like when a string is being rendered into a template).
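The "only query when the value is really used" idea can be sketched without Colander. In this hypothetical example, `LazyEntry` collects the attributes a template will need and issues a single combined fetch the first time any of them is rendered; `fake_fetch` stands in for the real LDAP call.

```python
class LazyEntry:
    """Collects wanted attributes and fetches them together on first use."""

    def __init__(self, fetch, dn):
        self._fetch = fetch  # callable(dn, attrs) -> dict
        self._dn = dn
        self._wanted = set()
        self._loaded = {}

    def attr(self, name):
        """Declare interest in an attribute without querying yet."""
        self._wanted.add(name)
        return LazyValue(self, name)

    def _resolve(self, name):
        if name not in self._loaded:
            # Fetch everything still missing in one combined query.
            missing = sorted(self._wanted - set(self._loaded))
            fetched = self._fetch(self._dn, missing)
            for key in missing:
                self._loaded[key] = fetched.get(key, "")
        return self._loaded[name]


class LazyValue:
    """Placeholder that triggers the fetch only when rendered as a string."""

    def __init__(self, entry, name):
        self._entry = entry
        self._name = name

    def __str__(self):
        return str(self._entry._resolve(self._name))


calls = []


def fake_fetch(dn, attrs):
    """Stand-in for an LDAP read; records each call."""
    calls.append((dn, tuple(attrs)))
    data = {"cn": "Jane Doe", "mail": "jane@example.org"}
    return {name: data[name] for name in attrs if name in data}


entry = LazyEntry(fake_fetch, "uid=jdoe,ou=people,dc=example,dc=org")
name = entry.attr("cn")
mail = entry.attr("mail")
queries_before_render = len(calls)

rendered = "%s <%s>" % (name, mail)
print(queries_before_render, len(calls), rendered)
```

Nothing is fetched while the attributes are merely declared; rendering the string triggers exactly one call that asks for both `cn` and `mail`.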

datakurre · February 26, 2015, 6:05am

Briefly about "async stream iterators / stream iterator workers", which I've explored as a partial solution to avoid DOSing Plone while doing a lot of LDAP calls.

A few years ago I stopped waiting for WSGI and started embracing the asyncore-based medusa running our ZServer. In normal use, ZPublisher hides all of medusa from us, but with blobs we got a feature called StreamIterator: a way to pass code through ZPublisher, out of the ZServer worker threads to medusa's main thread.

By default, stream iterators are only used for streaming blobs in medusa main thread, so that Zope worker threads are freed to serve next requests.

Yet, with a little wrapper code, a stream iterator can be used to execute custom code in its own thread and still return a normal response to the client. Of course, that code should not access ZODB, but nothing should stop it from accessing LDAP or other external services.

I have a few public examples:

a blog post about it
[yet another ZIP export][2] for Plone, which gets all its data outside Zope worker threads (in the best case, it does requests through your Varnish balancer to distribute the load of collecting the zipped files)
[collective.futures][3], which goes a bit further: after doing the queued jobs outside the Zope worker threads, it retries the request with the return values of those jobs as annotations on the request (not super fast, because the same request is handled twice, but it works; we are using it e.g. for updating RSS feeds so that connecting to the external RSS service does not block our ZServer threads)

[2]: https://pypi.python.org/pypi/collective.jazzport
[3]: https://pypi.python.org/pypi/collective.futures

thet · February 26, 2015, 9:39am

i have recently fixed the performance issue by caching the results: UGM caching by thet · Pull Request #13 · collective/pas.plugins.ldap · GitHub - the pull isn't accepted, because they want to use bda.cache instead of plone.memoize to be able to have multiple caching backends. but the real fix would be to not query all users - AFAIK it's only used to check if the username is available in the LDAP or not.

however, i use the branch in a current deployment and it solves the performance issues with the 4000+ users LDAP directory.
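For readers who want to see the general shape of such result caching, here is a minimal time-based cache sketch. It is not the code from the pull request and does not use plone.memoize or bda.cache; `TTLCache`, the 300-second lifetime and the cache key are all assumptions made for illustration.

```python
import time


class TTLCache:
    """Tiny time-based cache: recompute a value once it is older than the TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value


now = [0.0]      # fake clock so the example is deterministic
lookups = []     # records each trip to the "directory"


def expensive_lookup():
    lookups.append(1)
    return ["staff", "students"]


cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.get_or_compute("groups:jdoe", expensive_lookup)  # miss, queries
cache.get_or_compute("groups:jdoe", expensive_lookup)  # hit, no query
now[0] = 301.0
cache.get_or_compute("groups:jdoe", expensive_lookup)  # expired, queries again

print(len(lookups))  # 2
```

This sketch never evicts entries, so a real deployment would also need to bound the number of cached keys to keep memory in check.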

nevertheless, i'm looking forward to any other development on this topic!

datakurre · February 26, 2015, 6:09pm

Multiple caching backends? Being able to override plone.memoize's caching utility is not enough?

Anyway, after working with caching, would you be able to estimate how difficult it would be to refactor pas.plugins.ldap (or node.ext.ldap) to work without fetching and knowing all the DNs? I've been pretty pessimistic about that.

thet · February 26, 2015, 7:14pm

the reasoning behind was, that you might have a different and dedicated caching backend for your LDAP users than for the plone generic caches.

the caching was quite a small change, you can look at the pull-request.

refactoring - i had the impression, that fetching all DN's is only to check if a login name exists - which is totally useless, because LDAP will authenticate you or not. if that's the case, the refactoring wouldn't take too long. but i might be wrong about that.

jensens · March 25, 2015, 8:55am

i don't want to stick to bda.cache. it's completely ok to get rid of it.

but replacing it with plone.memoize is not an option, since i at least have no idea how i assign a different memcached server to it than to the other caches in plone. having a dedicated cache for ldap is performance wise a must in larger setups. sharing the memcached with other plone caches has unpredictable effects on the efficiency of both caches.

in larger setups we have three memcached running: one for plone.memoize, one for LDAP and one for RelStorage.

So as long as we can fulfill this use case with plone.memoize, I'm completely ok with merging any PR that removes the bda.cache dependency.

gforcada · April 28, 2015, 10:33pm

I'm not using LDAP at all, but I always thought that one of the biggest strengths of LDAP is its master-slave connection.

If you keep a local slave close to your workers, then you may not need a cache at all.

Again, I'm not using it, nor do I have experience with it, but I would use that approach before adding caches (well, in a sense a slave is a cache somehow...)

jensens · April 29, 2015, 9:44am

We at BlueDynamics Alliance are at the moment refactoring pas.plugins.ldap and, even more so, node.ext.ldap. The latter now works in a Pyramid project with 68000 users (w/o caching). Some optimizations are left (I did not check user listings and I think we need to work on that). But overall most of the work is done.
For the curious: both packages have a performance branch. Overall it works, but expect bugs, we haven't finished yet.