NewtDB as a Plone catalog replacement?

Jim Fulton commented in the NewtDB google group:

An interesting project would be to try to create a drop-in catalog replacement. I can imagine something that provided the same search API and also set up PostgreSQL indexes when indexes were added, similar to the way Newt now automates setting up full-text search. It would have somewhat different semantics, but could probably provide the existing catalog convenience and radically reduce the weight of supporting search in Plone.

Thoughts?
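To make the idea concrete: a catalog shim along these lines would translate ZCatalog-style query dicts into SQL against Newt's JSONB `state` column. The `newt` table layout (`zoid`, `class_name`, `state`) is from the Newt DB docs; the translation function itself is a hypothetical sketch, not an actual NewtDB or Plone API.

```python
# Hypothetical sketch, not an actual NewtDB or Plone API: translate a
# ZCatalog-style query dict into parameterized SQL over Newt DB's JSONB
# ``state`` column.  The ``newt`` table (zoid, class_name, state) follows
# the Newt DB docs; the query translation here is illustrative.

def catalog_query_to_sql(query):
    """Build a parameterized SELECT from a simple catalog-style query dict."""
    clauses, params = [], []
    for index, value in sorted(query.items()):
        if isinstance(value, (list, tuple)):
            # Keyword-index style query: match any of several values.
            placeholders = ", ".join(["%s"] * len(value))
            clauses.append("state ->> %s IN ({})".format(placeholders))
            params.append(index)
            params.extend(value)
        else:
            # Field-index style query: exact match.
            clauses.append("state ->> %s = %s")
            params.append(index)
            params.append(value)
    sql = "SELECT zoid FROM newt WHERE " + " AND ".join(clauses)
    return sql, params
```

The appeal is that "adding an index" could then simply mean issuing a `CREATE INDEX` on the corresponding JSONB expression, rather than maintaining a separate persistent data structure.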

That could be a (maybe too difficult) GSoC project :slight_smile:

Any search improvement for Plone OOTB experience would be great!

A Postgres-based indexing/search server might be a nice option, but no replacement for a real indexing server such as Solr or ElasticSearch IMHO. Postgres covers the basics of full-text search these days, but it is still far from what Solr/ES offer. Since all of those options come with the downside of an additional system that needs to be set up, I would prefer Solr/ES every time.
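For reference, the Postgres full-text "basics" look roughly like this (table and column names are illustrative):

```sql
-- Illustrative only: built-in Postgres full-text search with ranking.
SELECT title
  FROM documents
 WHERE to_tsvector('english', body)
       @@ plainto_tsquery('english', 'catalog replacement')
 ORDER BY ts_rank(to_tsvector('english', body),
                  plainto_tsquery('english', 'catalog replacement')) DESC;
```

This covers stemming, stop words, and ranking, but not the faceting, suggesters, or relevance tuning that Solr/ES provide out of the box.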

If you have a system that already uses a relational database, those trade-offs are different of course. If we were to move Plone entirely to NewtDB, that would be a different discussion as well. For the Plone we have right now, I don't think it makes a lot of sense.

For the Plone we have right now, I don't think it makes a lot of sense.

I agree. This only makes sense if it's a complete replacement.

The advantage of this approach is that we no longer have an extra data structure (the catalog) that everyone is writing to at the same time in the database, so write performance should increase and conflicts should decrease. Current Solr and Elasticsearch solutions still use the ZCatalog for non-full-text queries.

However, the disadvantage is that you lose the ZODB's caching for reads on simple catalog queries that aren't full text. collective.elasticsearch originally replaced the catalog completely. However, I discovered that for simple queries, on most sites, it was faster to use Plone's catalog, so I changed it to use Elasticsearch only for full-text queries.
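That hybrid strategy can be sketched as a simple dispatcher. The backends here are plain callables standing in for the real ZCatalog and Elasticsearch integrations, and the set of full-text index names is an assumption; this is not actual collective.elasticsearch code.

```python
# Illustrative sketch of the hybrid strategy described above; the
# backends are plain callables standing in for the real ZCatalog and
# Elasticsearch integrations (not actual collective.elasticsearch code).

FULLTEXT_INDEXES = {"SearchableText", "Title", "Description"}  # assumed names

def search(query, catalog, text_engine):
    """Route a query to whichever backend usually answers it fastest."""
    if FULLTEXT_INDEXES & set(query):
        # Full-text terms: hand the whole query to the external engine.
        return text_engine(query)
    # Simple field/keyword queries: the ZODB object cache usually makes
    # the local catalog faster for these, per the experience above.
    return catalog(query)
```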

1 Like

Here's why it makes sense as an optional replacement.

  • It addresses concerns that data are imprisoned in Plone. It makes the data available to other applications, including non-Python applications.

  • It can drastically reduce the number of objects managed at the ZODB level. In a Pyramid application I'm helping with, >80% of the objects in the database are there to support indexing.

  • It can drastically improve search performance, depending on the size of the database and the search criteria.

I agree that options to use other sorts of indexes would be useful too. Newt DB provides some helpers to enable that: http://www.newtdb.org/en/latest/topics/following.html

Yup, YMMV. Of course, this isn't an all or nothing decision.

So far, for the examples I've worked on, Postgres has been far faster than the catalog when the ZODB object cache is cold, and slower when it is hot for some, but not all, queries. A lot of this depends on the size of your working set.

I happen to be working on this ATM. :slight_smile:

3 Likes

Hi Jim,

first of all, thank you for working on NewtDB and sharing it with us! This is a fantastic option for the future of Zope/Plone-based projects!

This is true only if we completely rely on NewtDB, right? I mean if you store the index on Postgres, you have all your indexed data, but that doesn't necessarily include ALL the data you usually store in the ZODB. I agree that this might be a selling point for certain projects, though.

Sure. This is also true for an external Solr/ES option (at least if you entirely remove the old catalog).

My intention was not to say that using NewtDB is a bad option for storing catalog data in Plone. The very opposite is true, I think it is a very good option! My point is just that Solr/ES are still a better option when it comes to full-text search capabilities.

Wrong. All of the data is reflected as JSON (except by default for BTree data and blob pointers) regardless of where you index data.

I wouldn't be surprised, but Newt can help with that too. And of course, Postgres is good enough for many applications and potentially easier to manage than managing both Postgres and ES.

BTW, when you search PG with newt, you get back objects that integrate with the regular ZODB cache.

As an FYI, we're using Newt in the KARL project, so we have some experience with its impact. In a nutshell: it's pretty insanely good:

  1. All ZODB content reflected into JSONB, automatically
  2. Much faster queries when caches are semi-warm or cold
  3. Much lower memory footprint
  4. Ad-hoc query support

Since we were already on RelStorage atop PG, it was very simple. All of our persistent objects are now reflected, at no work to us, into a JSONB column with indexes. We also have expression indexing, which gives us things such as: the equivalent of hierarchical ACL filtered search results in a single SQL trip to the server, concatenated fields for text extraction, etc.
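Roughly what that expression indexing looks like in SQL, assuming Newt DB's default `newt` table with a JSONB `state` column; the index names and JSON keys here are illustrative, not KARL's actual schema:

```sql
-- Sketch of the expression indexing described above, against Newt DB's
-- default table (zoid, class_name, state jsonb).  Names are illustrative.

-- Plain JSONB expression index for a common catalog-style field:
CREATE INDEX newt_portal_type_idx ON newt ((state ->> 'portal_type'));

-- GIN index over concatenated fields for Postgres full-text search:
CREATE INDEX newt_text_idx ON newt USING gin (
    to_tsvector('english',
                coalesce(state ->> 'title', '') || ' ' ||
                coalesce(state ->> 'text', ''))
);
```

Because these are ordinary Postgres indexes on expressions over the reflected JSON, they need no application-side maintenance once created.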

We were already using (for better or worse) PG for text indexing...I agree that this is not the strongest benefit. Much better than zc.textindex but not competitive with SOLR etc.

Still, Plone would be crazy not to look at this. You're not going to rewrite Plone to get rid of the other parts of the catalog. I think Jim is working on a catalog-like shim. If so...Newt is a better catalog for any site that has a non-trivial amount of content and traffic. If you're already doing RelStorage and PG...think it over.

1 Like