GSOC 2019: Guillotina Object Server

ramon · February 3, 2019, 11:36am

Summary
Guillotina uses a PostgreSQL-compatible database as the source of truth and Elasticsearch to index the PostgreSQL JSONB data. This project involves creating a new component that handles storage and indexing in a new, distributed piece of software: Guillotina Object Server. The main values of this project are transactional support, distribution of information, and indexing technology written in a low-level language: Rust.

Implementation
The aim is to build a small and fast component that handles transactions and keyword/full-text searches from Guillotina. There is already a basic implementation that shows the overall design of the concept, with a Protocol Buffers protocol, Rust bindings to RocksDB, and basic integration with Tantivy (a full-text indexer). Built on top of the Tokio async framework for Rust and the Protocol Buffers protocol, this project has four goals:

Storing Python pickles in RocksDB, mapping the transaction mechanism from Guillotina. It needs to provide the same API as Guillotina's internal storage, so that each object is stored in a secure hierarchical structure.
Storing the indexing information of each transaction in RocksDB and Tantivy to support keyword indexes and a full-text index.
Providing an interface to search on keyword indexes and run full-text searches with the security checks.
Providing an interface to load security information and pickles using Guillotina's internal storage API.
Optionally, providing a Raft protocol to sync multiple instances of the object server and distribute the load would be great.
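
To make the goals above more concrete, here is a minimal Python sketch of the idea, not the actual object server. Everything in it is an assumption for illustration: the names `ObjectStoreSketch`, `Transaction`, `put`, `search` and `load` are hypothetical, plain dicts stand in for RocksDB and Tantivy, and the security check is reduced to a set of principals allowed to read each object.

```python
import pickle


class ObjectStoreSketch:
    """Toy stand-in for the proposed server: dicts replace RocksDB/Tantivy."""

    def __init__(self):
        self._objects = {}   # hierarchical key -> pickled bytes
        self._keywords = {}  # (field, value) -> set of keys (keyword index)
        self._readers = {}   # key -> set of principals allowed to read it

    def begin(self):
        return Transaction(self)

    def search(self, field, value, principal):
        """Keyword search that only returns keys the principal may read."""
        keys = self._keywords.get((field, value), set())
        return sorted(k for k in keys if principal in self._readers[k])

    def load(self, key, principal):
        """Load and unpickle an object after a (simplified) security check."""
        if principal not in self._readers.get(key, set()):
            raise PermissionError(key)
        return pickle.loads(self._objects[key])


class Transaction:
    """Buffers writes so objects and index entries land together on commit."""

    def __init__(self, store):
        self._store = store
        self._pending = []

    def put(self, key, obj, keywords, readers):
        self._pending.append(
            (key, pickle.dumps(obj), dict(keywords), set(readers))
        )

    def commit(self):
        # Note: removing stale index entries when an object is updated
        # is omitted to keep the sketch short.
        store = self._store
        for key, blob, keywords, readers in self._pending:
            store._objects[key] = blob
            store._readers[key] = readers
            for field, value in keywords.items():
                store._keywords.setdefault((field, value), set()).add(key)
        self._pending.clear()

    def abort(self):
        self._pending.clear()
```

A possible usage, again with made-up keys and principals: open a transaction with `txn = store.begin()`, call `txn.put("/db/container/doc1", {"title": "Hello"}, {"type": "Document"}, {"alice"})`, and nothing is visible to `store.search("type", "Document", "alice")` until `txn.commit()` runs. A real implementation would also need durability, isolation between concurrent transactions, and full-text indexing, none of which this sketch attempts.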

zopyx · February 3, 2019, 4:12pm

Why do you need a second indexing component besides Elasticsearch? Or is this intended as a replacement for ES?

I do not see any reference in Tantivy to support for multiple languages.

Distributed search, replication and scaling seem to be out of scope for Tantivy.

Why are the search capabilities of Postgres not good enough here? Which problems do you want to solve with Tantivy that cannot be solved with Postgres and/or Elasticsearch?

kakshay21 · February 3, 2019, 5:06pm

That's why it uses Protocol Buffers, a language-neutral, platform-neutral, extensible mechanism for serializing structured data. It supports many languages.
Apart from a lower startup time and a smaller memory footprint, it has many cool features.
BTW, Elasticsearch is built on Apache Lucene, and Tantivy clearly performs better. Do check out the benchmark "Search benchmark, the game".
Regarding the "Distributed search, replication and scaling" part, that's one of the downsides.

zopyx · February 3, 2019, 5:46pm

Speed and "cool features" are only one side of the coin.
A search solution that does not support language-specific indexing and that does not come with a scaling story is in general pointless for building scalable solutions. And all the "cool features" are basically what you get with ES.

kakshay21 · February 3, 2019, 6:35pm

I totally agree with your point, but it is a very easy task to add language indexing support for any language. Just write a .proto description of the data structure you wish to store for your query in Rust and store the query inside this protocol buffer. Now you can send this protocol buffer as a response instead of JSON/XML and then use it in any language.

kakshay21 · February 3, 2019, 6:48pm

Okay, I just checked: Protocol Buffers does not officially support Rust yet. Now I don't know the reason for using it.

zopyx · February 4, 2019, 3:55am

I am talking about indexing support specific to natural languages like German, English and so on.

There is not much in this proposal that would bring any significant improvement for Plone or Guillotina.

-aj

zopyx · February 4, 2019, 4:16am

Another objection is storing pickles. I think a lesson from 20 years of ZODB is that it is not necessarily a good idea to store a language-specific serialization format in a database as blobs. Using JSON/JSONB in Postgres is OK, but using yet another external storage as a pickle graveyard makes little sense in 2019.
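
The difference between the two formats can be illustrated with a small Python sketch. The `record` value is made up for the example; the only calls used are the standard library's `pickle.dumps`/`pickle.loads` and `json.dumps`/`json.loads`.

```python
import json
import pickle

# Hypothetical content object, used only for illustration.
record = {"title": "Hello", "tags": ["a", "b"]}

# Pickle: an opaque, Python-specific byte string. Other languages (and
# database-side queries) cannot easily look inside it.
as_pickle = pickle.dumps(record)

# JSON: plain text that any language can parse, and that Postgres can
# store as JSONB and query by field.
as_json = json.dumps(record, sort_keys=True)

# Both formats round-trip within Python itself.
assert pickle.loads(as_pickle) == record
assert json.loads(as_json) == record
```

Both round-trip fine inside Python, which is why the trade-off only shows up once another language or the database itself needs to read the stored data.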

Did you look at CrateIO? It is a distributed database based on Elasticsearch, scalable like hell, with an SQL API and a Postgres-protocol-compliant binary interface.