Relation between ZCTextIndex and its Lexicon

jugmac00 · December 5, 2019, 11:10am

After reading https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html I now know that a Lexicon can normalize words, remove stop words, and store the words...

But how does the ZCTextIndex interact with the Lexicon?

Extremely simplified, I understand that an index is like a dictionary, e.g. mapping words to pages/documents.

As I am missing the connection between index/lexicon, I also cannot decide whether it makes sense to have one lexicon per index, or maybe have one global lexicon per site for all indexes.

For example, say I have an index (or one per attribute) for a Company data class, and another for an Employee data class.

Should I have one global lexicon? One for company, one for employee? Or even one for each attribute?

Thanks for any hints!

zopyx · December 5, 2019, 12:13pm

Usually you are good with the default catalog configuration. We never had to adjust the full text configuration. Dedicated lexicons don't make much sense since you can always filter by additional conditions like filtering by portal_type. So a global lexicon is usually good enough. Your mileage may differ when it comes to multilingual content etc.

Apart from that: the full text engine of Plone/Zope is pretty dumb, 20 years old and far from state of the art. It does not handle normalization, stemming or language-specific aspects. Also, the ranking of content is horribly bad and far from what users expect today.

I created TextIndexNG3 for this purpose, and it was always superior to the Zope full text search engine. However, the project is dead with Plone 5.2, and in particular with Plone 5.2 under Python 3.

Nowadays you want to go with Solr or Elasticsearch if you really need a decent full text solution.

jugmac00 · December 5, 2019, 1:02pm

Thanks Andreas, that helps a bit. I am still not sure why a Lexicon has its own word list when (I assume) an index also has a "word -> result" hashmap / dictionary of some kind. Why store the words twice? But maybe that is an implementation detail I can ignore.

And yes, I know of TextIndexNG3, I read your EOL announcement (TextIndexNG3 end of lifetime announcement).

The application I have to maintain has almost 900 instances of TextIndexNG3, and replacing them brings me a big step closer to my goal, i.e. updating the Zope app to Python 3.

Due to time constraints I still think replacing TextIndexNG3 with ZCTextIndex is the best way to go for now.

When I have a look at the code which generated the TextIndexNG3 instances, it looks like not much of its features were actually activated - so my best guess is my users won't notice the replaced indexer at all (fingers crossed).

(The first three config options were based on meta class info, the rest were the same for each attribute)

addChars = self.attributeTag(klass, attr, 'txng3_splitter_additional_chars')
useNormalizer = self.attributeTag(klass, attr, 'txng3_use_normalizer')
autoExpand = self.attributeTag(klass, attr, 'txng3_autoexpand')
catalog_record = CatalogRecord(
    fields=(getter,),
    default_encoding='utf-8',
    splitter_additional_chars=addChars,
    languages=('de',),
    use_normalizer=useNormalizer,
    use_stemmer=0,
    use_stopwords=0,
    splitter='txng.splitters.default',
    splitter_casefolding=1,
    dedicated_storage=0,
    autoexpand=autoExpand,
    autoexpand_limit=3,
    query_parser='txng.parsers.en',
    storage='txng.storages.default',
    lexicon='txng.lexicons.default')

zopyx · December 5, 2019, 1:50pm

There is no duplication.

The lexicon maps a term/word to an ID.

The fulltext index basically knows per document (document id) about the word ids contained in a document.

If you search for a particular word, the catalog looks up the word id first and then asks the related index for all documents (their document ids) that contain this particular word id.

So the fulltext index is just a data structure with a mapping document id -> word ids and vice versa.
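
To make the division of labour described above concrete, here is a toy sketch in plain Python. It is not the actual ZCTextIndex code: the class and method names (ToyLexicon, ToyTextIndex, source_to_wids, term_to_wids) are made up for illustration, and normalization is reduced to lowercasing and splitting on whitespace. It only shows the idea that the lexicon owns the word -> word id mapping, while each index stores nothing but ids.

```python
from collections import defaultdict


class ToyLexicon:
    """Hypothetical lexicon: maps each normalized word to an integer word id."""

    def __init__(self):
        self._wids = {}  # word -> word id

    def _normalize(self, text):
        # Stand-in for a real splitter/normalizer pipeline.
        return text.lower().split()

    def source_to_wids(self, text):
        # Indexing time: unseen words get a new word id.
        return [self._wids.setdefault(word, len(self._wids) + 1)
                for word in self._normalize(text)]

    def term_to_wids(self, text):
        # Query time: look up only; unknown words map to 0.
        return [self._wids.get(word, 0) for word in self._normalize(text)]


class ToyTextIndex:
    """Hypothetical index: stores only word ids and document ids, no words."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._wid_to_docids = defaultdict(set)  # word id -> document ids
        self._docid_to_wids = {}                # document id -> word ids

    def index_doc(self, docid, text):
        wids = self.lexicon.source_to_wids(text)
        self._docid_to_wids[docid] = wids
        for wid in wids:
            self._wid_to_docids[wid].add(docid)

    def search(self, query):
        # Translate the query to word ids first, then intersect the
        # document sets (all query words must match).
        result = None
        for wid in self.lexicon.term_to_wids(query):
            docids = self._wid_to_docids.get(wid, set())
            result = docids if result is None else result & docids
        return set(result or ())


# One lexicon shared by two indexes, as in the Company/Employee question.
lexicon = ToyLexicon()
company_index = ToyTextIndex(lexicon)
employee_index = ToyTextIndex(lexicon)

company_index.index_doc(1, "Acme Software GmbH")
company_index.index_doc(2, "Acme Hardware")
employee_index.index_doc(10, "Software developer")

print(company_index.search("acme"))       # documents 1 and 2
print(employee_index.search("software"))  # document 10
print(company_index.search("developer"))  # empty: word is known, but not in this index
```

In this sketch a shared lexicon does not mix up results: "developer" has a word id because the employee index saw it, but the company index has no documents for that id. The words themselves are stored once, in the lexicon, and the indexes only hold integers.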