How does site search calculate 'relevance' for search results?

I've been digging into this to try to answer a client query:

'How does the site search rank results?'

As a first pass, it seems that everyone just says, 'well, if you don't provide a sort_on field, it ranks by relevance'.

...but what does that actually mean?

When you step into the catalog source, there's lots of talk about relevance values, but the only place I can find it actually implemented is in Products.ZCTextIndex, where it says:

Actually, ZCTextIndex gives you a choice of two scoring algorithms
from recent literature: the Cosine ranking from the Managing
Gigabytes book, and Okapi from more recent research papers. Okapi
usually does better, so it is the default (but your milage may
vary).

So... the Wikipedia article on Okapi BM25 seems pretty comprehensive, but...

so it is the default...

hm...

Time to jump into the site catalog and have a look.

Name(s) of attribute(s) indexed: SearchableText
Index type: Okapi BM25 Rank

Which was nice to know, but... then I started reading the massive brain dump in https://github.com/zopefoundation/Products.ZCTextIndex/blob/master/src/Products/ZCTextIndex/OkapiIndex.py#L193, and https://github.com/zopefoundation/Products.ZCTextIndex/blob/master/src/Products/ZCTextIndex/tests/testZCTextIndex.py#L484, and now I'm somewhat less convinced.

Rather than using a standalone implementation of Okapi that I can point to, this is an embedded, vaguely tested algorithm that seems to... something...

However, if you compare the assumptions made to some reference implementation (https://apache.googlesource.com/lucene-solr/+/trunk/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java) there are pretty big differences...
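For comparison, here is a minimal sketch of textbook BM25 as I understand it from the Wikipedia article. This is not the code in OkapiIndex.py and not a port of the Lucene class; the function name bm25_scores, the k1=1.2 and b=0.75 defaults, and the pre-tokenized list-of-words input are all my own assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document.

    docs is a non-empty list of token lists (hypothetical input format).
    k1 and b are commonly cited defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, so it never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["plone", "search", "search"],
    ["plone", "catalog"],
    ["zope", "index"],
]
# Rank document positions by descending score for the query "search".
scores = bm25_scores(["search"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Having something this small to hand at least makes it easier to spot where an implementation deviates, for example in how IDF is computed or how document length is normalized.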

So basically:

Does anyone actually know if the implementation in OkapiIndex.py is solid?

Who wrote it? Did they know what they were doing?

Why is it using an ad hoc implementation instead of an external library?

Looking at the commit logs I have no idea how to find out this information. The earliest commit seems to have been pulled out of some other package. Where did it come from?

Any help much appreciated~

Plone does not provide a full-text search that anyone with Google-like searches in mind would find acceptable.

You go with either Solr or Elasticsearch.

Our approach with collective.elasticindex:

http://public.zopyx.com/search2.mp4

-aj
