AdvancedQuery different result set if sorting is applied

frisi · November 15, 2018, 4:59pm

Just stumbled upon this in a unit test (happens with two and three items of a content type).

when i supply a sort_order for a query i get different results. one brain is missing, another one gets duplicated:

>>> [brain.id for brain in catalog.evalAdvancedQuery(qry)]
['item', 'other-item']
>>> [brain.id for brain in catalog.evalAdvancedQuery(qry, (('start', 'asc'),))]
['item', 'item']

tried with advancedquery 3.0.2, 3.0.3, 3.0.4
@dieter do you have any idea why this could happen?

dieter · November 15, 2018, 7:29pm

AdvancedQuery uses two different algorithms for sorting - one of them can cause duplicates (and as many misses as duplicates have been introduced) if a document is indexed under different values (either because the index is a KeywordIndex or the index has become inconsistent). This algorithm is used when the index has few values compared to the number of hits. The algorithm sorts the index values and then generates the documents by intersecting the corresponding document lists with the search result set. A duplicate arise when a document from the search result happens to be in two or more document lists. The misses occur because the sorting stops as soon as the number of hits has been reached.

frisi · November 15, 2018, 8:50pm

thanks for your detailed explanation @dieter.

we're sorting plone.app.events by start date here. the index used for sorting is a DateRecurringIndex.
so duplicate entries for the same brain are by design.

not sure what what you mean with:

do we get rid of duplicates (and therefore also misses) as soon as we have a certain number of objects indexed?
or do we still get them if there are other query parameters (eg path, subject, etc) that limit the result set to a view items before sorting them?

is there a way to get rid of of this behaviour in unittests with just a few events?

frisi · November 16, 2018, 9:23am

as a temporary workaround i patched Products.AdvancedQuery.sorting.IndexSorter by either commenting out the first block (if ns >= 0.1 * ni:) or by adding a condition that the result set must have more than 100 items

class IndexSorter(Sorter):
  '''sorting with respect to an index.'''

  def __init__(self, sortIndex, sortReverse):
    self._sortIndex = sortIndex; self._sortReverse = sortReverse

  def group(self, seq):
    sortIndex = self._sortIndex; sortReverse = self._sortReverse
    ns = len(seq); ni = len(sortIndex)
    if ns > 100 and ns >= 0.1 * ni:
      # use index sorting for result sets larger than 100 items
      ...

i guess the ns>100 only fixes my test with a view items, but if we have more than 100 events (or occurrences of events) we still get duplicate occurrences and other results get omitted.
is that correct @dieter?
what would be the drawback of removing the block completely?

petschki · November 16, 2018, 10:14am

maybe off topic, but in a project I just got rid of AvancedQuery by using Products.ZCatalog==3.1.3 in Plone 4.3.x ... since ZCatalog version 3.x there is support for multi sorting and "not" queries ... maybe worth a try?

dieter · November 16, 2018, 11:18am

Apparently, you have observed this problem in a test suite. In such a setup, one can (almost surely) exclude an index corruption. This would mean that your index indexes the duplicate objects under different values. You should NOT use such an index for sorting, as it is unclear which of the different values should be used for the sorting.

If you restrict sorting to indexes which index an object under at most one value, then you do not need a workaround.

Note also that your workaround removes the duplicates (as you wrote "at least in your test cases") but it does not ensure that the sorting result is meaningful. That you have observed duplicates means that at least some documents (those which resulted in duplicates) have multiple values related to the sort index, i.e. the "keyForDocument" will return something containing different indexed values. These value aggregates will be used for the sorting - which may not at all be meaningful.

frisi · January 29, 2019, 12:40pm

thanks for your suggestion @petschki. i'm not sure if updating Products.ZCatalog to 3.1.3 for plone 4.3.18 (zope 2.13.27) is possible w/o any unwanted side-effects or compatiblitity problems.
do you use it in production @petschki? @icemac: can you thinks of any problems using ZCatalog 3.1.x with Zope 2.13?

Rotonen · January 29, 2019, 12:57pm

At least zCatalog 3.0.x is fine for Plone 4.3.

frisi · January 29, 2019, 12:59pm

The index used is Products.DateRecurringIndex (GitHub - collective/Products.DateRecurringIndex: Zope 2 catalog index with support for indexing of recurring events, following the icalendar standard (part of Plone core)) which is part of plone.app.event

The Index will index the same brain for different dates.
This allows to filter for a certain time-range.
One has to use the expand_events method (plone.app.event/plone/app/event/base.py at 3.2.1 · plone/plone.app.event · GitHub) to turn the result set into objects (and possibly create multiple occurrences of a single brain) and sort them correctly.

Thinking over this again you are completely right.
we need to expand (and sort) the result set using plone.app.event.base.expand_events anyway. we only use the DateRecurringIndex for filtering, there is no need to sort the result set beforehand.

Thanks for your replies @dieter and @petschki

icemac · January 30, 2019, 6:38am

We were using Products.ZCatalog 3.0.2 together with Zope 2.13.28. I have no experience in the wild with 3.1.x.

frisi · January 30, 2019, 10:27am

thanks @icemac! and @Rotonen