Catalog result count by index value (portal_type)

jaroel · November 7, 2017, 1:14pm

Hi,

We've got a use case where we would like to show the number of search results per index value (portal_type in this case), i.e.

result_count = {
    'File': 4,
    'Images': 6,
    'Droids': 0,
}

Do we happen to have an existing (performant) solution to this, or do I just need to loop over the result set and add a counter?

zopyx · November 7, 2017, 1:39pm

You should be fast enough iterating over all brains and building a mapping portal_type -> counter.
This should be fast even for a few thousand brains. itertools.groupby() might also be helpful here.
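A minimal sketch of the groupby() variant, assuming stand-in objects instead of real brains (the `Brain` namedtuple and the sample data are hypothetical). Note that groupby() only merges adjacent items, so the input has to be sorted by the same key first:

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Hypothetical stand-in for catalog brains; real brains come from a query.
Brain = namedtuple("Brain", ["portal_type"])
brains = [Brain("Folder"), Brain("Document"), Brain("Folder"), Brain("Image")]

by_type = attrgetter("portal_type")

# Sort first: groupby() only groups consecutive items with equal keys.
counts = {
    portal_type: sum(1 for _ in group)
    for portal_type, group in groupby(sorted(brains, key=by_type), key=by_type)
}
print(counts)
```

The sort makes this O(n log n), so a plain counting loop is likely the simpler choice when only the totals are needed.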

-aj

davisagli · November 7, 2017, 1:59pm

catalog.Indexes['portal_type'].uniqueValues(withLengths=True)

This is a raw count from inspecting the internals of the index. As such, it requires loading the entire index from the ZODB, and won't work if you also need to apply other indexes to limit the results. But it should be faster than doing a query and looping over the results.

jaroel · November 7, 2017, 2:26pm

As we do depend on other indexes, I ended up using something like

types = (x.portal_type for x in brains)
counter = collections.Counter(types)

which gives me

Counter({'Folder': 2, 'Document': 1, 'Image': 1})
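The original question also listed a type with zero results. A Counter returns 0 for keys it has not seen, so a sketch like the following could fill those in, assuming you know the list of types to report (`wanted_types`, the `Brain` namedtuple and the sample data are hypothetical):

```python
import collections

# Hypothetical stand-in for catalog brains; real brains come from a query.
Brain = collections.namedtuple("Brain", ["portal_type"])
brains = [Brain("Folder"), Brain("Document"), Brain("Folder"), Brain("Image")]

counter = collections.Counter(brain.portal_type for brain in brains)

# Assumed list of types to report on, including ones with no hits.
wanted_types = ["Folder", "Document", "Image", "File"]

# Looking up a missing key in a Counter yields 0 instead of raising KeyError.
result_count = {portal_type: counter[portal_type] for portal_type in wanted_types}
print(result_count)
```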

dieter · November 7, 2017, 2:33pm

The catalog search result is in fact a LazyMap, i.e. a sequence (of catalog record ids) and a function. On access to a sequence component, the function is transparently applied to the base component (which gives you the catalog proxy/brain). Using the above internal implementation details, you can access the raw search result (the set of catalog record ids) and "and" it with the sets for the various values in the index you are interested in to obtain the count values.

Whether this is faster than looping over the proxies depends on the size of the result set and the size of the index.
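The idea can be sketched with plain Python sets, without touching any catalog internals. All data here is hypothetical: `result_rids` stands for the record ids of the search result and `index_rids` for the per-value record id sets held by the index:

```python
# Hypothetical record ids matched by the search.
result_rids = {1, 2, 3, 5, 8}

# Hypothetical index content: for each value, the record ids carrying it.
index_rids = {
    "Document": {1, 4, 8},
    "Folder": {2, 3},
    "Image": {6, 7},
}

# One intersection per index value; its size is the count for that value.
counts = {value: len(result_rids & rids) for value, rids in index_rids.items()}
print(counts)
```

This does one intersection per distinct value in the index, which is why the cost depends on both the result set and the index size.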

espenmn · November 7, 2017, 4:28pm

Slightly related:

does this, ( like here: http://www.ektedata.no/innhold/oppgavebank )

jaroel · November 7, 2017, 5:40pm

That would be something like this, methinks.

        # This takes the catalog record ids from the result set (_seq) and
        # intersects them with the record ids stored per value in the index
        # (_index[portal_type]).
        portal_catalog = getToolByName(portal, 'portal_catalog')
        _index = portal_catalog.Indexes['portal_type']._index
        # We get a list of tuples when searching.
        _seq = set(
            x[1] if isinstance(x, (list, tuple)) else x
            for x in self.lazy_resultset._seq
        )
        counter = {}
        for portal_type in _index:
            rids = _index[portal_type]
            # A value with a single record may be stored as a bare int.
            index_values = {rids} if isinstance(rids, int) else set(rids)
            counter[portal_type] = len(_seq & index_values)

I think this should perform better for large result sets, though I haven't benchmarked it.

djay · November 8, 2017, 6:07am

@jaroel There is a builtin function which gives you exactly this efficiently.

github.com

zopefoundation/Products.ZCatalog/blob/master/src/Products/PluginIndexes/unindex.py#L142


    self._index = OOBTree()
    self._unindex = IOBTree()


    if self._counter is None:
        self._counter = Length()
    else:
        self._increment_counter()


def __nonzero__(self):
    return not not self._unindex


def histogram(self):
    """Return a mapping which provides a histogram of the number of
    elements found at each point in the index.
    """
    histogram = {}
    for item in self._index.items():
        if isinstance(item, int):
            entry = 1  # "set" length is 1
        else:
            key, value = item

Almost completely undocumented, but we make use of it in production. We do reports by combining different facets into a combined field and then walking over the histogram of those unique values to give grouping totals.

jaroel · November 8, 2017, 9:42am

@djay
Did you see I need to check this against the result set?
As far as I can see, this doesn't do that?

djay · November 15, 2017, 2:51am

Sorry, I missed that. If you want to count a filtered search result then that method doesn't work. What you would need to do is build that into the index you are counting. What I did was create a special groupby index with values like "Male|USA|White" etc. and only had values on items I wanted to count. Then I iterate over the histogram rather than over the whole DB to count things like the number of males or the number of white males. So if you have far fewer combined groupings than records, this works well. The tradeoff is that adding another facet requires a huge reindex.
Not sure if this helps your use case though.
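A rough sketch of how totals could be derived from such combined values, assuming you already have a count per combined value (however it was obtained from the index). The facet names, the `facet_totals` helper and the sample numbers are all hypothetical:

```python
# Hypothetical counts per combined index value.
combined_counts = {
    "Male|USA|White": 3,
    "Male|USA|Black": 2,
    "Female|USA|White": 4,
    "Male|Norway|White": 1,
}

# Assumed order of the facets inside each combined value.
FACETS = ("gender", "country", "ethnicity")


def facet_totals(combined_counts, **wanted):
    """Sum the counts of all combined values matching the wanted facets."""
    total = 0
    for combined, count in combined_counts.items():
        values = dict(zip(FACETS, combined.split("|")))
        if all(values[name] == value for name, value in wanted.items()):
            total += count
    return total


males = facet_totals(combined_counts, gender="Male")
white_males = facet_totals(combined_counts, gender="Male", ethnicity="White")
print(males, white_males)
```

The loop runs over the distinct combined values only, not over every record, which matches the point about having far fewer groupings than records.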