Item indexes cause catalog-rebuild to fail

We have a strange issue with a migrated custom content type. An old Plone 3 site was migrated to Plone 5; during the migration we look up the old items and re-create them with the Plone API so that each one gets a new catalog entry. We also copy the values of the schema-specific fields from the old content type to the new one. For some reason, the indexes for title and, in this case, lastname cause a Unicode error when rebuilding the catalog or saving an item. When these indexes are removed, we can both update the field and rebuild the catalog. If we re-add the indexes, it fails again.

My guess is that a unicode-to-string comparison is failing somewhere, but it's causing quite a headache to fix.

Here is the traceback that happens:

../.buildout/eggs/z3c.form-3.2.11-py2.7.egg/z3c/form/util.py:180: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (not dm.canAccess() or dm.query() != value):
../.buildout/eggs/Products.ZCatalog-3.0.2-py2.7.egg/Products/PluginIndexes/common/UnIndex.py:193: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  indexRow = self._index.get(entry, _marker)
[5] > ../.buildout/eggs/Products.ZCatalog-3.0.2-py2.7.egg/Products/PluginIndexes/common/UnIndex.py(193)insertForwardIndexEntry()
-> indexRow = self._index.get(entry, _marker)
Traceback (innermost last):
  Module ZPublisher.Publish, line 138, in publish
  Module ZPublisher.mapply, line 77, in mapply
  Module Products.PDBDebugMode.runcall, line 70, in pdb_runcall
  Module ZPublisher.Publish, line 48, in call_object
  Module plone.z3cform.layout, line 66, in __call__
  Module plone.z3cform.layout, line 50, in update
  Module plone.dexterity.browser.edit, line 58, in update
  Module plone.z3cform.fieldsets.extensible, line 59, in update
  Module plone.z3cform.patch, line 30, in GroupForm_update
  Module z3c.form.group, line 145, in update
  Module plone.app.z3cform.csrf, line 21, in execute
  Module z3c.form.action, line 98, in execute
  Module z3c.form.button, line 315, in __call__
  Module z3c.form.button, line 170, in __call__
  Module plone.dexterity.browser.edit, line 30, in handleApply
  Module z3c.form.group, line 126, in applyChanges
  Module zope.event, line 31, in notify
  Module zope.component.event, line 24, in dispatch
  Module zope.component._api, line 136, in subscribers
  Module zope.component.registry, line 321, in subscribers
  Module zope.interface.adapter, line 585, in subscribers
  Module zope.component.event, line 32, in objectEventNotify
  Module zope.component._api, line 136, in subscribers
  Module zope.component.registry, line 321, in subscribers
  Module zope.interface.adapter, line 585, in subscribers
  Module plone.dexterity.content, line 774, in reindexOnModify
  Module Products.CMFCore.CMFCatalogAware, line 88, in reindexObject
  Module Products.CMFCore.CatalogTool, line 301, in reindexObject
  Module Products.CMFPlone.CatalogTool, line 351, in catalog_object
  Module Products.PDBDebugMode.zcatalog, line 20, in catalog_object
  Module Products.ZCatalog.ZCatalog, line 476, in catalog_object
  Module Products.ZCatalog.Catalog, line 360, in catalogObject
  Module Products.PluginIndexes.common.UnIndex, line 216, in index_object
  Module Products.PluginIndexes.common.UnIndex, line 242, in _index_object
  Module Products.PluginIndexes.common.UnIndex, line 193, in insertForwardIndexEntry
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
[34] > ../.buildout/eggs/Products.ZCatalog-3.0.2-py2.7.egg/Products/PluginIndexes/common/UnIndex.py(193)insertForwardIndexEntry()
-> indexRow = self._index.get(entry, _marker)

The strange thing is that special characters such as umlauts fail whether the field value is encoded or decoded. I have no idea what's going on.
I also tried to fix any string encoding errors, with little success. This is the code I ran on the content type:

def contact_cleanup(self):
    from plone import api
    from plone.protect.interfaces import IDisableCSRFProtection
    from zope.interface import alsoProvides
    import transaction
    # Disable CSRF protection so this request is allowed to write.
    alsoProvides(self.request, IDisableCSRFProtection)
    contacts = api.content.find(portal_type='Contact')
    contacts = [contact.getObject() for contact in contacts]
    for item in contacts:
        setattr(item, 'sortname', '')
        # Encode unicode values to UTF-8 byte strings.
        if isinstance(item.firstname, unicode):
            setattr(item, 'firstname', item.firstname.encode('utf-8'))
        if isinstance(item.lastname, unicode):
            setattr(item, 'lastname', item.lastname.encode('utf-8'))
        item.reindexObject(idxs=('sortname', 'firstname', 'lastname'))
    transaction.commit()

If anyone has ideas about why this may be failing, I'd appreciate some insight. The main issue is that the site is already migrated: normally it would be better to go back, debug and fix the migration itself, but we now have to fix this post-migration...

Cheers peeps!

This can happen if some items are returning unicode values to be indexed and some are returning encoded strings. Generally speaking the safest convention is to give encoded strings to the catalog indexes, which means you may need to write an indexer which ensures they are encoded. If it is a Dexterity content type then the item itself stores unicode which needs to be encoded by the indexer. If it is an Archetype then what you get depends on whether you're accessing the value directly (unicode) or using the accessor (encoded). Hope this helps point you in the right direction...
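A minimal sketch of the "always hand encoded strings to the catalog" convention described above. The helper name to_utf8 is hypothetical (it is not part of Plone or plone.indexer), and it assumes that any byte string it receives is already UTF-8:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of Plone or plone.indexer.
TEXT_TYPE = type(u'')  # ``unicode`` on Python 2, ``str`` on Python 3


def to_utf8(value):
    """Return ``value`` as a UTF-8 byte string; ``None`` becomes b''."""
    if value is None:
        return b''
    if isinstance(value, TEXT_TYPE):
        return value.encode('utf-8')
    # Anything else (typically an already encoded byte string) is
    # passed through unchanged. Assumption: such bytes are UTF-8.
    return value


# Inside an indexer this would be used roughly as:
#     return to_utf8(obj.lastname)
```

With a helper like this, every item contributes the same type to the index regardless of how the value happens to be stored on the object.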

I'll certainly give this a go, thank you for the response!

Hmmm, still very strange. I didn't manage to get this resolved, and I still get the following error when trying to save fields with special characters...

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Here is the field in my schema...

lastname = schema.TextLine(
        title=_(u'contact_contenttype_lastname_title', default=u'Lastname'),
        description=_(u'contact_contenttype_lastname_description',
                      default=u"Please enter the lastname."),
        required=True,
    )

If you can tell me what information would help in finding the root of this issue, please let me know. I'm really stumped on this one!

Maybe the issue now is that schema.TextLine assumes data to be unicode, but earlier you mentioned running contact_cleanup code that encoded the strings instead.

Possibly you should modify and re-run your cleanup so that it decodes all the fields in your content objects to unicode. Then make sure you have a plone.indexer-based custom indexer that encodes those values only for the catalog indexes. In short: content objects should hold unicode strings, and catalog indexes should hold UTF-8 encoded strings.
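A rough sketch of the decoding half of that suggestion. decode_fields is a hypothetical helper written for illustration, and it assumes the stored byte strings are UTF-8:

```python
# -*- coding: utf-8 -*-
# Hypothetical helper for illustration; assumes stored bytes are UTF-8.


def decode_fields(obj, field_names, encoding='utf-8'):
    """Decode byte-string attributes of ``obj`` to unicode in place.

    Returns the list of attribute names that were changed, so the
    caller can decide whether a reindex is needed.
    """
    changed = []
    for name in field_names:
        value = getattr(obj, name, None)
        # ``bytes`` is ``str`` on Python 2, so this matches encoded values
        # and skips unicode values, ``None`` and missing attributes.
        if isinstance(value, bytes):
            setattr(obj, name, value.decode(encoding))
            changed.append(name)
    return changed
```

In a cleanup loop like the one posted earlier, this would replace the encode() calls, for example decode_fields(item, ('firstname', 'lastname')) followed by item.reindexObject(idxs=...) for the affected indexes.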

...as described here or..? https://docs.plone.org/develop/plone/searching_and_indexing/indexing.html#custom-index-methods

Could you point me to an example of how this could be done? I'm looking through the docs at the moment...

Thanks for the help!

Update: Alright, so adding the custom indexers for the fields seems to have resolved the issue. I'm going to do some further testing but for the moment it's looking good...

For reference to future users, resolution was:

indexers.py

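from plone.indexer import indexer

# IContact is the schema interface of the Contact content type;
# import it from wherever it is defined in your package.


@indexer(IContact)
def lastname_indexer(obj, **kw):
    # The object stores unicode; the catalog index gets UTF-8 bytes.
    return obj.lastname.encode('UTF-8')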

configure.zcml

<adapter
      name="lastname"
      factory=".indexers.lastname_indexer" />

Is this really true? The catalog expects UTF-8 but doesn't do anything to convert automatically during indexing? And Dexterity doesn't have a generic indexer that does the conversion for you?
So you can no longer just add an index to the catalog TTW for a DX field and expect it to work without having to deploy code?

I ask because I'm now getting this error, which seems to indicate that all of the above is true:

Module Products.CMFPlone.CatalogTool, line 390, in searchResults
Module Products.ZCatalog.ZCatalog, line 604, in searchResults
Module Products.ZCatalog.Catalog, line 1072, in searchResults
Module Products.ZCatalog.Catalog, line 549, in search
Module Products.PluginIndexes.common.UnIndex, line 426, in _apply_index
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 15: ordinal not in range(128)

At least it used to be that Catalog did not care. As long as you only feed it either unicode or bytes it worked. But as soon as the index contained mixed types (with other than ASCII values), you got errors.
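One way to check whether an index has ended up with mixed types is to classify its values. This is only a sketch: the function names are hypothetical, and the values could come from something like the catalog's uniqueValuesFor('lastname'):

```python
# -*- coding: utf-8 -*-
# Hypothetical diagnostic helpers, not part of ZCatalog.
TEXT_TYPE = type(u'')  # ``unicode`` on Python 2, ``str`` on Python 3


def count_string_types(values):
    """Count how many values are text, byte strings, or something else."""
    counts = {'text': 0, 'bytes': 0, 'other': 0}
    for value in values:
        if isinstance(value, TEXT_TYPE):
            counts['text'] += 1
        elif isinstance(value, bytes):
            counts['bytes'] += 1
        else:
            counts['other'] += 1
    return counts


def has_mixed_strings(values):
    """Return True if both text and byte strings are present."""
    counts = count_string_types(values)
    return counts['text'] > 0 and counts['bytes'] > 0
```

If has_mixed_strings() returns True for an index, that index contains both unicode and encoded values, which is the situation described above.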

I'm not sure that's my problem, because it's not the same comparison error as the one given above, but I won't rule it out.
I'm using an old version of the catalog code, so it's also possible that this error has since been fixed. It seems to me that the catalog is the right place to fix a bug like this.