How do you encode unicode string values from schema fields?

ramonski · February 14, 2021, 8:22pm

Hi everyone,

Coming from Archetypes content types, the generated field getters return byte strings, so that e.g. the value u'Ramón' is returned as 'Ram\xc3\xb3n', which is a UTF-8 encoded byte string.

In Dexterity, the value is returned as u'Ram\xf3n', which is just the unicode string.

Here again at a glance:

>>> name_at = obj.getName()
>>> name_at
'Ram\xc3\xb3n'

>>> name_dx = obj.name
>>> name_dx
u'Ram\xf3n'

>>> name_at.decode('utf-8')
u'Ram\xf3n'

>>> name_dx.encode('utf-8')
'Ram\xc3\xb3n'

What is the correct (or better) way to handle this in Plone?

We are still on Python 2.7, and unicode encode/decode errors have been haunting me for a long time; my brain will probably never fully understand them.
But when it comes to Python 3.x, is there anything to take special note of regarding this topic?
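
For context on the Python 3 side, here is a minimal sketch using only built-ins (the variable names are made up for illustration). It assumes nothing about Plone itself; it only shows that Python 3 keeps text (`str`) and encoded data (`bytes`) as separate types and refuses to mix them, where Python 2 would attempt an implicit ASCII conversion.

```python
# Python 3: text (str) and encoded data (bytes) are distinct types.
name = "Ram\u00f3n"           # text: 5 code points
raw = name.encode("utf-8")    # bytes: the accented letter takes 2 bytes

text_length = len(name)
byte_length = len(raw)
round_trip = raw.decode("utf-8")

# Concatenating str and bytes is rejected outright with a TypeError;
# there is no silent ASCII decode as in Python 2.
try:
    name + raw
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None
```

The practical consequence is that a forgotten decode or encode tends to fail early and loudly in Python 3, instead of surfacing later as a UnicodeDecodeError on the first non-ASCII value.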

Thanks, Ramon

zopyx · February 15, 2021, 5:29am

Rule of thumb (for all Python versions): use "str" (3.x) or "unicode" (2.x) at the storage level, use str/unicode for processing, and convert from/to UTF-8 only for data coming from or going to the presentation layer, if needed. Never ever do any internal string processing on UTF-8 encoded strings.
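
A minimal sketch of that rule, written for Python 3 (under Python 2, read `str` as `unicode` and `bytes` as `str`). The helpers `to_text` and `to_bytes` are hypothetical names for this illustration, not Plone or Zope APIs, and UTF-8 is assumed as the external encoding.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes at the input boundary; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Encode text at the output boundary; pass bytes through unchanged."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


# 1. Data arrives encoded (e.g. from a request or a file).
incoming = b"Ram\xc3\xb3n"

# 2. Decode once, then do all processing on text.
name = to_text(incoming)
greeting = "Hello, {}!".format(name.upper())

# 3. Encode once, on the way out.
outgoing = to_bytes(greeting)

# Processing the encoded bytes directly goes wrong: bytes.upper() only
# touches ASCII letters, so the accented character keeps its lowercase form.
broken = incoming.upper().decode("utf-8")
```

Here `name.upper()` upper-cases the accented letter as well, while `broken` ends up with a lowercase "ó" in the middle of an otherwise upper-cased word, which is the kind of subtle damage the "never process encoded strings" rule is meant to prevent.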

ramonski · February 15, 2021, 7:57pm

Thanks Andreas for your answer,

the rule of thumb you mentioned is more or less what I have always tried to follow:

"Use unicode internally in code and UTF-8 encoded strings externally in presentation".

However, doesn't this mean that the generated AT getters were doing it wrong all along?

E.g. here for the AT StringField:

@implementer(IStringField)
class StringField(ObjectField):
    """A field that stores strings"""
    _properties = Field._properties.copy()
    _properties.update({
        'type': 'string',
        'default': '',
        'default_content_type': 'text/plain',
    })

    security = ClassSecurityInfo()

    security.declarePrivate('get')

    def get(self, instance, **kwargs):
        value = ObjectField.get(self, instance, **kwargs)
        if getattr(self, 'raw', False):
            return value
        return encode(value, instance, **kwargs)
...
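
The `encode` helper called at the end of `get` is not part of the excerpt above. As a rough illustration only, the following Python 3 sketch assumes it encodes text values to UTF-8 and passes everything else through. `encode_value` and `FakeStringField` are hypothetical stand-ins written for this post, not the Archetypes source.

```python
def encode_value(value, encoding="utf-8"):
    """Hypothetical stand-in for the `encode` helper (assumed behaviour)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


class FakeStringField:
    """Hypothetical field mirroring the raw / encoded branches of `get`."""

    def __init__(self, raw=False):
        self.raw = raw

    def get(self, stored):
        # With raw set, the stored value is returned untouched.
        if self.raw:
            return stored
        # Otherwise the getter hands out an encoded byte string.
        return encode_value(stored)


stored = "Ram\u00f3n"
encoded = FakeStringField().get(stored)
unencoded = FakeStringField(raw=True).get(stored)
```

Under that assumption, the default getter already performs the "encode for the outside world" step, so any code calling it for internal processing has to decode the result again first.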