Tags / Subject field mangling long terms

flipmcf · November 2, 2020, 5:41pm

tl;dr: Long tags with high unicode codepoints are breaking. This bug is blocking our release of plone5

I think I have a bug bothering me. It gets aggravated on 'long terms' used in the Dublin Core Tags field, aka 'subject'. If the tag gets too long it gets mangled and an equals '=' is added.

To aggravate, edit an object (that has the Dublin Core behavior enabled), go to the
categorization tab, and insert this:
"this is a really really long, I mean really long tag that is over 72 bytes long"

and save.

Go back to edit, visit the page, and you will see an '=' appear:
"this is a really really long, I mean really long tag that is over 72 bytes =long"

Before you answer

'this is a crazy edge case and you shouldn't have tags that long'

consider our high unicode users.

We have a lao site with these keywords:
('chinareach', 'ຊາວບ້ານໃນ', 'ເມືອງວັງວຽງ', 'ບໍ່ຍອມເສັຍ', 'ດິນນາ', 'ບໍຣິສັດຈີນ')

and this is what gets displayed:

The keywords that are over 72 bytes long get affected.
There are 3 utf-8 bytes representing each lao character, which turn into 9 ascii-encoded bytes (including the =). Multiplied by 9 unicode characters, that is 81 bytes, which is > 72 bytes.

So, you are limited to 72 bytes of utf8 encoded string, which is not cutting it for Chinese, Lao, Burmese and other eastern languages.
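
The arithmetic can be checked in a few lines. This is only a sketch: the '=XX' expansion is written out by hand to mirror the mangled form quoted further down in this post, and the keyword is spelled with escapes (taken from the unicode_escape output below) so it survives copy and paste.

```python
# Second keyword from the tuple above, written with escapes (9 code points).
keyword = "\u0e8a\u0eb2\u0ea7\u0e9a\u0ec9\u0eb2\u0e99\u0ec3\u0e99"

utf8 = keyword.encode("utf-8")

# Hand-rolled '=XX' expansion: every non-ASCII byte becomes 3 ASCII characters.
expanded = "".join("=%02X" % byte for byte in utf8)

print(len(keyword))   # 9 characters
print(len(utf8))      # 27 bytes, 3 per Lao character
print(len(expanded))  # 81 ASCII characters, past the limit described above
```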

Inspecting the object from debug shows the object itself is healthy:

>>> story.subject
('chinareach', 'ຊາວບ້ານໃນ', 'ເມືອງວັງວຽງ', 'ບໍ່ຍອມເສັຍ', 'ດິນນາ', 'ບໍຣິສັດຈີນ')
>>> story.subject[1]
'ຊາວບ້ານໃນ'
>>> story.subject[1].encode('utf8')
b'\xe0\xba\x8a\xe0\xba\xb2\xe0\xba\xa7\xe0\xba\x9a\xe0\xbb\x89\xe0\xba\xb2\xe0\xba\x99\xe0\xbb\x83\xe0\xba\x99'
>>> story.subject[1].encode('unicode_escape')
b'\\u0e8a\\u0eb2\\u0ea7\\u0e9a\\u0ec9\\u0eb2\\u0e99\\u0ec3\\u0e99'

What makes it worse is that if you save from here, you alter the tags. Deleting 'ຊາວບ້ານໃນ' and replacing it with =E0=BA=8A=E0=BA=B2...
Even worse, another save will encode the = and change it again to =3DE0=3DBA=3D8A..., and additional saves will prepend every byte with 3D causing an encoding disaster until you have =3D3D3D3D3DE0...
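
The growing run of 3D is what quoted-printable encoding does when it is fed its own output, because '=' is itself escaped as '=3D'. A minimal sketch, assuming the mangling is quoted-printable as produced by Python's `binascii.b2a_qp` (which is what the thread identifies later):

```python
from binascii import b2a_qp

# First three bytes of the keyword, already mangled once.
token = b"=E0=BA=8A"

for _ in range(3):
    # Each pass escapes every '=' again, so one more '3D' appears per byte.
    token = b2a_qp(token)
    print(token)

# b'=3DE0=3DBA=3D8A'
# b'=3D3DE0=3D3DBA=3D3D8A'
# b'=3D3D3DE0=3D3D3DBA=3D3D3D8A'
```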

I've been debugging for about 6 hours now and need some help. The object has the correct value. The widget asks for the field's vocabulary, asks that vocabulary for SimpleTerms for these values, and then uses each SimpleTerm's 'token'. That's where the error seems to be: on the SimpleTerm token.

I have a theory that somewhere this SimpleTerm is being created with a token, and that token creation is being a bit mean to my unicode tags. SimpleTerm itself doesn't seem to be guilty, but some factory using it.

Any clues would be appreciated.

dieter · November 2, 2020, 6:30pm

Almost surely, this is a base64 encoding (of the real thing).

A bit of background: a term has three data items attached:

value: the value as used by the program
title: what is shown to the user
token: usually an ASCII encoding of value; the representation of value at the browser/server interface

At the browser side, there is only token and title. At the server side, the value is reconstructed from the vocabulary via token lookup. As you see wrong values after "save", this means that already the vocabulary is broken.
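
To make the round trip concrete, here is a rough stand-in written with the standard library only. `Term` and `TinyVocabulary` are hypothetical names for illustration, not the zope.schema classes, and the hex token is just one possible ASCII encoding.

```python
from collections import namedtuple

# Hypothetical stand-in for a vocabulary term: value, token, title.
Term = namedtuple("Term", "value token title")


class TinyVocabulary:
    """Illustrative only; real vocabularies come from zope.schema."""

    def __init__(self, values):
        self._by_token = {}
        for value in values:
            # Any ASCII-only, unique encoding of the value works as a token.
            token = value.encode("utf-8").hex()
            self._by_token[token] = Term(value, token, value)

    def terms_for_browser(self):
        # The browser only ever sees (token, title) pairs.
        return [(t.token, t.title) for t in self._by_token.values()]

    def get_term_by_token(self, token):
        # Unknown tokens raise KeyError (a LookupError subclass).
        return self._by_token[token]


vocab = TinyVocabulary(
    ["chinareach", "\u0e8a\u0eb2\u0ea7\u0e9a\u0ec9\u0eb2\u0e99\u0ec3\u0e99"]
)
token, title = vocab.terms_for_browser()[1]

# On "save" the server maps the posted token back to the stored value.
print(vocab.get_term_by_token(token).value == title)  # True
```

As long as the token comes back from the browser byte for byte, the value cannot change; the token's exact shape does not matter.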

I would approach the analysis as follows:

create a small (few vocabulary lookups) reproducing case
put a breakpoint in the vocabulary construction (or if this is too difficult to locate into the term construction)
from there, look where the bad values may come from.

flipmcf · November 2, 2020, 8:00pm

Perfect hint! Thank you! The keyword index is fine, the object is fine, but when the vocabulary gets created (at form load time) things go bad.

It doesn't look base-64 encoded, it looks utf-8 encoded, but that's neither here nor there.

Problem happens in plone.app.vocabularies terms.py safe_simpleterm_from_value:

github.com

plone/plone.app.vocabularies/blob/73af3a780d15e496c469a3d282d29695836d688e/plone/app/vocabularies/terms.py#L28


      
                  value = value.encode('utf-8')
              return value
          
          
          def safe_simpleterm_from_value(value):
              """create SimpleTerm from an untrusted value.
          
              - token need cleaned up: Vocabulary term tokens *must* be 7 bit values
              - anything for display has to be cleaned up, titles *must* be unicode
              """
              return SimpleTerm(value, b2a_qp(safe_encode(value)), safe_unicode(value))
          
          
          def safe_simplevocabulary_from_values(values, query=None):
              """Creates (filtered) SimpleVocabulary from iterable of untrusted values.
              """
              items = [
                  safe_simpleterm_from_value(i)
                  for i in values
                  if query is None or safe_encode(query) in safe_encode(i)
              ]

(Pdb) safe_unicode(u'ຊາວບ້ານໃນ')
'ຊາວບ້ານໃນ'
(Pdb) safe_encode(u'ຊາວບ້ານໃນ')
b'\xe0\xba\x8a\xe0\xba\xb2\xe0\xba\xa7\xe0\xba\x9a\xe0\xbb\x89\xe0\xba\xb2\xe0\xba\x99\xe0\xbb\x83\xe0\xba\x99'
(Pdb) b2a_qp(safe_encode(u'ຊາວບ້ານໃນ'))
b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'

the b2a_qp() is doing it - and I had to look that one up.

But what I find strange is that only the token is broken; the title seems ok:
SimpleTerm(value, token, title). So why would 'display' get broken? I might not have to think about that, tho, seeing what b2a_qp() is doing: adding the additional \n

And it's adding the \n because of RFC 1522 - MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text

While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
encoded-words is limited to 76 characters
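
The wrapping is easy to reproduce outside Plone. A small sketch using only `binascii.b2a_qp`; the short tag is made up, the long one is the example from the first post:

```python
from binascii import b2a_qp

short_tag = b"a short tag"
long_tag = (
    b"this is a really really long, I mean really long tag "
    b"that is over 72 bytes long"
)

# Plain ASCII that fits on one line passes through unchanged.
print(b2a_qp(short_tag))

# Longer input gets a soft line break: '=' followed by '\n'.
wrapped = b2a_qp(long_tag)
print(wrapped)
print(b"=\n" in wrapped)  # True

# No encoded line is longer than 76 characters.
print(max(len(line) for line in wrapped.split(b"\n")))
```

Removing the soft break ('=' plus newline) gives the original bytes back, but removing only the newline leaves a stray '=' behind.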

Do you think we should continue this discussion in github - an issue with p.a.vocab?

dieter · November 2, 2020, 9:07pm

I would disregard the token: tokens are implementation details; it is not important how they are created (from the value) as long as their values remain unique (per vocabulary).

You pose one of the important questions above: why is the display broken even though the title is correct? This would be a widget bug -- it should display the title, not the token.
The second important question is "why is the value wrong?". Apparently, the computed token happens to influence value (which should not happen).

flipmcf · November 3, 2020, 5:09pm

The widget is displaying the title, but the title in the vocabulary is wrong. Why do I feel like I'm in an infinite loop here?

Seriously tho, I think the bug happens when the widget is setting the field and would be adding to the vocabulary also. I can choose a 'long term' from the vocabulary list, save, and then re-visit the edit page to see the 'mangled' value. I'm still spelunking, and surprised I haven't found the bug yet. Lots of moving parts with z3c form and widgets.

dieter · November 3, 2020, 6:18pm

You should not have this feeling:
On the server side, a vocabulary is built from a sequence of values. Those values are unicode strings and they are initially correct. Some (apparently broken) logic constructs from those values (vocabulary) Terms with attributes token, value and title. Apparently, this goes wrong - the logic should keep value unchanged, use the value for title and can (almost) do whatever it wants with token.

I do not think so: when the form is posted, the widget receives the token (which is an implementation detail from the vocabulary). It hands it over to the field which calls a vocabulary method (getTermByToken) to obtain the value. Thus, if the vocabulary is not broken, you cannot get a wrong value.

Writing the above, I see a potential reason this might be wrong: while the base vocabulary (--> zope.schema.vocabulary.SimpleVocabulary) raises an exception for an unknown token, a derived vocabulary might use the token in this case (e.g. to support the creation of new values, not previously known by the vocabulary). In this case, it would be vital that the widget receives the token value from the browser exactly as generated on the server. However, browsers may do some kind of normalization, breaking this requirement.

Your observations hint toward this kind of problem: in some cases, the generated tokens contain characters (especially newlines) which might get normalized by the browser. If the vocabulary then interprets an unknown token as a new value, things break.
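
This theory can be sketched in isolation. Two things here are assumptions rather than findings from the thread: that the newline is dropped somewhere between server and browser, and that the lookup falls back to the raw token for unknown entries (`value_for` is a hypothetical helper, not Plone code).

```python
from binascii import b2a_qp

tag = (
    "this is a really really long, I mean really long tag "
    "that is over 72 bytes long"
)

# What the server-side vocabulary knows: token -> value.
token = b2a_qp(tag.encode("utf-8")).decode("ascii")
by_token = {token: tag}


def value_for(submitted_token):
    # Hypothetical fallback: an unknown token is taken as a new keyword.
    return by_token.get(submitted_token, submitted_token)


# Assumption: the newline inside the token gets lost on the way back.
normalized = token.replace("\n", "")

print(value_for(token) == tag)  # True, exact token round-trips
print(value_for(normalized))    # ... that is over 72 bytes =long
```

With the newline gone the lookup misses, the token is stored as the value, and the leftover '=' matches the symptom from the first post.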

However, this does not explain why you see a wrong title. Try to find out why this happens before other investigations.

1letter · November 3, 2020, 7:57pm

Here, in the converter, the token ends up in the widget's pattern options, and it originates in plone.app.vocabularies:

from binascii import b2a_base64
from binascii import b2a_qp

from zope.schema.vocabulary import SimpleTerm

# safe_encode() is defined earlier in the same module
# (plone/app/vocabularies/terms.py, see the excerpt above).


def safe_simpleterm_from_value(value):
    """create SimpleTerm from an untrusted value.

    - token need cleaned up: Vocabulary term tokens *must* be 7 bit values
    - anything for display has to be cleaned up, titles *must* be unicode
    """
    # import pdb
    # pdb.set_trace()

    encoded = safe_encode(value)
    # Every b2a_qp variant still inserts the soft line break '=\n'.
    token1 = b2a_qp(encoded, istext=True, header=True)  # broken \n newline
    token2 = b2a_qp(encoded, istext=True, header=False)  # broken \n newline
    token3 = b2a_qp(encoded, istext=False, header=True)  # broken \n newline
    token4 = b2a_qp(encoded, istext=False, header=False)  # broken \n newline
    token5 = b2a_base64(encoded, newline=False)  # that helps!
    print("safe simple term token1", token1)
    print("safe simple term token2", token2)
    print("safe simple term token3", token3)
    print("safe simple term token4", token4)
    print("safe simple term token5", token5)
    return SimpleTerm(value=value, token=token5, title=value)

safe simple term token1 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token2 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token3 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token4 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token5 b'4LqK4Lqy4Lqn4Lqa4LuJ4Lqy4LqZ4LuD4LqZ'
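
For reference, the base64 variant round-trips cleanly with the standard library alone. The helper names `token_for` and `value_for` are made up for this sketch; whether a released fix uses exactly this call is not stated here.

```python
from binascii import a2b_base64, b2a_base64


def token_for(value):
    # newline=False suppresses the trailing newline b2a_base64 adds by default.
    return b2a_base64(value.encode("utf-8"), newline=False).decode("ascii")


def value_for(token):
    return a2b_base64(token).decode("utf-8")


keyword = "\u0e8a\u0eb2\u0ea7\u0e9a\u0ec9\u0eb2\u0e99\u0ec3\u0e99"
token = token_for(keyword)

print(token)                        # 4LqK4Lqy4Lqn4Lqa4LuJ4Lqy4LqZ4LuD4LqZ
print(value_for(token) == keyword)  # True
print("\n" in token_for("x" * 200)) # False, long values are not wrapped
```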

flipmcf · November 3, 2020, 9:06pm

That sure does work. The token in the vocabulary DOES NOT CONTAIN A NEWLINE. But when the newline is there, the vocab lookup fails, and the token becomes the value.

I think the lookup is failing in the javascript - on the browser side - not in python - thus setting the title to the token.

The value should be the token.

This works:

<input class="pat-select2 text-widget tuple-field" type="text"
            name="form.widgets.IDublinCore.subjects" value="MjAyNA==" 
            data-pat-select2='{"separator": ";",
                               "vocabularyUrl": "http://localhost:8485...@@getVocabulary?name=plone.app.vocabularies.Keywords&amp;field=subjects", 
                               "initialValues": {"MjAyNA==": "2024"}, 
                               "orderable": true, 
                               "allowNewItems": "true"}'>

This does not work (newline=True):

<input class="pat-select2 text-widget tuple-field" type="text"
            name="form.widgets.IDublinCore.subjects" value="MjAyNA==
" 
            data-pat-select2='{"separator": ";",
                               "vocabularyUrl": "http://localhost:8485...@@getVocabulary?name=plone.app.vocabularies.Keywords&amp;field=subjects", 
                               "initialValues": {"MjAyNA==\n": "2024"}, 
                               "orderable": true, 
                               "allowNewItems": "true"}'>

flipmcf · November 3, 2020, 9:08pm

Is it appropriate to change how the token is generated ( @jensens ), or should we strip newlines in the pat-select2 javascript? I'd prefer the former, only because I am better with python than JS and don't know where to find the pattern JS source (yet).

jensens · November 3, 2020, 10:02pm

+1 for a sane token. The token has to be the browser friendly part here.
-1 to handle this in JS.
We could, e.g., use a hash of the value as a token, if this helps?
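
A hash-based token could look roughly like this. It is a sketch of the idea only, with SHA-256 picked arbitrarily and `hash_token` a hypothetical helper; it is not a description of any merged change.

```python
import hashlib


def hash_token(value):
    # Hex digest: always 64 ASCII characters, no '=', no newline.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


values = ["chinareach", "\u0e8a\u0eb2\u0ea7\u0e9a\u0ec9\u0eb2\u0e99\u0ec3\u0e99"]

# A hash cannot be reversed, so the server must keep a token -> value table.
by_token = {hash_token(v): v for v in values}

token = hash_token(values[1])
print(len(token))                     # 64
print(by_token[token] == values[1])   # True
```

One trade-off worth noting: unlike an encoding, a hash gives no way to recover a value from a token the server has never seen, which matters for a widget that allows new items.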

flipmcf · November 4, 2020, 2:49pm

witsch · November 9, 2020, 1:34pm

Hi, I think we ran into the same thing. https://github.com/plone/plone.app.vocabularies/issues/64#issuecomment-724015650 might have a hint…

mauritsvanrees · February 16, 2021, 5:12pm

PR #65 was merged and I have released plone.app.vocabularies 4.2.2 .

cekk · September 23, 2021, 10:10am

I'm a bit late on this topic, but this could lead to some problems with plone.restapi.

I mean that when you call the @vocabularies endpoint, it will return a list of terms that are value and token.

If the token is the base64-generated one, we can't use it anywhere: for example, if I want to use the list of terms from the Keywords vocabulary in a search interface (to search in Plone for a specific keyword), that token isn't the value stored in the catalog.

Should we update plone.restapi too, so that it uses the value instead of the token?

sneridagh · September 23, 2021, 10:36am

So, could someone explain why the Subjects field got the "based64" treatment, while all other keywords based vocabularies don't? As @cekk explained, it's difficult from the Plone RESTAPI view to handle it, since all are seen as ITokenizedTerm-able

So, without a patch that we implemented in k.volto (and in the upcoming plone.volto):

github.com

kitconcept/kitconcept.volto/blob/main/src/kitconcept/volto/vocabularies/subject.py

from zope.component import queryUtility
from zope.interface import directlyProvides
from zope.interface import implementer
from zope.schema.interfaces import ITitledTokenizedTerm
from zope.schema.interfaces import ITokenizedTerm
from zope.schema.interfaces import IVocabularyFactory
from zope.schema.vocabulary import SimpleVocabulary
from plone.app.layout.navigation.root import getNavigationRootObject
from plone.app.vocabularies.terms import safe_encode
from plone.registry.interfaces import IRegistry
from zope.site.hooks import getSite

from BTrees.IIBTree import intersection
from Products.CMFCore.utils import getToolByName


@implementer(ITokenizedTerm)
class UnsafeSimpleSubjectTerm(object):
    """Simple tokenized term that allows unicode in the token"""

This file has been truncated.

The Subjects vocabulary is broken in Plone 6.

A bit of history on this, why the override in k.volto:

github.com/plone/plone.restapi

Storing the subjects does not convert the vocabulary token to the value

opened 12:51PM - 17 Jul 19 UTC

csenger

Storing subjects with the data you get from `@vocabularies/plone.app.vocabularies.Keywords` does not work.

The client gets a token and a label from @vocabularies and is supposed to send the token in the content POST/PATCH. For Choice schema fields that works as we do an automatic conversion from the vocabulary term's `token` to `value`. But the subject field is a `schema.Tuple()` with `value_type=schema.TextLine()`, and the vocabulary is not defined on the field, but as an annotation for the widget. As a result a client sending the token will store the raw ascii encoded token instead of the value.

Related to #691

[plone.app.dexterity.behaviors.metadata.ICategorization](https://github.com/plone/plone.app.dexterity/blob/82ae47ed7336e5f803e8ca1fc8700a00fafa4760/plone/app/dexterity/behaviors/metadata.py#L104)

```python
@provider(IFormFieldProvider)
class ICategorization(model.Schema):
    # ...
    subjects = schema.Tuple(
        title=_(u'label_tags', default=u'Tags'),
        description=_(
            u'help_tags',
            default=u'Tags are commonly used for ad-hoc organization of ' +
                    u'content.'
        ),
        value_type=schema.TextLine(),
        required=False,
        missing_value=(),
    )
    directives.widget(
        'subjects',
        AjaxSelectFieldWidget,
        vocabulary='plone.app.vocabularies.Keywords'
    )
```

Vocabulary token -> value conversion currently happens only for Choice fields: [plone.restapi.deserializer.dxfields.ChoiceFieldDeserializer](https://github.com/plone/plone.restapi/blob/35ebc41bb5d40b0183e461cbc1d48143a594725f/src/plone/restapi/deserializer/dxfields.py#L117)

```python
@implementer(IFieldDeserializer)
@adapter(IChoice, IDexterityContent, IBrowserRequest)
class ChoiceFieldDeserializer(DefaultFieldDeserializer):
    def __call__(self, value):
        if isinstance(value, dict) and "token" in value:
            value = value["token"]
        if IVocabularyTokenized.providedBy(self.field.vocabulary):
            try:
                value = self.field.vocabulary.getTermByToken(value).value
            except LookupError:
                pass

        self.field.validate(value)
        return value
```

We need to find a way to look up the value for the token when the vocabulary is assigned in an annotation. It should be done for other fields than IChoice too. Choice fields require the value to be in the vocabulary. Choice fields can't be used with SimpleVocabulary if the user should be able to create new entries on the fly like the subject field allows.

(Note: A workaround at the moment is to override the `plone.app.vocabularies.Keywords` vocabulary with a version that uses the value as the token (without encoding/str() conversion). This seems to work in the legacy plone interface too: https://gist.github.com/csenger/9eaed84c20f332a03e77ff8b21b396fd)

the issue and related fixed PR for all keyword indexes:

github.com/plone/plone.restapi

Serialization and deserialization of vocabulary terms should use tokens

opened 11:28AM - 10 Mar 19 UTC

closed 08:39AM - 26 Apr 19 UTC

buchi

Currently the serialization of a field with a vocabulary (e.g. ChoiceField) returns the stored value while the deserialization expects the value. So far this works well for values that are JSON serializable. However the vocabularies endpoint only returns the token and the title of a vocabulary term. This makes it impossible to set a new value from the vocabulary if the token and the value are not equal. Also term values that are not JSON serializable can not be handled this way. To solve this, serialization should return tokens instead of values and deserialization should handle tokens.

@mauritsvanrees @tisto we need to start creating a list of these "inconsistencies" that we have between back and front in Plone 6, and try to address them in Sorrento next October. There are a couple of important things to integrate into core in plone.volto that we need to take care of.

flipmcf · September 23, 2021, 1:54pm

No. You should pass the token between the client and server, as it's designed to be the safe way to represent the value across HTTP and safe to two-way encode-decode on either the client or the server.

https://docs.plone.org/develop/plone/forms/vocabularies.html

SimpleTerm.token must be an ASCII string. It is the value passed with the request when the form is submitted. A token must uniquely identify a term.

SimpleTerm.value is the actual value stored on the object. This is not passed to the browser or used in the form. The value is often a unicode string, but can be any type of object.

SimpleTerm.title is a unicode string or translatable message. It is used for display in the form.

That's a good question. To me it looks like all vocabularies got the base64 treatment. Can you point me to the inconsistency?

The serialization/deserialization of the token is an implementation detail of vocabularies. The vocabulary itself should take care of that. ~~If there is a need for the application using the vocabulary to 'know' that the vocabulary uses base64 encoding, then something is broken in the architecture.~~

This is why the value is passed in the vocabulary item. The value is what should be presented to the user, in the absence of a title.

I believe the only difference between value and token is that token = encoded(value) . ~~What that encoding is should be irrelevant and the token shouldn't be displayed to any users.~~

Note: This post contains strong opinions, not facts. Feel free to convince me I'm wrong.

Edit: I realized that the front-end does need to know what encoding is used. How else can it correctly send a new vocab term back to the server?

sneridagh · September 23, 2021, 2:14pm

@cekk, maybe we can provide a test proving the error in the Subjects field using p.restapi and p.a.vocabularies 4.2.2 along with other vocabularies serialisation.

@flipmcf again, this is about p.restapi + Volto (Plone 6), and the inconsistencies are about how we make sure that taking care of one thing on one side does not break it on the other. Anyway, on this one, we are to blame for not having pushed the fix to core before. Luckily, we will address it for sure at Sorrento.