Tags / Subject field mangling long terms

tl;dr: Long tags with high Unicode code points are breaking. This bug is blocking our Plone 5 release.

I think I have found a bug. It is triggered by 'long terms' used in the Dublin Core Tags field, aka 'subject'. If the tag gets too long, it gets mangled and an equals sign '=' is added.

To reproduce, edit an object (that has the Dublin Core behavior enabled), go to the
categorization tab, and insert this:
"this is a really really long, I mean really long tag that is over 72 bytes long"

and save.

Go back to edit, visit the page, and you will see an '=' appear:
"this is a really really long, I mean really long tag that is over 72 bytes =long"

Before you answer

'this is a crazy edge case and you shouldn't have tags that long'

consider our high unicode users.

We have a Lao site with these keywords:
('chinareach', 'ຊາວບ້ານໃນ', 'ເມືອງວັງວຽງ', 'ບໍ່ຍອມເສັຍ', 'ດິນນາ', 'ບໍຣິສັດຈີນ')

and this is what gets displayed:

The keywords that are over 72 bytes long get affected.
There are 3 UTF-8 bytes representing each Lao character, which turn into 9 ASCII-encoded bytes (including the '='). Multiplied by 9 Unicode characters, that is 81 bytes, which is more than 72.
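
The arithmetic can be checked with a short Python sketch. It assumes the "=XX" form seen in the mangled tags, where each UTF-8 byte becomes three ASCII characters; the variable names are only for illustration.

```python
# Sketch: count how long one Lao keyword becomes once every UTF-8 byte
# is written as "=XX" (three ASCII characters per byte).
keyword = "\u0e8a\u0eb2\u0ea7\u0e9a\u0ec9\u0eb2\u0e99\u0ec3\u0e99"  # 9 Lao characters

utf8_bytes = keyword.encode("utf-8")
escaped = "".join(f"={byte:02X}" for byte in utf8_bytes)

print(len(keyword))     # 9 characters
print(len(utf8_bytes))  # 27 bytes (3 per character)
print(len(escaped))     # 81 ASCII characters, more than 72
```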

So, you are limited to 72 bytes of utf8 encoded string, which is not cutting it for Chinese, Lao, Burmese and other eastern languages.

Inspecting the object from debug shows the object itself is healthy:

>>> story.subject
('chinareach', 'ຊາວບ້ານໃນ', 'ເມືອງວັງວຽງ', 'ບໍ່ຍອມເສັຍ', 'ດິນນາ', 'ບໍຣິສັດຈີນ')
>>> story.subject[1]
'ຊາວບ້ານໃນ'
>>> story.subject[1].encode('utf8')
b'\xe0\xba\x8a\xe0\xba\xb2\xe0\xba\xa7\xe0\xba\x9a\xe0\xbb\x89\xe0\xba\xb2\xe0\xba\x99\xe0\xbb\x83\xe0\xba\x99'
>>> story.subject[1].encode('unicode_escape')
b'\\u0e8a\\u0eb2\\u0ea7\\u0e9a\\u0ec9\\u0eb2\\u0e99\\u0ec3\\u0e99'

What makes it worse is that if you save from here, you alter the tags: 'ຊາວບ້ານໃນ' is deleted and replaced with =E0=BA=8A=E0=BA=B2...
Even worse, another save will encode the = and change it again to =3DE0=3DBA=3D8A..., and additional saves will prepend every byte with 3D causing an encoding disaster until you have =3D3D3D3D3DE0...

I've been debugging for about 6 hours now and need some help. The object has the correct value. The widget asks for the field's vocabulary, asks the vocabulary for SimpleTerms with these values, and then uses the SimpleTerm 'token'. That's where the error seems to be: on the SimpleTerm token.

I have a theory that somewhere this SimpleTerm is being created with a token, and that token creation is being a bit mean to my unicode tags. SimpleTerm itself doesn't seem to be guilty, but some factory using it.

Any clues would be appreciated.

Almost surely, this is a base64 encoding (of the real thing).

A bit of background: a term has three data items attached:

  1. value: the value as used by the program
  2. title: what is shown to the user
  3. token: usually an ASCII encoding of value; the representation of value at the browser/server interface

At the browser side, there is only token and title. At the server side, the value is reconstructed from the vocabulary via token lookup. As you see wrong values after "save", this means that the vocabulary itself is already broken.
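
The round trip described above can be sketched in plain Python. This is a stand-in, not the zope.schema classes: `Term`, `build_vocabulary` and the counter-based tokens are assumptions made only for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    value: str  # what the program stores
    title: str  # what the user sees
    token: str  # ASCII-only stand-in for value, sent to and from the browser


def build_vocabulary(values):
    """Map each token to its term, so a posted token can be resolved."""
    terms = [
        Term(value=v, title=v, token=f"t{index}")
        for index, v in enumerate(values)
    ]
    return {term.token: term for term in terms}


vocabulary = build_vocabulary(["chinareach", "\u0e94\u0eb4\u0e99\u0e99\u0eb2"])

# The browser only ever sees (token, title) pairs ...
browser_side = [(t.token, t.title) for t in vocabulary.values()]

# ... and the server recovers the value by token lookup, never by decoding.
posted_token = browser_side[1][0]
recovered = vocabulary[posted_token].value
print(recovered == "\u0e94\u0eb4\u0e99\u0e99\u0eb2")  # True
```

As long as the tokens stay unique and survive the trip through the browser unchanged, how they are derived from the value does not matter.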

I would approach the analysis as follows:

  • create a small (few vocabulary lookups) reproducing case
  • put a breakpoint in the vocabulary construction (or, if this is too difficult to locate, in the term construction)
  • from there, look where the bad values may come from.

Perfect hint! Thank you! The keyword index is fine, the object is fine, but when the vocabulary gets created (at form load time) things go bad.

It doesn't look base-64 encoded, it looks utf-8 encoded, but that's neither here nor there.

The problem happens in plone.app.vocabularies, terms.py, safe_simpleterm_from_value:

(Pdb) safe_unicode(u'ຊາວບ້ານໃນ')
'ຊາວບ້ານໃນ'
(Pdb) safe_encode(u'ຊາວບ້ານໃນ')
b'\xe0\xba\x8a\xe0\xba\xb2\xe0\xba\xa7\xe0\xba\x9a\xe0\xbb\x89\xe0\xba\xb2\xe0\xba\x99\xe0\xbb\x83\xe0\xba\x99'
(Pdb) b2a_qp(safe_encode(u'ຊາວບ້ານໃນ'))
b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'

the b2a_qp() is doing it - and I had to look that one up.

But what I find strange is that only the token is broken; the title seems ok:
SimpleTerm(value, token, title). So why would 'display' get broken? I might not have to think about that, though, seeing what b2a_qp() is doing: adding the additional \n

And it's adding the \n because of RFC 1522 - MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text

While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
encoded-words is limited to 76 characters
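
A small sketch of that folding with the standard library's `binascii.b2a_qp`, using the same keyword as above. The second half shows, as an assumption about what happens on repeated saves, how re-encoding an already encoded string turns every "=" into "=3D".

```python
from binascii import b2a_qp

keyword = "\u0e8a\u0eb2\u0ea7\u0e9a\u0ec9\u0eb2\u0e99\u0ec3\u0e99"
token = b2a_qp(keyword.encode("utf-8"))

# Encoded lines are capped at 76 characters; a soft line break ("=\n")
# is inserted once the limit is reached.
lines = token.split(b"\n")
print([len(line) for line in lines])  # [76, 6]
print(b"\n" in token)                 # True: the token now contains a newline

# Re-encoding already encoded text escapes each "=" as "=3D".
once = b"=E0=BA=8A"
twice = b2a_qp(once)
print(twice)                          # b'=3DE0=3DBA=3D8A'
```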

Do you think we should continue this discussion in github - an issue with p.a.vocab?

I would disregard the token: tokens are implementation details; it is not important how they are created (from the value) as long as their values remain unique (per vocabulary).

You pose one of the important questions above: why is the display broken even though the title is correct? This would be a widget bug -- it should display the title, not the token.
The second important question is "why is the value wrong?". Apparently, the computed token happens to influence value (which should not happen).


The widget is displaying the title, but the title in the vocabulary is wrong. Why do I feel like I'm in an infinite loop here?

Seriously tho, I think the bug happens when the widget is setting the field and would be adding to the vocabulary also. I can choose a 'long term' from the vocabulary list, save, and then re-visit the edit page to see the 'mangled' value. I'm still spelunking, and surprised I haven't found the bug yet. Lots of moving parts with z3c form and widgets.

You should not have this feeling:
On the server side, a vocabulary is built from a sequence of values. Those values are unicode strings and they are initially correct. Some (apparently broken) logic constructs from those values (vocabulary) Terms with attributes token, value and title. Apparently, this goes wrong - the logic should keep value unchanged, use the value for title and can (almost) do whatever it wants with token.

I do not think so: when the form is posted, the widget receives the token (which is an implementation detail from the vocabulary). It hands it over to the field which calls a vocabulary method (getTermByToken) to obtain the value. Thus, if the vocabulary is not broken, you cannot get a wrong value.

Writing the above, I see a potential reason this might be wrong: while the base vocabulary (--> zope.schema.vocabulary.SimpleVocabulary) raises an exception for an unknown token, a derived vocabulary might use the token in this case (e.g. to support the creation of new values, not previously known by the vocabulary). In this case, it would be vital that the widget receives the token value from the browser exactly as generated on the server. However, browsers may do some kind of normalization, breaking this requirement.

Your observations hint toward this kind of problem: in some cases, the generated tokens contain characters (especially newlines) which might get normalized by the browser. If the vocabulary then interprets an unknown token as a new value, things break.
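
To make the suspected failure mode concrete, here is a hypothetical sketch. `resolve` and its `allow_new` fallback are invented for illustration and are not the plone.app.vocabularies or widget code; the "browser" is simulated by simply dropping the newline.

```python
def resolve(token, terms_by_token, allow_new=True):
    """Return the value for a posted token.

    Hypothetical fallback: an unknown token is treated as a new value.
    """
    if token in terms_by_token:
        return terms_by_token[token]
    if allow_new:
        return token  # the token itself is stored as the "new" value
    raise LookupError(token)


# Server-side token containing a newline, mapped to the real value.
server_token = "=E0=BA=8A=\n=BA=99"
terms_by_token = {server_token: "real value"}

# Simulated browser normalization: the newline does not survive the round trip.
posted_token = server_token.replace("\n", "")

print(resolve(server_token, terms_by_token))  # real value
print(resolve(posted_token, terms_by_token))  # =E0=BA=8A==BA=99
```

With the fallback enabled, the normalized token silently becomes the stored value, which would match the mangled tags seen after saving.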

However, this does not explain why you see a wrong title. Try to find out why this happens before other investigations.

In the converter, the token is passed into the pattern options of the widget, and it originates in plone.app.vocabularies:

def safe_simpleterm_from_value(value):
    """create SimpleTerm from an untrusted value.

    - token need cleaned up: Vocabulary term tokens *must* be 7 bit values
    - anything for display has to be cleaned up, titles *must* be unicode
    """
    # import pdb
    # pdb.set_trace()

    encoded = safe_encode(value)
    token1 = b2a_qp(encoded, istext=True, header=True) # broken \n newline
    token2 = b2a_qp(encoded, istext=True, header=False) # broken \n newline
    token3 = b2a_qp(encoded, istext=False, header=True) # broken \n newline
    token4 = b2a_qp(encoded, istext=False, header=False) # broken \n newline
    token5 = b2a_base64(encoded, newline=False) # that helps!
    print("safe simple term token1", token1) 
    print("safe simple term token2", token2)
    print("safe simple term token3", token3)
    print("safe simple term token4", token4)
    print("safe simple term token5", token5)
    return SimpleTerm(value=value, token=token5, title=value)

safe simple term token1 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token2 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token3 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token4 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token5 b'4LqK4Lqy4Lqn4Lqa4LuJ4Lqy4LqZ4LuD4LqZ'

That sure does work. The token in the vocabulary DOES NOT CONTAIN A NEWLINE. But when the newline is there, the vocab lookup fails, and the token becomes the value.

I think the lookup is failing in the javascript - on the browser side - not in python - thus setting the title to the token.

The value should be the token.

This works:

<input class="pat-select2 text-widget tuple-field" type="text"
            name="form.widgets.IDublinCore.subjects" value="MjAyNA==" 
            data-pat-select2='{"separator": ";",
                               "vocabularyUrl": "http://localhost:8485...@@getVocabulary?name=plone.app.vocabularies.Keywords&amp;field=subjects", 
                               "initialValues": {"MjAyNA==": "2024"}, 
                               "orderable": true, 
                               "allowNewItems": "true"}'>

This does not work (newline=True):

<input class="pat-select2 text-widget tuple-field" type="text"
            name="form.widgets.IDublinCore.subjects" value="MjAyNA==
" 
            data-pat-select2='{"separator": ";",
                               "vocabularyUrl": "http://localhost:8485...@@getVocabulary?name=plone.app.vocabularies.Keywords&amp;field=subjects", 
                               "initialValues": {"MjAyNA==\n": "2024"}, 
                               "orderable": true, 
                               "allowNewItems": "true"}'>
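
For reference, the difference between the two tokens comes down to the `newline` argument of `binascii.b2a_base64`, which defaults to `True`. A minimal sketch with the same "2024" keyword:

```python
from binascii import a2b_base64, b2a_base64

value = "2024"

with_newline = b2a_base64(value.encode("utf-8"))  # default newline=True
without_newline = b2a_base64(value.encode("utf-8"), newline=False)

print(with_newline)     # b'MjAyNA==\n'
print(without_newline)  # b'MjAyNA=='

# The newline-free token still decodes back to the original value.
print(a2b_base64(without_newline).decode("utf-8"))  # 2024
```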

Is it appropriate to change how the token is generated (@jensens), or should we strip newlines in the pat-select2 JavaScript? I'd prefer the former, only because I am better with Python than JS and don't know where to find the pattern JS source (yet).

+1 for a sane token. The token has to be the browser friendly part here.
-1 to handle this in JS.
We could, e.g., use a hash of the value as the token, if this helps?
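
A minimal sketch of what a hash-based token could look like. This only illustrates the idea: `hash_token` is a hypothetical helper, and SHA-256 is an arbitrary choice, not what plone.app.vocabularies does.

```python
import hashlib


def hash_token(value):
    """Return a fixed-length, ASCII-only token for a keyword (hypothetical helper)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


keywords = ["chinareach", "\u0e94\u0eb4\u0e99\u0e99\u0eb2", "2024"]
by_token = {hash_token(k): k for k in keywords}

token = hash_token("\u0e94\u0eb4\u0e99\u0e99\u0eb2")
print(len(token), token.isascii())                          # 64 True
print(by_token[token] == "\u0e94\u0eb4\u0e99\u0e99\u0eb2")  # True
```

A hash cannot be decoded back to the value, so the server has to keep the token-to-value mapping, which is what a vocabulary lookup does anyway.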

Hi, I think we ran into the same thing. https://github.com/plone/plone.app.vocabularies/issues/64#issuecomment-724015650 might have a hint… :slight_smile:

PR #65 was merged and I have released plone.app.vocabularies 4.2.2 .


I'm a bit late on this topic, but this could lead to some problems with plone.restapi.

I mean that when you call the @vocabularies endpoint, it will return a list of terms that are value and token.

If the token is the base64-generated one, we can't use it anywhere: for example, if I want to use the list of terms from the Keywords vocabulary in a search interface (to search in Plone with a specific keyword), that token isn't the value stored in the catalog.

Should we update plone.restapi too, so that it uses the value instead of the token?

So, could someone explain why the Subjects field got the "base64" treatment, while all other keyword-based vocabularies didn't? As @cekk explained, it's difficult to handle from the Plone REST API side, since all are seen as ITokenizedTerm-able :frowning:

So, without a patch that we implemented in k.volto (and in the upcoming plone.volto):

The Subjects vocabulary is broken in Plone 6.

A bit of history on this, why the override in k.volto:

the issue and related fixed PR for all keyword indexes:

@mauritsvanrees @tisto we need to start creating a list of these "inconsistencies" that we have between back and front in Plone 6, and try to address them in Sorrento next October. There are a couple of important things in plone.volto that we need to take care of integrating into core.

No, you should pass the token between the client and server, as it's designed to be the safe way to represent the value across HTTP and safe to encode and decode in both directions on either the client or the server.

https://docs.plone.org/develop/plone/forms/vocabularies.html

SimpleTerm.token must be an ASCII string. It is the value passed with the request when the form is submitted. A token must uniquely identify a term.

SimpleTerm.value is the actual value stored on the object. This is not passed to the browser or used in the form. The value is often a unicode string, but can be any type of object.

SimpleTerm.title is a unicode string or translatable message. It is used for display in the form.

That's a good question. To me it looks like all vocabularies got the base64 treatment. Can you point me to the inconsistency?

The serialization/deserialization of the token is an implementation detail of vocabularies. The vocabulary itself should take care of that. If there is a need for the application using the vocabulary to 'know' that the vocabulary uses base64 encoding, then something is broken in the architecture.

This is why the value is passed in the vocabulary item. The value is what should be presented to the user, in the absence of a title.

I believe the only difference between value and token is that token = encoded(value) . What that encoding is should be irrelevant and the token shouldn't be displayed to any users.

Note: This post contains strong opinions, not facts. Feel free to convince me I'm wrong.

Edit: I realized that the front-end does need to know what encoding is used. How else can it correctly send a new vocab term back to the server?

@cekk, maybe we can provide a test proving the error in the Subjects field using p.restapi and p.a.vocabularies 4.2.2 along with other vocabularies serialisation.

@flipmcf again, this is about p.restapi + Volto (Plone 6), and the inconsistencies are about how we take care of one thing on one side while also making sure it does not break on the other. Anyway, on this one, we are to blame for not having pushed the fix to core before. Luckily, we will address it for sure at Sorrento.
