Tags / Subject field mangling long terms

tl;dr: Long tags with high unicode codepoints are breaking. This bug is blocking our release of plone5

I think I have a bug bothering me. It gets aggravated on 'long terms' used in the Dublin Core Tags field, aka 'subject'. If the tag gets too long it gets mangled and an equals '=' is added.

To aggravate, edit an object (that has the Dublin Core behavior enabled), go to the
categorization tab, and insert this:
"this is a really really long, I mean really long tag that is over 72 bytes long"

and save.

Go back to edit, visit the page, and you will see and '=' appear:
"this is a really really long, I mean really long tag that is over 72 bytes =long"

Before you answer

'this is a crazy edge case and you shouldn't have tags that long'

consider our high unicode users.

We have a lao site with these keywords:
('chinareach', 'ຊາວບ້ານໃນ', 'ເມືອງວັງວຽງ', 'ບໍ່ຍອມເສັຍ', 'ດິນນາ', 'ບໍຣິສັດຈີນ')

and this is what gets displayed:

The keywords that are over 72 byes long get affected.
There are 3 utf-8 bytes representing each lao character, which turn into 9 ascii-encoded bytes (including the =) multuplied by 9 unincode characters > 72 bytes.

So, you are limited to 72 bytes of utf8 encoded string, which is not cutting it for Chinese, Lao, Burmese and other eastern languages.

Inspecting the object from debug shows the object itself is healthy:

>>> story.subject
('chinareach', 'ຊາວບ້ານໃນ', 'ເມືອງວັງວຽງ', 'ບໍ່ຍອມເສັຍ', 'ດິນນາ', 'ບໍຣິສັດຈີນ')
>>> story.subject[1]
'ຊາວບ້ານໃນ'
>>> story.subject[1].encode('utf8')
b'\xe0\xba\x8a\xe0\xba\xb2\xe0\xba\xa7\xe0\xba\x9a\xe0\xbb\x89\xe0\xba\xb2\xe0\xba\x99\xe0\xbb\x83\xe0\xba\x99'
>>> story.subject[1].encode('unicode_escape')
b'\\u0e8a\\u0eb2\\u0ea7\\u0e9a\\u0ec9\\u0eb2\\u0e99\\u0ec3\\u0e99'

What makes it worse is that if you save from here, you alter the tags. Deleting 'ຊາວບ້ານໃນ' and replacing it with =E0=BA=8A=E0=BA=B2...
Even worse, another save will encode the = and change it again to =3DE0=3DBA=3D8A..., and additional saves will prepend every byte with 3D causing an encoding disaster until you have =3D3D3D3D3DE0...

I've been debugging for about 6 hours now and need some help. The object has the correct value, the widget will ask for the vocabulary for the field and then ask the vocabulary for SimpleTerms with these values and then use the SimpleTerm 'token'- that's where the error seems to be - on the SimpleTerm token.

I have a theory that somewhere this SimpleTerm is being created with a token, and that token creation is being a bit mean to my unicode tags. SimpleTerm itself doesn't seem to be guilty, but some factory using it.

Any clues would be appreciated.

Almost surely, this is a base64 encoding (of the real thing).

A bit of background: a term has three data items attached:

  1. value: the value as used by the program
  2. title: what is shown to the user
  3. token: usually an ASCII encoding of value; the representation of value at the browser/server interface

At the browser side, there is only token and title. At the server side, the value is reconstructed from the vocabulary via token lookup. As you see wrong values after "save", this means that already the vocabulary is broken.

I would approach the analysis as follows:

  • create a small (few vocabulary lookups) reproducing case
  • put a breakpoint in the vocabulary construction (or if this is too difficult to locate into the term construction)
  • from there, look where the bad values may come from.
1 Like

Perfect hint! Thank you! The keyword index is fine, the object is fine, but when the vocabulary gets created (at form load time) things go bad.

It doesn't look base-64 encoded, it looks utf-8 encoded, but that's neither here nor there.

Problem happens in plona.app.vocabularies terms.py safe_simpleterm_from_value:

(Pdb) safe_unicode(u'ຊາວບ້ານໃນ')
'ຊາວບ້ານໃນ'
(Pdb) safe_encode(u'ຊາວບ້ານໃນ')
b'\xe0\xba\x8a\xe0\xba\xb2\xe0\xba\xa7\xe0\xba\x9a\xe0\xbb\x89\xe0\xba\xb2\xe0\xba\x99\xe0\xbb\x83\xe0\xba\x99'
(Pdb) b2a_qp(safe_encode(u'ຊາວບ້ານໃນ'))
b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'

the b2a_qp() is doing it - and I had to look that one up.
https://docs.python.org/3/library/binascii.html#binascii.b2a_qp

But what i find strange is that the token is only broken, the title seems ok:
SimpleTerm(value, token, title) So why would 'display' get broken? I might not have to think about that, tho, seeing the behavior b2a_qp() is doing - adding the additional \n

And it's adding the \n because of https://tools.ietf.org/html/rfc1522.html

While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
encoded-words is limited to 76 characters

Do you think we should continue this discussion in github - an issue with p.a.vocab?

I would disregard the token: tokens are implementation details; it is not important how they are created (from the value) as long as their values remain unique (per vocabulary).

You pose one of the important questions above: why is the display broken even though the title is correct? This would be a widget bug -- it should display the title, not the token.
The second important question is "why is the value wrong?". Apparently, the computed token happens to influence value (which should not happen).

2 Likes

The widget is displaying the title, but the title in the vocabulary is wrong. Why do I feel like I'm in an infinite loop here?

Seriously tho, I think the bug happens when the widget is setting the field and would be adding to the vocabulary also. I can choose a 'long term' from the vocabulary list, save, and then re-visit the edit page to see the 'mangled' value. I'm still spelunking, and surprised I haven't found the bug yet. Lots of moving parts with z3c form and widgets.

You should not have this feeling:
On the server side, a vocabulary is built from a sequence of values. Those values are unicode strings and they are initially correct. Some (apparently broken) logic constructs from those values (vocabulary) Terms with attributes token, value and title. Apparently, this goes wrong - the logic should keep value unchanged, use the value for title and can (almost) do whatever it wants with token.

I do not think so: when the form is posted, the widget receives the token (which is an implementation detail from the vocabulary). It hands it over to the field which calls a vocabulary method (getTermByToken) to obtain the value. Thus, if the vocabulary is not broken, you cannot get a wrong value.

Writing the above, I see a potential reason this might be wrong: while the base vocabulary (--> zope.schema.vocabulary.SimpleVocabulary) raises an exception for an unknown token, a derived vocabulary might use the token in this case (e.g. to support the creation of new values, not previously known be the vocabulary). In this case, it would be vital that the widget receives the token value from the browser exactly as generated on the server. However, browsers may do some kind of normalization braking this requirement.

Your observations hint toward this kind of problem: in some cases, the generated tokens contain characters (especially newlines) which might get normalized by the browser. If the vocabulary then interprets an unknown token as a new value, things break.

However, this does not explain why you see a wrong title. Try to find out why this happens before other investigations.

Here in the converter comes the token to the pattern options in the widget and this ends in plone.app.vocabularies

def safe_simpleterm_from_value(value):
    """create SimpleTerm from an untrusted value.

    - token need cleaned up: Vocabulary term tokens *must* be 7 bit values
    - anything for display has to be cleaned up, titles *must* be unicode
    """
    # import pdb
    # pdb.set_trace()

    encoded = safe_encode(value)
    token1 = b2a_qp(encoded, istext=True, header=True) # broken \n newline
    token2 = b2a_qp(encoded, istext=True, header=False) # broken \n newline
    token3 = b2a_qp(encoded, istext=False, header=True) # broken \n newline
    token4 = b2a_qp(encoded, istext=False, header=False) # broken \n newline
    token5 = b2a_base64(encoded, newline=False) # that helps!
    print("safe simple term token1", token1) 
    print("safe simple term token2", token2)
    print("safe simple term token3", token3)
    print("safe simple term token4", token4)
    print("safe simple term token5", token5)
    return SimpleTerm(value=value, token=token5, title=value)
safe simple term token1 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token2 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token3 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token4 b'=E0=BA=8A=E0=BA=B2=E0=BA=A7=E0=BA=9A=E0=BB=89=E0=BA=B2=E0=BA=99=E0=BB=83=E0=\n=BA=99'
safe simple term token5 b'4LqK4Lqy4Lqn4Lqa4LuJ4Lqy4LqZ4LuD4LqZ'

That sure does work. The token in the vocabulary DOES NOT CONTAIN A NEWLINE. But when the newline is there, the vocab lookup fails, and the token becomes the value.

I think the lookup is failing in the javascript - on the browser side - not in python - thus setting the title to the token.

The value should be the token.

This works:

<input class="pat-select2 text-widget tuple-field" type="text"
            name="form.widgets.IDublinCore.subjects" value="MjAyNA==" 
            data-pat-select2='{"separator": ";",
                               "vocabularyUrl": "http://localhost:8485...@@getVocabulary?name=plone.app.vocabularies.Keywords&amp;field=subjects", 
                               "initialValues": {"MjAyNA==": "2024"}, 
                               "orderable": true, 
                               "allowNewItems": "true"}'>

This, does not work: (newline=True)

<input class="pat-select2 text-widget tuple-field" type="text"
            name="form.widgets.IDublinCore.subjects" value="MjAyNA==
" 
            data-pat-select2='{"separator": ";",
                               "vocabularyUrl": "http://localhost:8485...@@getVocabulary?name=plone.app.vocabularies.Keywords&amp;field=subjects", 
                               "initialValues": {"MjAyNA==\n": "2024"}, 
                               "orderable": true, 
                               "allowNewItems": "true"}'>

is it appropriate to change how the token is generated ( @jensens )or strip newlines in the pat-select2 javascript. I'd prefer the former only because I am better with python than JS and don't know where to find the pattern JS source (yet)

+1 for a sane token. The token has to be the browser friendly part here.
-1 to handle this in JS.
We i.e. can use a hash of the value as a token if this helps?

Hi, I think we ran into the same thing. https://github.com/plone/plone.app.vocabularies/issues/64#issuecomment-724015650 might have a hint… :slight_smile:

Plone Foundation Code of Conduct