Unicode error on imported content

espenmn · March 2, 2023, 7:07pm

I have been struggling with importing an old FrontPage 2003 site into Plone.

Finally, I thought things were OK, since the content displays correctly.
Unfortunately, when I try to edit I get

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

If I check 'obj.text.output', I get the following. Is there a way I can search / replace / change this from a script, so I don't have to redo the whole import (20,000 pages)?

b'Familjen Onstads verksamhet under bekv\xc3\xa4mlighetsflagg \xc3\xa4r kans'

PS: Unfortunately, some of the files have both this AND etc.

yurj · March 3, 2023, 7:55am

espenmn · March 3, 2023, 1:47pm

Thanks. Do you know if it is possible to find out what needs to be changed? There must be some control characters, but I can't see them in obj.text.output.
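One way to make such characters visible is to scan the text for anything outside the character range that XML 1.0 allows. This is a minimal sketch, not something from the thread: the helper name `find_invalid_chars` and the sample string are made up for illustration, and it assumes the text is already a `str`. Note that tab, line feed and carriage return are valid XML characters, so they are not reported.

```python
import re

# Characters outside the XML 1.0 "Char" production.
# Tab (\x09), LF (\x0a) and CR (\x0d) are allowed and therefore not matched.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def find_invalid_chars(text):
    """Return (index, repr) pairs for every XML-incompatible character."""
    return [(m.start(), repr(m.group())) for m in INVALID_XML_CHARS.finditer(text)]


# Hypothetical sample: a vertical tab and a NULL byte hidden in the text.
sample = "Familjen\x0bOnstads\x00verksamhet\tunder"
for index, char in find_invalid_chars(sample):
    print(index, char)
```

Using `repr()` shows the escape sequence instead of the invisible character, which is why the offenders show up here even though they cannot be seen when the text is displayed.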

yurj · March 6, 2023, 8:25am

I think you have to convert this to its Unicode equivalent.
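A minimal sketch of that conversion, assuming the stored bytes are UTF-8 (the `\xc3\xa4` sequences in the sample above are the UTF-8 encoding of "ä"). The helper name `to_text` is hypothetical, and writing the result back to the Plone object is left out:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        # errors="replace" keeps a batch run going on badly encoded pages;
        # switch to errors="strict" to find out which pages are affected.
        return value.decode(encoding, errors="replace")
    return value


# The byte string quoted earlier in the thread.
raw = b'Familjen Onstads verksamhet under bekv\xc3\xa4mlighetsflagg \xc3\xa4r kans'
print(to_text(raw))
```

The error message asks for "Unicode or ASCII", so a bytes value containing non-ASCII bytes is one plausible trigger, separate from any control characters.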

espenmn · March 6, 2023, 11:03am

In fact, that is not the problem. It looks like there are some control characters (\t, \n or similar) that mess up some of the pages. I am not sure exactly which yet. There is so much hard-coded HTML (thank you, Microsoft), and so many different encodings. Maybe it is better to do some kind of lxml clean. I will try that and report back: loop over the pages, read the body text, run bodytext = lxml.html.clean.clean_html(bodytext), and save bodytext.

Will report back

I got rid of some by replacing \t and \n before the import, but it seems like there are more.
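Since replacing individual characters one at a time keeps missing some, a broader filter is one option: remove everything that XML 1.0 does not allow in a single pass. The sketch below swaps in a plain regular expression rather than lxml's `clean_html`, and the `pages` dictionary is a hypothetical stand-in for the imported content; reading from and saving to the real Plone objects is not shown.

```python
import re

# Anything outside the XML 1.0 "Char" production; tab, LF and CR are valid.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_chars(text):
    """Return text with every XML-incompatible character removed."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical stand-in for the imported pages: path -> body text.
pages = {
    "/site/page-1": "<p>bekv\u00e4mlighetsflagg\x00 \u00e4r\x0b kans</p>",
    "/site/page-2": "<p>clean\ttext</p>",
}

cleaned = {path: strip_invalid_chars(body) for path, body in pages.items()}

# Only pages whose text actually changed would need to be saved again.
changed = [path for path in pages if pages[path] != cleaned[path]]
print(changed)
```

Tracking which pages changed keeps the number of writes small on a 20,000-page site and gives a list to spot-check afterwards.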