Extracting base64 encoded images from HTML

In a Plone 6.1.3 classic site, I have imported HTML from another system (an older Odoo) into a custom content type's text field. Many of the imported content items contain base64 encoded images saved directly in the HTML, e.g.,

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABCAAAADWCAYAAAAAaPNyAAAgAElEQVR4AYy95ZJlya6tuV+v27r7bCpKDo7FzMzMKygzKwvO7eccbZ98KWJmnNp2+4ebHOWaEDDGlOR/+/p4J8qvT/f69vlBXx6O1qbP60gvT3f753HWeP/jca/Hw0F3u70Ws6VGo6mG45m2m62e7tF5tD1+fbzTb58fhPz2dK+vD0er00eh768l8+/06+PR1vi83z7fWx9tdDLOvC/3e325PzyXz8e97vc7LedzTcYTDfsD7VfL5/GvDwd9vdvrV67vuLNC3Yv3fbs/6Dfu0XFnY7S9MJex37Hj7mWcfua4LupfT7Y9Hfd62O+0Wy+1Wa00ncw0HIy1Wiz1mTkPB7P78XjQw2Gnh8Ne+/VKq9lUm8VCs8nM7jdyMppqPJxo0B+q2+6p1xmo1x..." alt="">

I'd like to create an Image from the base64 encoded data and save it in the (containerish) content item, then update the HTML to replace that img tag with a link to the image so it renders inline but probably not full size.
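A minimal sketch of the decoding step, assuming the `src` value is a well-formed `data:` URI carrying a `;base64` marker. The helper name `decode_data_uri` and the filename scheme are hypothetical. The returned bytes would then be handed to whatever creates the Image inside the container (for example `plone.api.content.create` with a `NamedBlobImage`), which is not shown here.

```python
import base64
import hashlib
import mimetypes
import re

# Matches e.g. "data:image/png;base64,iVBORw0..." (base64 payloads only).
DATA_URI_RE = re.compile(
    r"^data:(?P<mime>image/[a-zA-Z0-9.+-]+);base64,(?P<payload>.+)$",
    re.DOTALL,
)


def decode_data_uri(src):
    """Return (mime_type, image_bytes, filename), or None if src is not
    a base64 image data URI."""
    match = DATA_URI_RE.match(src.strip())
    if match is None:
        return None
    mime = match.group("mime").lower()
    # Remove whitespace that may have been introduced by line wrapping.
    payload = re.sub(r"\s+", "", match.group("payload"))
    data = base64.b64decode(payload)
    # Hypothetical naming scheme: short content hash plus guessed extension.
    extension = mimetypes.guess_extension(mime) or ".bin"
    filename = "inline-" + hashlib.sha1(data).hexdigest()[:12] + extension
    return mime, data, filename
```

Naming the file after a content hash means the same image pasted twice gets the same id, which makes it easy to skip duplicates.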

A curious thing about this HTML: in the default view of the content item, it does not render the images, but when I edit the content item, TinyMCE eventually comes up and shows the image rendered inline in the rich text.

It's kind of a shame that the default view doesn't show those inline images... (says lazy me) even though of course it's probably quite inefficient to leave them in there as is.

Another curiosity: TinyMCE's view source shows HTML looking like this:

<p><u>Before:</u></p>
<p><img src="blob:http://localhost:8080/99492e9c-d130-4a04-a669-4ae2ce7248e1" alt=""></p>
<p><img src="blob:http://localhost:8080/bf25861e-e4b4-4790-b6be-341f7f3a9d39" alt="" style="font-family: inherit; font-size: initial; font-style: initial; font-variant-ligatures: initial; font-variant-caps: initial; font-weight: initial; text-align: inherit;"></p>

and if I browse to one of those blob URLs, the image is found, which makes me think that, on initial save of the HTML, the base64 data is actually saved to Plone as an Image.

Oh dear, base64 encoded images in the main content.

That also happens when you copy/paste content with images from Word into a TinyMCE widget.

Iirc @mauritsvanrees wrote a proof of concept some years ago (or longer) which extracts those images into separate Image content items on save. Classic UI. Let me search.

Found it, 4 years ago: GitHub - collective/collective.base64imagepatch: Assuree that no inline base64 encoded image is stored in an RichtTextField

Maybe suitable to grab some code from :wink:

Oh this fredvd is a great LLM! Even better than #lazyweb :face_blowing_a_kiss:

A Ploney mind is a joy forever. I started fiddling with it in 2002-2003 :smiling_face:

I sometimes joke with Maurits in conversations that with the 17+ years of working together on projects at Zest, I very often still remember why or where he solved something, and he then remembers the what and how.

collective.base64imagepatch
The package collective.base64imagepatch is an add-on for the Plone CMS. It injects event handlers for content type creation and modification to ensure that no inline base64 encoded image is stored in a RichTextField.

That said, it is quite strange that you can get the image using the blob ID. Check in the browser whether any real HTTP call is made. If the hash is always the same, maybe you had already uploaded the image and some mechanism is using that uploaded image via its hash instead of the one you see in the browser content.

You can disable the HTML transform filter :wink: but you're right, that's not what you really want.

What does inefficient mean here? If you try to read the image in the source yourself instead of looking at it rendered, I agree :wink:

As long as the editor can manage to fold such content, the main cost is the size difference between the binary data and its base64 encoded form. On the other hand, you save the request!
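To put a number on that size difference: base64 maps every 3 bytes to 4 ASCII characters, so inline images are roughly a third larger than the binary. A small sketch, with random bytes standing in for a real image:

```python
import base64
import os

raw = os.urandom(300_000)          # stand-in for a 300 kB binary image
encoded = base64.b64encode(raw)    # what ends up inline in the HTML

overhead = len(encoded) / len(raw)
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes, ratio: {overhead:.2f}")
```

The ratio comes out at 4/3, and unlike a separate image URL, that cost is paid again on every load of the page.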

Another aspect is that the Plone filtering has two levels:

  • Remove nasty content from the saved source during save. You save and it is gone.
  • Filter out the tags when rendering for display. Still there in the editor.

And a final aspect is transforms:

  • Transform a recognized pattern during rendering (like replacing resolveuid links with an absolute URL) but keep it in the source.
  • Transform before save and save the transformed result.

If the images are still preserved, you may tweak the filtering options for display.
Alternatively, you could create a portal transform pipeline that extracts the image on save, writes it into the database, and replaces the instance in the source with the image link.
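The "extract on save and replace with a link" idea can be sketched without any Plone machinery, assuming double-quoted `src` attributes. The function name `externalize_images` and the `store` callback are hypothetical: in a real site the callback would create the Image object and return a link to it (perhaps a `resolveuid/...` URL), but that part is an assumption and is left out.

```python
import base64
import re

# Captures the whole src value of <img> tags that embed a base64 image.
IMG_DATA_SRC_RE = re.compile(
    r'(<img\b[^>]*?\bsrc=")(data:image/[a-zA-Z0-9.+-]+;base64,[^"]+)(")',
    re.IGNORECASE,
)


def externalize_images(html, store):
    """Replace inline base64 image sources with the URL returned by
    store(mime, data)."""

    def _replace(match):
        header, payload = match.group(2).split(",", 1)
        mime = header[len("data:"):].split(";", 1)[0].lower()
        data = base64.b64decode(payload)
        url = store(mime, data)  # hypothetical callback, e.g. creates an Image
        return match.group(1) + url + match.group(3)

    return IMG_DATA_SRC_RE.sub(_replace, html)
```

Because only the `src` value is swapped, the other attributes on the tag (`alt`, `style`) survive untouched.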

What seems absolutely reasonable raises questions when you are in Classic UI and may not be able to use the page as a container for the converted image by default.

Another concern is that if data URLs (usually base64 encoded) can be placed in the saved content and processed, you may be exposed to serious abuse.
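One way to read that concern is that decoded payloads should not be trusted blindly. A hedged sketch of some basic checks before storing anything: the allowlist, the size limit, and the function name `safe_image_bytes` are all assumptions, not anything Plone provides.

```python
import base64
import binascii

# Assumed policy values; adjust to the site's needs.
ALLOWED_MAGIC = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "image/gif": b"GIF8",
}
MAX_BYTES = 10 * 1024 * 1024


def safe_image_bytes(mime, payload):
    """Return decoded bytes if the payload passes basic checks, else None."""
    magic = ALLOWED_MAGIC.get(mime.lower())
    if magic is None:
        return None  # e.g. image/svg+xml, which can carry scripts
    # Base64 inflates by 4/3, so reject oversized payloads before decoding.
    if len(payload) > MAX_BYTES * 4 // 3 + 4:
        return None
    try:
        data = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None
    # The declared MIME type must agree with the file signature.
    if len(data) > MAX_BYTES or not data.startswith(magic):
        return None
    return data
```

This only checks the declared type against the file signature; it is not a full image validation.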

I personally enjoyed it when Zotero.org started to write single-page HTML files with all resources base64 encoded inside one file. This reduces the burden of conserving content for archival purposes. But I never thought about the way back.

In frame-oriented layout software you can save images either inside the document or as external links. Most allow conversion from one approach to the other. With Word and other office programs this is a pain. With PDF you need special skills and tools.

With this example you can see again how improper long-term content lifecycle management breaks the internet and damages our digital heritage.

Thanks for bringing this up.

Ah: And do not forget the conversion and handling of image metadata in this process!

With GitHub - collective/collective.base64imagepatch: Assuree that no inline base64 encoded image is stored in an RichtTextField I am wondering if the image is transferred into a blob or just patched away. The docs are not explicit about this. I did not jump into the code.

@acsr I read your hidden/deleted replies. Why hide them? They were informative.

I can see the base64 data in the source while TinyMCE processes the (very long) HTML. Only eventually does TinyMCE become ready to accept user input, and it then changes what I see to the “blob” URLs. The raw text value also still contains the base64 data.

I wrote my answers on the phone without actually reading the other stuff. I now undeleted them. But please keep that in mind :wink:

@fredvd this was exactly what I needed. At first I tried just using it as is, and there were just a couple of small issues to tweak, given it had been 4 years since the last update.

  • Python 3 bytes vs Python 2 strings, utf-8
  • handling the raw attribute missing on RichTextField

but then I realized it did not work exactly the way I need it to, since in my case I want the extracted images to be created inside each (containerish) item, so I ended up taking almost all the code but simplifying it (in my case, I don't need to try to add the image in the container of the item nor in the parent of the container of the item).
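The two small tweaks listed above could look roughly like this. This is a sketch of one possible reading rather than the actual patch: the helper name `get_raw_html` is hypothetical, and it assumes the field value is either an object exposing `.raw`, a plain `str`, `bytes`, or `None`.

```python
def get_raw_html(value):
    """Return the HTML of a rich text value as str, tolerating plain
    strings, bytes, and missing values."""
    # Rich text value objects expose .raw; plain str/bytes values do not.
    raw = getattr(value, "raw", value)
    if raw is None:
        return ""
    if isinstance(raw, bytes):
        # Assumes UTF-8 content, as in the tweak mentioned above.
        raw = raw.decode("utf-8")
    return raw
```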

Then I found an issue with one particular HTML value that was 68,353,055 bytes long. For some reason, BeautifulSoup was not extracting the img tags correctly from it, so I replaced the tag extraction code with a regular expression findall().
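For illustration, a regular expression along these lines can pull out the img tags with `findall()`. This is not the exact pattern I used, just a minimal sketch; the name `find_inline_image_tags` is hypothetical, and it assumes the `src` attribute is quoted and the tag contains no `>` inside attribute values.

```python
import re

# Matches whole <img ...> tags whose src is a base64 image data URI.
IMG_TAG_RE = re.compile(
    r'<img\b[^>]*?\bsrc=["\']data:image/[^;"\']+;base64,[^"\']+["\'][^>]*>',
    re.IGNORECASE,
)


def find_inline_image_tags(html):
    """Return all <img> tags with embedded base64 images, in document order."""
    return IMG_TAG_RE.findall(html)
```

Since the pattern has no capturing groups, `findall()` returns the full tag text, which can then be used for a plain string replacement in the HTML.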

Not sure how useful it would be to contribute these changes back (other than Python 3 bytes utf-8 tweaks) since the rest is specific to my use case.