Indexing PDF and DOCX files not working

Hi all,

After testing the classic Plone online demo, I decided to try setting up a project with it.

In the process I ran into some problems I could only solve partially. I would be really happy if someone could help me out or point me in the right direction.

What I did:

  • installed Ubuntu Server 24.04.2 (Linux)
  • chose the Classic UI
  • followed the "Install Plone with Buildout" instructions

So far everything went well. But after creating a Plone site and uploading PDF files, their content was not indexed for SearchableText.

I found this topic (https://community.plone.org/t/plone-6-installation-does-not-index-pdf-files/17056) where the same problem was described. It linked to a solution from MRTANGO (Fixing PDF indexing / missing pdf_to_text transform in Plone). After installing poppler-utils and running the stated Python commands, 'pdf_to_text' showed up. Now the content of uploaded PDF files is searchable.
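
Before debugging the Plone side, it can help to confirm that the external converter is actually on the PATH of the user running Plone. A minimal sketch, assuming the PDF transform relies on the `pdftotext` binary that poppler-utils provides on Ubuntu; `missing_binaries` is a hypothetical helper name, not part of Plone:

```python
import shutil


def missing_binaries(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pdftotext is installed by the poppler-utils package on Ubuntu.
    missing = missing_binaries(["pdftotext"])
    print("missing binaries:", missing or "none")
```

Run it as the same system user that starts the Plone instance, since PATH can differ between users.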

I would like to achieve the same for Word (.docx) files, but I cannot find a solution for this.

Can someone tell me whether this can work, and how?

If I made a stupid mistake, excuse me please. I only have very basic knowledge regarding this topic.

Markus :)

wv must be installed for Office content.

Thanks for the reply. I was searching for "wv" and found this thread: How to install wv to index Word documents?
There it was explained that wv was needed for the older doc file type. docx is xml based and needs Products.OpenXml.
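
The difference between the two formats is easy to see at the byte level: a legacy .doc is an OLE compound file, while a .docx is a ZIP archive that contains `word/document.xml`. A minimal sketch of telling them apart, using only the standard library; `sniff_word_format` is a hypothetical helper name:

```python
import io
import zipfile

# Signature of the OLE compound file container used by legacy .doc files.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_word_format(data: bytes) -> str:
    """Classify raw bytes as 'doc', 'docx' or 'unknown'."""
    if data.startswith(OLE_MAGIC):
        return "doc"
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            # A WordprocessingML package keeps its main part here.
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"
```

This only checks the container, not whether the document itself is valid.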

I added Products.OpenXml to the eggs list, but the buildout command resulted in errors. ChatGPT explained:

  • Products.OpenXml is not compatible with Python 3.
  • Plone 6.1, however, requires Python 3.

Any other ideas? Did somebody solve this problem?

@Markus I use collective.elasticsearch

It obviously adds complexity and might be overkill if you only need to index PDF and Office files, but it supports indexing many data types. It uses Apache Tika under the hood.

I would use a custom indexer. I use docx2txt in a custom transform, but it can be adapted for indexing in a similar way.

from io import BytesIO
from Products.PortalTransforms.interfaces import ITransform
from zope.interface import implementer

import docx2txt


@implementer(ITransform)
class docx_to_text:
    __name__ = "docx_to_text"
    inputs = (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
    output = "text/plain"

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.__name__ = name

    def name(self):
        return self.__name__

    def convert(self, orig, idata, **kwargs):
        out = []
        text = docx2txt.process(BytesIO(orig))
        out.extend(self.clean_data(data=text))
        idata.setData(" ".join(out))
        return idata

    def clean_data(self, data=None):
        out = []
        if not data:
            return out
        out.extend(data.replace("\n", ",").split(","))
        out = list(set(out))
        out = list(map(lambda item: item.strip('"\t '), out))
        return out


def register():
    return docx_to_text()
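
For readers who want to see roughly what the extraction step involves without installing anything, here is a standard-library-only sketch. It is not the docx2txt implementation, just an illustration under the assumption that the body text lives in `w:t` elements inside `word/document.xml`; it ignores headers, footers, footnotes and text boxes. `docx_bytes_to_text` is a hypothetical name:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by .docx body content.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_bytes_to_text(data: bytes) -> str:
    """Return the paragraph text of a .docx given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; join their text pieces.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    # Skip empty paragraphs and separate the rest by newlines.
    return "\n".join(p for p in paragraphs if p)
```

The returned string could be passed to `idata.setData(...)` in a transform like the one above, in place of the docx2txt call.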


Not really a complete solution, but I've used LibreOffice on Linux to convert Office documents to several formats, including plain text. I've also used it to export images from .docx files and generate thumbnails from them, for those users who, for some bizarre reason, embed photos in Word docs and email them. Anyway, if you can export the text, you can index it.

See "Is there a command line tool to convert documents to plain text files?" on Ask LibreOffice (English) for details.
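
A minimal sketch of driving that conversion from Python, assuming LibreOffice is installed and its `soffice` launcher is on the PATH. The function names are hypothetical, and the output filename is assumed to be the source name with a `.txt` extension:

```python
import subprocess
from pathlib import Path


def build_convert_command(source: Path, outdir: Path) -> list[str]:
    """Build the headless LibreOffice call that exports plain text."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(source),
    ]


def convert_to_text(source: Path, outdir: Path) -> Path:
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_command(source, outdir), check=True, timeout=120)
    # Assumption: LibreOffice names the result <stem>.txt inside outdir.
    return outdir / (source.stem + ".txt")
```

Starting LibreOffice for every upload is slow, so this fits batch conversion better than indexing at upload time.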

Is this being indexed? Can I see the code somewhere?

See my code snippet above. Or do you mean the indexer code?

I forked Products.OpenXml a while back to make it work on Python 3: ewohnlich/Products.OpenXml on GitHub (OpenXml documents support for Plone). It was a really simple change, but not the best solution since it's not maintained.

I tried to understand how ‘pdf_to_text’ works, so I tried to add a basic transform.

  1. I am uncertain how to ‘register the transform’ in /portal_transforms.
    Are ‘register transform’ and ‘register in portal_transforms’ two different things? (It seems so.)

  2. I don't understand how PDFs are indexed. After adding ‘poppler’, PDFs do get indexed (and show up when searching for text in the PDF). Can Word files be handled the same way?

    So: do we need both a transform and an indexer to ‘make it behave like PDFs’?
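
On the first question, these do appear to be two separate steps: the module-level `register()` function only builds the transform object, whereas the portal_transforms tool has to be told which module to load. The sketch below assumes the tool exposes `objectIds()` and `manage_addTransform(id, module)`, as the TransformTool in Products.PortalTransforms does to my knowledge; `ensure_transform` and the dotted module path in the comment are hypothetical. On the second question, the SearchableText indexer for files seems to ask portal_transforms for a text/plain version, so a registered transform for the DOCX MIME type may be all that is needed, but that is worth verifying against your Plone version.

```python
def ensure_transform(transform_tool, name, module_path):
    """Add a transform to the tool unless one with that id already exists.

    Returns True if the transform was added, False if it was present.
    """
    if name in transform_tool.objectIds():
        return False
    # Assumption: the tool imports `module_path` and calls its register().
    transform_tool.manage_addTransform(name, module_path)
    return True


# Possible use inside a Plone debug session or setup handler (untested here):
#   from plone import api
#   tool = api.portal.get_tool("portal_transforms")
#   ensure_transform(tool, "docx_to_text", "my.addon.transforms.docx_to_text")
```

After adding a transform, existing files would still need to be reindexed before their text shows up in search.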

Thanks. I will have a look at it (maybe use ‘replacements’ for openxml).

Until then: where is openxmllib 1.1.2? (I only found 1.1.1.)

UPDATE (found 2.x)

Also forked: ewohnlich/openxmllib on GitHub (automatically exported from code.google.com/p/openxmllib). Both are from the same developer.

I have this working in a transform now, so Word files can be searched much like PDFs if python-docx is installed. Unfortunately, I cannot get it to install as a separate add-on (I have been trying for a whole day without any luck). The code works if I add it to an existing add-on, but not by itself.

    from Products.PortalTransforms.interfaces import ITransform
    from Products.PortalTransforms.libtransforms.commandtransform import popentransform
    from zope.interface import implementer


    @implementer(ITransform)
    class word_docx_to_text(popentransform):
        __name__ = "word_docx_to_text"
        # Trailing comma makes this a one-element tuple, not a plain string.
        inputs = (
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        )
        output = "text/plain"
        output_encoding = "utf-8"
        # External command and arguments used to convert the file.
        binaryName = "docx2txt"
        binaryArgs = "- -enc UTF-8 -"


    def register():
        return word_docx_to_text()
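
One Python detail to watch with the `inputs` attribute: parentheses around a single string do not create a tuple, only a trailing comma does. The bundled transforms I have seen declare `inputs` as a tuple, so I assume a bare string is not what the tool expects. A small illustration, with `as_mimetype_tuple` as a hypothetical helper:

```python
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

not_a_tuple = (DOCX)   # still a str: the parentheses only group
one_tuple = (DOCX,)    # the trailing comma makes a one-element tuple


def as_mimetype_tuple(value):
    """Normalise a single MIME type or an iterable of them to a tuple."""
    if isinstance(value, str):
        return (value,)
    return tuple(value)


if __name__ == "__main__":
    print(type(not_a_tuple).__name__, type(one_tuple).__name__)
```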

I made an add-on, in case it is of interest to anyone.

You need to install ‘docx2text’. Improvements welcome:
