Indexing PDF and DOCX files not working

Markus · June 18, 2025, 10:08am

Hi all,

After testing the Classic Plone online demo, I decided to try to set up a project with it.

In the process I ran into some problems I could only solve partially. I would be really happy if someone could help me out or point me in the right direction.

What I did:

installed Ubuntu Server 24.04.2
chose the Classic UI
followed the "Install Plone with Buildout" instructions

So far everything went well. But after I created a Plone site and uploaded PDF files, their content was not indexed in SearchableText.

I found this topic (https://community.plone.org/t/plone-6-installation-does-not-index-pdf-files/17056) where the same problem was described. It linked to a solution from MRTANGO (Fixing PDF indexing / missing pdf_to_text transform in Plone). After I installed poppler-utils and ran the Python commands given there, the 'pdf_to_text' transform showed up. Now the content of uploaded PDF files is searchable.
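Before registering the transform, it can help to confirm that the converter binary is actually on the PATH of the user running Plone. A minimal sketch, assuming the transform relies on the `pdftotext` executable that poppler-utils provides; the function name `find_converter` is made up for this example:

```python
import shutil


def find_converter(binary_name="pdftotext"):
    """Return the full path of a converter binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_converter()
    if path is None:
        print("pdftotext not found; install poppler-utils first")
    else:
        print(f"pdftotext found at {path}")
```

Run it as the same system user that starts the Plone instance, since the PATH may differ between users.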

I would like to achieve the same for Word / docx files, but I cannot find a solution for this.

Can someone tell me whether this can work, and if so, how?

If I made a stupid mistake, please excuse me. I only have very basic knowledge of this topic.

Markus

zopyx · June 18, 2025, 10:19am

wv must be installed for Office content.

Markus · June 18, 2025, 2:42pm

Thanks for the reply. I searched for "wv" and found this thread: "How to install wv to index Word documents?"
It explains that wv is needed for the older doc file type. docx is XML-based and needs Products.OpenXml.

I added Products.OpenXml to the eggs list, but the buildout command resulted in errors. ChatGPT explained:

Products.OpenXml is not compatible with Python 3.
Plone 6.1, however, requires Python 3.

Any other ideas? Has anybody solved this problem?

maethu · June 18, 2025, 3:31pm

@Markus I use collective.elasticsearch

It obviously adds complexity, and it might be overkill if your only goal is indexing PDF and Office files. But it supports indexing a lot of data types. It uses Apache Tika under the hood.

1letter · June 18, 2025, 5:26pm

I would use a custom indexer. I use docx2txt in a custom transform, but it can be adapted for indexing in a similar way.

from io import BytesIO
from Products.PortalTransforms.interfaces import ITransform
from zope.interface import implementer

import docx2txt


@implementer(ITransform)
class docx_to_text:
    __name__ = "docx_to_text"
    inputs = (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
    output = "text/plain"

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.__name__ = name

    def name(self):
        return self.__name__

    def convert(self, orig, idata, **kwargs):
        out = []
        text = docx2txt.process(BytesIO(orig))
        out.extend(self.clean_data(data=text))
        idata.setData(" ".join(out))
        return idata

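    def clean_data(self, data=None):
        # Split the extracted text into fragments on newlines and commas,
        # strip quotes and whitespace, then drop empties and duplicates.
        if not data:
            return []
        parts = data.replace("\n", ",").split(",")
        stripped = (item.strip('"\t ') for item in parts)
        # dict.fromkeys de-duplicates while keeping the original order.
        return list(dict.fromkeys(item for item in stripped if item))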


def register():
    return docx_to_text()
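Since docx is a zip archive of XML files, the extraction step can also be sketched with the standard library alone, which may be useful if adding docx2txt to the buildout is not an option. This is only an illustration, not part of the transform above: the names `docx_bytes_to_text` and `make_minimal_docx` are made up here, and the sketch reads only the main document body (no headers, footers, or footnotes).

```python
import zipfile
from io import BytesIO
from xml.etree import ElementTree

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Extract plain text from the main body of a .docx given as bytes."""
    with zipfile.ZipFile(BytesIO(data)) as archive:
        root = ElementTree.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds one run of text inside the paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs):
    """Build a tiny in-memory archive for trying out the function.

    It contains only word/document.xml, so it is not a complete .docx.
    """
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_minimal_docx(["Hello world", "Second paragraph"])
    print(docx_bytes_to_text(sample))
```

Inside a transform, `docx_bytes_to_text(orig)` could stand in for the `docx2txt.process(...)` call, assuming `orig` holds the raw file bytes.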

stevepiercy · June 19, 2025, 8:42am

Not really a complete solution, but I've used LibreOffice on Linux to convert Office documents to multiple formats, including plain text. I've also used it to export images from a .docx and generate thumbnails from them, for those users who embed photos in Word docs and email them, for some bizarre reason. Anyway, if you can export the text, then you can index it.

See "Is there a command line tool to convert documents to plain text files?" on Ask LibreOffice for details.
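A rough sketch of how that conversion could be driven from Python, assuming LibreOffice is installed and its `soffice` binary is on the PATH. The helper names are made up for this example, and the encoding of the resulting text file may depend on the system locale, so treat the read step as a starting point.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, binary="soffice"):
    """Build the LibreOffice command line that converts one document to plain text."""
    return [
        binary,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(source),
    ]


def convert_to_text(source, outdir, timeout=120):
    """Run the conversion and return the text of the resulting .txt file."""
    subprocess.run(build_convert_command(source, outdir), check=True, timeout=timeout)
    # LibreOffice names the output after the input file, with a .txt suffix.
    result = Path(outdir) / (Path(source).stem + ".txt")
    return result.read_text(errors="replace")
```

Calling `convert_to_text("report.docx", "/tmp/out")` would then return the extracted text, which an indexer or transform could pass on to SearchableText.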