Plone 6 installation does not index pdf files

I followed this documentation to install Plone 6.

Much later I noticed that pdf files were not being indexed. The Plone 6 demo site indexes pdf files correctly: I uploaded a pdf file there, examined its portal_catalog entry, and saw that the SearchableText included the text from the pdf file. In my installation, the SearchableText for a pdf file includes only the title and filename.

To fix this, I tried the following (Ubuntu):

  • sudo apt install poppler-utils
  • sudo apt-get install libreadline-dev wv poppler-utils

Running "make build" and "make start" after these installs did not solve the problem. Can someone advise on how to get pdf file indexing to work?

Go to /Plone/portal_transforms/manage_main in your site and check whether the pdf_to_text transform is in the list. If not, you can try reloading your transforms with the "Reload Transforms" tab.
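The same check can be done without the ZMI. Below is a minimal sketch, assuming the site id is "Plone" (as in the path above) and that the script is run through zconsole, which provides the Zope root as the global `app`. The helper name `missing_transforms` and the file name `check_transforms.py` are made up for this example.

```python
# check_transforms.py (hypothetical file name)
# Lists which of the wanted transforms are not registered in portal_transforms.


def missing_transforms(registered_ids, wanted=("pdf_to_text",)):
    """Return the wanted transform ids that are absent from registered_ids."""
    registered = set(registered_ids)
    return [name for name in wanted if name not in registered]


# When run via zconsole, `app` (the Zope root) is available as a global.
# Outside Zope this part is skipped, so the helper can be tried on its own.
zope_root = globals().get("app")
if zope_root is not None:
    site = zope_root["Plone"]  # assumption: the site id is "Plone"
    ids = site.portal_transforms.objectIds()
    missing = missing_transforms(ids)
    if missing:
        print("Not registered:", ", ".join(missing))
    else:
        print("pdf_to_text is registered")
```

From the backend directory this could be run as ./bin/zconsole run instance/etc/zope.conf check_transforms.py, assuming the same layout as the zconsole command quoted later in this thread.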

Hi Peter,

Thank you very much for the suggestion. I tried that. On the console I got the message:

  • "aborting transaction due to no CSRF protection"

On the browser page, I clicked the "confirm action" button, and the response was:

  • Those Transformations have been reloaded : (a list that does not include pdf_to_text)

If you have any other advice, please let me know. Thanks again!

The transform searches for the binary pdftotext; see Products.PortalTransforms/pdf_to_text.py at master in plone/Products.PortalTransforms on GitHub.

This binary has to be executable by the user that the Plone instance runs as.
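A quick way to verify this is to run a small check as the same user that starts Plone. The following is only a sketch using the Python standard library; the function names are made up for this example, and it does not reproduce how Products.PortalTransforms itself locates the binary.

```python
import os
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH and executable, else None."""
    # shutil.which only returns files that the current user may execute.
    return shutil.which(name)


def describe(name):
    """Build a one-line report about `name` for the current user."""
    path = find_executable(name)
    if path is None:
        return f"{name}: not found on PATH (or not executable by this user)"
    return f"{name}: {path} (executable: {os.access(path, os.X_OK)})"


if __name__ == "__main__":
    print(describe("pdftotext"))
```

If this reports "not found" for the Plone user but works in your own shell, the two environments probably have a different PATH.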

pdftotext (located in /usr/bin) has access mode: -rwxr-xr-x

OK ... then I would check the for loop that registers the transforms when your application starts, and see what happens when pdf_to_text is registered; see Products.PortalTransforms/__init__.py at master in plone/Products.PortalTransforms on GitHub.

You can also set your log level to DEBUG to see whether the TransformEngine logs anything about registering pdf_to_text.

OMG ... since when is that? That's an issue which should be reported I'd say ...

Thank you Yuri and Peter. The workaround documented by Mr. Tango worked.

Next problem...

While pdf files in folders are now correctly indexed, files in custom dexterity types (using NamedBlobFile) are not. There is some documentation here, but both it and the source code are very hard to decipher. My attempts to mark the NamedBlobFile field to be indexed have failed.

Default File content has its own SearchableText indexer defined in plone.app.contenttypes.

I just tried the textindexer as documented and got the following traceback for a NamedBlobFile field:

Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 187, in transaction_pubevents
  Module transaction._manager, line 257, in commit
  Module transaction._manager, line 134, in commit
  Module transaction._transaction, line 267, in commit
  Module transaction._transaction, line 333, in _callBeforeCommitHooks
  Module transaction._transaction, line 372, in _call_hooks
  Module Products.CMFCore.indexing, line 317, in before_commit
  Module Products.CMFCore.indexing, line 227, in process
  Module Products.CMFCore.indexing, line 49, in reindex
  Module Products.CMFCore.CatalogTool, line 368, in _reindexObject
  Module Products.CMFPlone.CatalogTool, line 324, in catalog_object
  Module Products.ZCatalog.ZCatalog, line 495, in catalog_object
  Module Products.ZCatalog.Catalog, line 362, in catalogObject
  Module Products.ZCTextIndex.ZCTextIndex, line 170, in index_object
  Module plone.indexer.wrapper, line 65, in __getattr__
  Module plone.indexer.delegate, line 20, in __call__
  Module plone.app.dexterity.textindexer.indexer, line 85, in dynamic_searchable_text_indexer
AssertionError: expected converted value of IDexterityTextIndexFieldConverter to be a str

Sample Code:

from plone.app.dexterity import textindexer
from plone.namedfile import field as namedfile
from plone.supermodel import model

class IExampleSchema(model.Schema):
    test_file = namedfile.NamedBlobFile(title="searchable file")
    textindexer.searchable("test_file")

and enabled the behavior in example_type.xml

<property name="behaviors" purge="false">
    <element value="plone.textindexer" />
</property>

I've fixed that here: "Fix for searchable Named(Blob)File" by petschki, Pull Request #362 in plone/plone.app.dexterity on GitHub

so the sample code above works now ...

Thank you Peter! Very impressed that you were able to find the problem and fix this so quickly!

Unfortunately, the problem of pdf files not being indexed still exists. Any idea how to apply Mr. Tango's fix for the pdf_to_text transform on a Plone 6 Classic installation (which doesn't have a /bin/instance)? Note that I installed the Plone 6 Classic site using the instructions here.

cd backend
./bin/zconsole run instance/etc/zope.conf

Okay, I may be missing something (probably am) but can you tell me which line(s) of your Makefile enable pdf indexing?