Plone 6 installation does not index pdf files

deankarlen · March 23, 2023, 8:44pm

I followed this documentation to install Plone 6.

Much later I noticed that PDF files were not being indexed. The demo Plone 6 site indexes PDF files correctly: I examined the portal_catalog entry for a PDF file that I uploaded and saw that its SearchableText included the text from the file. In my installation, the SearchableText for a PDF file includes only the title and filename.
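The same check can be scripted instead of clicked through in the ZMI. This is a minimal sketch, assuming the catalog exposes ZCatalog's `getrid(path)` and `getIndexDataForRID(rid)` methods and that the index is named SearchableText. The helper name `searchable_words`, the `FakeCatalog` stand-in, and the example path are hypothetical and exist only so the snippet runs without a Plone site.

```python
def searchable_words(catalog, path):
    """Return the words indexed under SearchableText for the object at `path`.

    Returns None when the catalog has no record for that path.
    """
    rid = catalog.getrid(path)  # record id, or None if the path is not cataloged
    if rid is None:
        return None
    index_data = catalog.getIndexDataForRID(rid)  # dict: index name -> indexed value
    return list(index_data.get("SearchableText", []))


class FakeCatalog:
    """Hypothetical stand-in so the sketch runs without a Plone site."""

    def __init__(self, records):
        self._paths = list(records)
        self._records = records

    def getrid(self, path, default=None):
        return self._paths.index(path) if path in self._paths else default

    def getIndexDataForRID(self, rid):
        return self._records[self._paths[rid]]


if __name__ == "__main__":
    fake = FakeCatalog({"/Plone/report.pdf": {"SearchableText": ["report", "report.pdf"]}})
    print(searchable_words(fake, "/Plone/report.pdf"))
```

In a real site the first argument would be the portal_catalog tool and the second the physical path of the uploaded file. If the result contains only the title and filename, the PDF body was not extracted.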

To fix this, I tried the following on Ubuntu:

sudo apt install poppler-utils
sudo apt-get install libreadline-dev wv poppler-utils

Following these attempts with "make build" and "make start" did not solve the problem. Can someone give advice on how to get pdf file indexing to work?

petschki · March 24, 2023, 6:50am

Go to /Plone/portal_transforms/manage_main in your site and check whether the pdf_to_text transform is in the list. If not, you can try reloading your transforms with the "Reload Transforms" tab.
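The list check can also be done from a debug prompt or script. The following is a rough sketch, assuming the portal_transforms tool behaves like a Zope folder and offers `objectIds()`. The helper `missing_transforms` and the `FakeTransformTool` stand-in are hypothetical names used only for illustration.

```python
def missing_transforms(transform_tool, wanted=("pdf_to_text",)):
    """Return the wanted transform ids that are not registered in the tool."""
    registered = set(transform_tool.objectIds())
    return [name for name in wanted if name not in registered]


class FakeTransformTool:
    """Hypothetical stand-in for portal_transforms so the sketch runs anywhere."""

    def __init__(self, ids):
        self._ids = list(ids)

    def objectIds(self):
        return list(self._ids)


if __name__ == "__main__":
    tool = FakeTransformTool(["html_to_text", "text_to_html"])
    print(missing_transforms(tool))
```

Against a real site, the argument would be the portal_transforms tool of the Plone site; an empty result means pdf_to_text is registered.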

deankarlen · March 24, 2023, 2:47pm

Hi Peter,

Thank you very much for the suggestion. I tried that. On the console I got the message:

"aborting transaction due to no CSRF protection"

On the browser page, I clicked the "confirm action" button, and the response was:

Those Transformations have been reloaded : (a list that does not include pdf_to_text)

If you have any other advice, please let me know. Thanks again!

petschki · March 27, 2023, 3:12pm

The transform searches for the pdftotext binary; see Products.PortalTransforms/pdf_to_text.py at master · plone/Products.PortalTransforms · GitHub

The binary has to be executable by the user that the Plone instance runs as.
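One quick way to test this is a small Python check run as the same user, and with the same PATH, as the Plone process. This is only a sketch using the standard library's `shutil.which`; it is not the lookup code that Products.PortalTransforms itself uses, and the helper names are made up for illustration.

```python
import shutil


def find_executable(name="pdftotext"):
    """Return the full path of `name` if it is on PATH and executable, else None."""
    # shutil.which only reports files that the current user is allowed to execute.
    return shutil.which(name)


def describe(name="pdftotext"):
    """Summarize the lookup result in one human-readable line."""
    path = find_executable(name)
    if path is None:
        return f"{name}: not found on PATH (or not executable by this user)"
    return f"{name}: found at {path}"


if __name__ == "__main__":
    print(describe())
```

If this reports the binary as found but the transform is still missing, the cause is more likely in how the transform is registered than in file permissions.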

deankarlen · March 27, 2023, 3:38pm

pdftotext (located in /usr/bin) has access mode: -rwxr-xr-x

petschki · March 28, 2023, 5:39am

OK ... then I would check what happens in this for loop when pdf_to_text is registered at application startup: Products.PortalTransforms/__init__.py at master · plone/Products.PortalTransforms · GitHub

You can also set your log level to DEBUG to see whether the TransformEngine logs anything about registering pdf_to_text.

yurj · March 28, 2023, 8:34am

petschki · March 28, 2023, 9:06am

OMG ... since when is that? That's an issue which should be reported I'd say ...

deankarlen · March 28, 2023, 1:42pm

Thank you Yuri and Peter. The workaround documented by Mr. Tango worked.

deankarlen · March 28, 2023, 2:49pm

Next problem...

While PDF files in folders are now correctly indexed, files in custom Dexterity types (using NamedBlobFile) are not. There is some documentation here, but both it and the source code are very hard to decipher. My attempts to mark the NamedBlobFile field to be indexed have failed.

petschki · March 29, 2023, 9:14am

Default File content has its own SearchableText indexer defined in plone.app.contenttypes.

I just tried the textindexer as documented and got the following traceback for a NamedBlobFile field:

Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 187, in transaction_pubevents
  Module transaction._manager, line 257, in commit
  Module transaction._manager, line 134, in commit
  Module transaction._transaction, line 267, in commit
  Module transaction._transaction, line 333, in _callBeforeCommitHooks
  Module transaction._transaction, line 372, in _call_hooks
  Module Products.CMFCore.indexing, line 317, in before_commit
  Module Products.CMFCore.indexing, line 227, in process
  Module Products.CMFCore.indexing, line 49, in reindex
  Module Products.CMFCore.CatalogTool, line 368, in _reindexObject
  Module Products.CMFPlone.CatalogTool, line 324, in catalog_object
  Module Products.ZCatalog.ZCatalog, line 495, in catalog_object
  Module Products.ZCatalog.Catalog, line 362, in catalogObject
  Module Products.ZCTextIndex.ZCTextIndex, line 170, in index_object
  Module plone.indexer.wrapper, line 65, in __getattr__
  Module plone.indexer.delegate, line 20, in __call__
  Module plone.app.dexterity.textindexer.indexer, line 85, in dynamic_searchable_text_indexer
AssertionError: expected converted value of IDexterityTextIndexFieldConverter to be a str

Sample Code:

from plone.app.dexterity import textindexer
from plone.namedfile import field as namedfile
from plone.supermodel import model

class IExampleSchema(model.Schema):
    test_file = namedfile.NamedBlobFile(title="searchable file")
    textindexer.searchable("test_file")

and enabled the behavior in example_type.xml:

<property name="behaviors" purge="false">
    <element value="plone.textindexer" />
</property>

petschki · March 29, 2023, 10:50am

I've fixed that here: Fix for searchable Named(Blob)File by petschki · Pull Request #362 · plone/plone.app.dexterity · GitHub

so the sample code above works now ...

deankarlen · March 29, 2023, 1:39pm

Thank you Peter! Very impressed that you were able to find the problem and fix this so quickly!

ronc · November 1, 2024, 3:44pm

Unfortunately, the problem of PDF files not being indexed still exists. Any idea how to apply Mr. Tango's fix for the pdf_to_text transform on a Plone 6 Classic installation (which doesn't have a /bin/instance)? Note that I installed the Plone 6 Classic site using the instructions here.

yurj · November 1, 2024, 6:53pm

github.com

collective/cookiecutter-plone-starter/blob/main/{{ cookiecutter.project_slug }}/backend/Makefile

### Defensive settings for make:
#     https://tech.davis-hansson.com/p/make/
SHELL:=bash
.ONESHELL:
.SHELLFLAGS:=-xeu -o pipefail -O inherit_errexit -c
.SILENT:
.DELETE_ON_ERROR:
MAKEFLAGS+=--warn-undefined-variables
MAKEFLAGS+=--no-builtin-rules

# We like colors
# From: https://coderwall.com/p/izxssa/colored-makefile-for-golang-projects
RED=`tput setaf 1`
GREEN=`tput setaf 2`
RESET=`tput sgr0`
YELLOW=`tput setaf 3`

BACKEND_FOLDER=$(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

PLONE_VERSION=$$(cat version.txt)

This file has been truncated.

cd backend
./bin/zconsole run instance/etc/zope.conf

ronc · November 4, 2024, 3:01pm

Okay, I may be missing something (probably am) but can you tell me which line(s) of your Makefile enable pdf indexing?