After testing the Classic Plone online demo, I decided to try to set up a project with it.
In the process I ran into some problems I could only solve partially. I would be really happy if someone could help me out or point me in the right direction.
What I did:
installed Ubuntu Server 24.04.2
chose the Classic UI
followed the "Install Plone with Buildout" instructions
So far everything went well. But after creating a Plone site and uploading PDF files, their content was not indexed in SearchableText.
It obviously adds complexity, and it might be overkill if your only goal is indexing PDF and office files. But it supports indexing many data types. It uses Apache Tika under the hood.
I would use a custom indexer. I use docx2txt in a custom transform, but it can be adapted for indexing in a similar way.
from io import BytesIO

from Products.PortalTransforms.interfaces import ITransform
from zope.interface import implementer

import docx2txt


@implementer(ITransform)
class docx_to_text:
    __name__ = "docx_to_text"
    # MIME type of .docx files
    inputs = (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
    output = "text/plain"

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.__name__ = name

    def name(self):
        return self.__name__

    def convert(self, orig, idata, **kwargs):
        out = []
        # orig holds the raw bytes of the uploaded file
        text = docx2txt.process(BytesIO(orig))
        out.extend(self.clean_data(data=text))
        idata.setData(" ".join(out))
        return idata

    def clean_data(self, data=None):
        out = []
        if not data:
            return out
        # split on newlines and commas
        out.extend(data.replace("\n", ",").split(","))
        # drop duplicates (set() does not keep the original order)
        out = list(set(out))
        out = list(map(lambda item: item.strip('"\t '), out))
        return out


def register():
    return docx_to_text()
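To see roughly what the cleanup step hands to the index, it can be tried on its own outside Plone. This is only a sketch: the name clean_tokens is made up, and it deviates from clean_data above in two small ways (it strips before deduplicating, and it uses dict.fromkeys instead of set so the first-seen order is kept and empty tokens are dropped).

```python
def clean_tokens(text):
    """Split text on newlines and commas, strip quotes/whitespace, drop duplicates.

    Hypothetical standalone variant of clean_data, for illustration only.
    """
    if not text:
        return []
    parts = text.replace("\n", ",").split(",")
    stripped = [part.strip('"\t ') for part in parts]
    # dict.fromkeys keeps only the first occurrence of each token, in order
    return [token for token in dict.fromkeys(stripped) if token]


sample = 'Annual report, 2024\n"Annual report"\n\tBudget '
print(" ".join(clean_tokens(sample)))
```

Whether order matters depends on the index; for a plain full-text index like SearchableText it probably does not, but phrase searches could behave differently once the words are shuffled.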
Not really a complete solution, but I've used LibreOffice on Linux to convert Office documents to multiple formats, including plain text. I've also used it to export images from a .docx and generate thumbnails from them, for those users who embed photos in Word docs and email them, for some bizarre reason. Anyway, if you can export the text, then you can index it.
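For what it's worth, the LibreOffice route can be scripted roughly like this. It is a minimal sketch, assuming the soffice binary is on PATH and that the txt:Text export filter is good enough for your files; the function names are made up, and the encoding of the exported file may depend on your locale, so I read it leniently.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, binary="soffice"):
    """Return the headless LibreOffice command that exports source as plain text."""
    return [
        binary,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(source),
    ]


def office_to_text(source, outdir, timeout=120):
    """Convert an office document to text and return the extracted text."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(source, outdir), check=True, timeout=timeout)
    # LibreOffice writes <original stem>.txt into outdir
    result = outdir / (Path(source).stem + ".txt")
    # Assumption: undecodable bytes are replaced rather than raising
    return result.read_text(encoding="utf-8", errors="replace")
```

The returned string could then be passed to whatever does the indexing, for example a custom indexer or a transform like the one posted above.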