Plone search returns results based on HTML tags

On my site, when the user searches for "style guide", Plone returns a bunch of results, none of which has the word style in it until page 3, where the actual style guide is finally returned.

When I looked into the top few results to try to figure out what was going on, I realized that the pages that were at the top had a lot of html tags with style= something, and I think that is why those pages are being returned!

Is there any way for search to bypass html tag contents and just search for the actual content of the pages? Any help is much appreciated, thanks

That’s surprising. I thought the full text index looked only at the content of a Document, so it shouldn’t include CSS or JS. Are you doing something custom with your Document objects?

Thanks for your reply. No I'm not doing anything special at all. I know it's reading the <span style="xyz"> tags because when I remove the tags from the documents individually, they move off the search results page, and nothing else has changed. Any additional insight welcome!

The default behavior for Plone's built-in types has, for as long as I can remember, been to run rich text through an HTML to plain text transform in portal_transforms. This is true for Plone 6.x, for Plone 5.2 if using collective.dexteritytextindexer, and for all types defined in plone.app.contenttypes in Plone 5.x. The SearchableText accessor for content in earlier versions of Plone likely worked similarly.

If you are seeing richtext/HTML indexed, something is not going according to the standard script, as it were.
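
To illustrate what that HTML to plain text step is expected to do, here is a minimal sketch using only the Python standard library. It is not Plone's actual transform; `TextOnly` and `html_to_plain_text` are hypothetical names, and the sample markup is made up for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes only; tag names and attributes are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, never for the tags themselves.
        self.chunks.append(data)


def html_to_plain_text(html):
    """Return the visible text of an HTML fragment (illustrative only)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so adjacent block elements do not run
    # together, then collapse repeated whitespace.
    return " ".join(" ".join(parser.chunks).split())


sample = '<ul><li style="list-style-type: none;">Ordered priorities for service:</li></ul>'
plain = html_to_plain_text(sample)
print(plain)
```

If the text that reaches the index still contains words such as style or li, a step like this was probably skipped somewhere.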

There is probably more information needed here:

  1. What version of Plone are you using?
  2. Is the indexed content made up of stock (built-in) types, or custom types?
  3. If Plone < 6.x, is this content based on Dexterity or Archetypes?
  4. You should have an html_to_text transform inside /portal_transforms — do you see that?

There may be other questions to resolve based on the answers to the above.

You can check what actually is stored in the index in the ZMI, inside portal_catalog and then Catalog tab.

In a typical development environment the URL is http://127.0.0.1:8080/Plone/portal_catalog/manage_catalogView .

You can then search and click on a content item and check the results.

This is in Plone 5.x.

Thanks for your reply.

  1. Plone 5.2.2
  2. Some of the content is built-in types, some is custom Dexterity content types... is this what you're asking?
  3. Dexterity, I think... I'm not sure what Archetypes are
  4. yes, I do see that

Cool, this is interesting, except that searching seems to do nothing. Whenever I search for anything (under path filter), it says "no objects in catalog", so I just have to look at the main list and keep hitting "next 20" to see more. This happens even when I put a valid path in the path filter, and I don't really understand why.

Keep in mind that the path stored in the catalog and available for filter is:

  1. absolute
  2. starts at the Zope root, not the site root

This means that if the id of your Plone site is mysite, and you have a folder full of stuff at top-level-folder/another-folder-here, then you need to put a path like /mysite/top-level-folder/another-folder-here into the input and click the Set path filter button.
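
The path rule can be sketched as a small helper. This assumes the Plone site sits directly in the Zope root, as in the example; `catalog_path` is a hypothetical name for illustration, not a Plone API.

```python
def catalog_path(site_id, relative_path=""):
    """Build the absolute path the catalog path filter expects.

    Assumes the Plone site object lives directly in the Zope root.
    """
    parts = [site_id.strip("/")]
    # Drop empty segments so leading, trailing, or doubled slashes are harmless.
    parts.extend(segment for segment in relative_path.split("/") if segment)
    return "/" + "/".join(parts)


print(catalog_path("mysite", "top-level-folder/another-folder-here"))
```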

Thanks! That worked. I can see that "style" is one of the words in the searchable text of a document that does not contain the word style (other than in the HTML). Any ideas on how I fix this? If it helps, it's a custom Dexterity type, based on the Page type... maybe that's why it's indexed incorrectly like this...

Edit: I just looked at a page that is of the Page content type and it also contains html tags in its list of SearchableText, things like span ul li text align left etc. etc. How come it's doing this?

Wild guess: you have a broken transform in your site. I'm unclear how to figure out the exact cause of your precise issue, but you could try the following (in a development copy of your site; I have not tested whether this works):

  • Add a stock Plone site elsewhere in your Zope root
  • Remove portal_transforms in your site root via the ZMI
  • Use copy and paste in the ZMI to copy portal_transforms from the new stock Plone site into your current site

This might break things, this might be benign but unhelpful, or this might help.

Just to add a data point. I have a Plone 5.2.x site and I could not reproduce the bug.

I added a Page with <span style="color: red">Lorem ipsum</span> , and the indexed content does not contain the word style.

Does it have the word style in the title or in any other field? SearchableText can index other fields too.

No

Could you please share the HTML code of one of the pages, along with the value of the SearchableText index in the portal_catalog?

I would reindex the object to see if the word "style" shows up again. If you have only a few objects, you can do a clear and rebuild of the catalog.

<div id="content-core">
<div class=" kssattr-atfieldname-text kssattr-templateId-widgets/rich kssattr-macro-rich-field-view" id="parent-fieldname-text">
<p>Ordered priorities for service:</p>
<ul type="disc">
<li style="list-style-type: none;">
<ol start="1" type="1">
<li>Patrons who are in the department or library are served on a first come, first served basis.</li>
<li>Telephone inquiries are handled based on guidelines for<span> </span><a class="internal-link" href="http://192.168.222.217:8080/olli/manuals/reference-services/reference-service-guidelines/telephone-reference-service-guidelines" title="Telephone Reference Service Guidelines">Telephone Reference Service Guidelines</a>.</li>
<li>Inquiries referred from other branches are completed within twenty-four hours.</li>
<li>Reference inquiries that require extensive research are prioritized based on requestor and staff time available.</li>
<li>Library orientation and bibliographic instruction for groups.</li>
</ol>
</li>
<li>Limitations on service must be placed because of staff time available and research time available for each person or question.  The number of questions per patron and the amount of time available for telephone inquiries may be limited.  Reference librarians will normally be available to spend 5-10 minutes working with an individual patron.</li>
</ul>
</div>
</div>

Searchable text =

['priorities', 'of', 'service', 'div', 'id', 'content', 'core', 'div', 'class', 'kssattr', 'atfieldname', 'text', 'kssattr', 'templateid', 'widgets', 'rich', 'kssattr', 'macro', 'rich', 'field', 'view', 'id', 'parent', 'fieldname', 'text', 'p', 'ordered', 'priorities', 'for', 'service', 'p', 'ul', 'type', 'disc', 'li', 'style', 'list', 'style', 'type', 'none', 'ol', 'start', '1', 'type', '1', 'li', 'patrons', 'who', 'are', 'in', 'the', 'department', 'or', 'library', 'are', 'served', 'on', 'a', 'first', 'come', 'first', 'served', 'basis', 'li', 'li', 'telephone', 'inquiries', 'are', 'handled', 'based', 'on', 'guidelines', 'for', 'span', 'span', 'a', 'class', 'internal', 'link', 'href', 'http', '192', '168', '222', '217', '8080', 'olli', 'manuals', 'reference', 'services', 'reference', 'service', 'guidelines', 'telephone', 'reference', 'service', 'guidelines', 'title', 'telephone', 'reference', 'service', 'guidelines', 'telephone', 'reference', 'service', 'guidelines', 'a', 'li', 'li', 'inquiries', 'referred', 'from', 'other', 'branches', 'are', 'completed', 'within', 'twenty', 'four', 'hours', 'li', 'li', 'reference', 'inquiries', 'that', 'require', 'extensive', 'research', 'are', 'prioritized', 'based', 'on', 'requestor', 'and', 'staff', 'time', 'available', 'li', 'li', 'library', 'orientation', 'and', 'bibliographic', 'instruction', 'for', 'groups', 'li', 'ol', 'li', 'li', 'limitations', 'on', 'service', 'must', 'be', 'placed', 'because', 'of', 'staff', 'time', 'available', 'and', 'research', 'time', 'available', 'for', 'each', 'person', 'or', 'question', 'the', 'number', 'of', 'questions', 'per', 'patron', 'and', 'the', 'amount', 'of', 'time', 'available', 'for', 'telephone', 'inquiries', 'may', 'be', 'limited', 'reference', 'librarians', 'will', 'normally', 'be', 'available', 'to', 'spend', '5', '10', 'minutes', 'working', 'with', 'an', 'individual', 'patron', 'li', 'ul', 'div', 'div', 'priorities', 'of', 'service']
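
For what it's worth, the word list above looks like what you would get if the raw HTML were split into words with no tag stripping first. A minimal sketch of that, assuming a simple regex splitter as a stand-in (Plone's real lexicon splitter is different; `naive_words`, `MARKUP_TOKENS`, and `leaked_markup` are hypothetical names for illustration):

```python
import re

# A few tag and attribute names that should not normally appear as indexed
# words. This is only a heuristic: real content can legitimately contain
# words like "style" or "class".
MARKUP_TOKENS = {"div", "span", "ul", "ol", "li", "href", "style", "class"}


def naive_words(text):
    """Split text into lowercase alphanumeric words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaked_markup(words):
    """Return the markup-looking tokens found in a list of indexed words."""
    return sorted(MARKUP_TOKENS.intersection(words))


raw = '<li style="list-style-type: none;">'
words = naive_words(raw)
print(words)
print(leaked_markup(words))
```

Splitting that one tag yields li, style, list, style, type, none, which is the same sequence that appears in the SearchableText list above.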

thanks. how do I do that?

Hi!

It seems you're not using the html2text transform. Check the configuration in portal_transforms. You also need the lxml package installed, but it should already be installed.

As for clear and rebuild: in the ZMI, go to portal_catalog; the Advanced tab has a Clear and Rebuild button.
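
Reindexing a single object instead of the whole catalog could look roughly like the sketch below. It relies on `reindexObject(idxs=...)`, which Plone content inherits from CMFCatalogAware. `reindex_searchable_text` is a hypothetical helper, and `FakeContent` is only a stand-in so the snippet runs outside Plone.

```python
def reindex_searchable_text(obj):
    """Ask one content item to recompute only its SearchableText index."""
    # Passing idxs limits the work to the named index instead of all indexes.
    obj.reindexObject(idxs=["SearchableText"])


class FakeContent:
    """Stand-in for a content item; NOT a real Plone class."""

    def __init__(self):
        self.calls = []

    def reindexObject(self, idxs=None):
        # Record which indexes were requested, for demonstration only.
        self.calls.append(list(idxs or []))


page = FakeContent()
reindex_searchable_text(page)
print(page.calls)
```

In a real debug session the object would come from traversal and the change would need a transaction commit; both are left out here because they depend on your setup.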

Thanks. I'm looking at portal_transforms right now, and there is html_to_text but I don't know where to go from here. How do I make sure it's using html_to_text?

Also, how do I check if the lxml package is installed? Thanks,
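
One way to check for lxml is to ask Python itself, as in the sketch below. It has to be run with the same Python environment that Plone uses (for example from a debug session of the instance in a buildout-based install; the exact script name depends on your setup). `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


print("lxml installed:", module_available("lxml"))
```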

Is the field's mimetype text/html? Where does the text above come from?