I've got a use case in a project where the site managers want to have Files published in Plone, but not findable through any means: not through external search engines like Google, and also not through the internal search page. (We link to the individual PDFs through QR codes, but the information is only relevant for the product that carries the QR code on its packaging.)
That is not trivial in Plone. You can hide the folder listing in various ways so that the individual Files cannot be traversed to publicly. What I had forgotten, though, is that we almost always have the sitemap.xml(.gz) activated, and the sitemap still happily includes the individual Files for Google. Bummer.
Fortunately, robots.txt is applied on top of the sitemap.xml.gz, so we excluded the offending folders by adding them to robots.txt. Google Search Console now says that the items are found but not indexed (yay), but Files last indexed in March 2020 were still on display.
The bazooka solution: exclude the subfolders with a 'block' from within the Search Console, but this only lasts for six months.
Now for the internal search page. Luckily I found a thread on community from four years ago with the workaround trick (Hide folder from search): set an expiration date in the past on all Files/items you want to hide. OK, done. The only other option was to hide all Files from the search results, but there is a lot of information in PDFs we WANT to be found, just not the items from 2-3 folders.
So, mission accomplished, but I'm wondering whether we need some more functionality for this in core or through an add-on. The expiration trick feels like a workaround, since the content isn't actually expired, and how do I remove items from the sitemap.xml.gz? When I started this quest I had the idle hope that ticking 'exclude from navigation' would remove items from the sitemap, as the sitemap is also a means of navigation. That could be labelled as a bug.
But an 'exclude from search and sitemap' boolean on the settings tab of content items, provided through a behavior, seems like a good solution.
Once the pages include a noindex directive, you can either manually remove them via the Search Console or wait until the next time the pages are crawled.
Additionally, in your front-end server, you could prevent links to these pages from being followed by requiring that the HTTP_REFERER header passed by the browser be empty, e.g. in Apache something like the following (untested):
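A minimal, untested sketch, assuming Apache 2.4 and a hypothetical /hidden-folder/ path: requests that arrive with any Referer header (e.g. clicks from a search results page) are refused, while direct requests such as those from a scanned QR code still pass through.

<LocationMatch "^/hidden-folder/">
    # Mark requests that carry a non-empty Referer header
    SetEnvIf Referer "." has_referer
    <RequireAll>
        Require all granted
        # Refuse the request when it was reached by following a link
        Require not env has_referer
    </RequireAll>
</LocationMatch>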
If you can find a way to not index the specific objects, then they will also not show up in the sitemap.xml(.gz) [1].
[A] I guess reindexObject is still in use, so you could use a custom class for the content type (set it in the FTI as klass) and override reindexObject, something like this:
# Assumption: the default File class from plone.app.contenttypes; swap in your current class.
from plone.app.contenttypes.content import File


class MyFile(File):

    def reindexObject(self, idxs=[]):  # same signature as the stock reindexObject
        # Skip catalog (re)indexing entirely when the marker attribute is set.
        if getattr(self, 'please_dont_index', False):
            return
        super().reindexObject(idxs)
Then make please_dont_index available as an attribute (using a behavior; see the sketch below), and remove the existing entries from the catalogue.
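A minimal sketch of such a behavior schema, assuming a hypothetical add-on package; the field name please_dont_index matches the attribute checked above, and the behavior still needs the usual plone:behavior ZCML registration:

from plone.autoform.interfaces import IFormFieldProvider
from plone.supermodel import model
from zope import schema
from zope.interface import provider


@provider(IFormFieldProvider)
class IPleaseDontIndex(model.Schema):
    """Adds an 'exclude from search and sitemap' checkbox to content items."""

    # Show the field on the existing Settings tab of the edit form
    model.fieldset('settings', fields=['please_dont_index'])

    please_dont_index = schema.Bool(
        title='Exclude from search and sitemap',
        description='If checked, this item is not (re)indexed in the catalog.',
        required=False,
        default=False,
    )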
[B] Alternatively, if you have a small fixed set of folders, you could let everything be indexed as normal, and then, in ObjectModified and ObjectAdded event handlers, check whether the object is inside one of those folders and remove the indexed data again in the same transaction. This is somewhat wasteful, but easy; a sketch follows below.
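A minimal sketch of such a handler, with a hypothetical list of folder paths; it would be registered for IObjectAddedEvent and IObjectModifiedEvent with a plain ZCML subscriber directive:

from plone import api

# Hypothetical catalog paths of the folders whose content should stay hidden
HIDDEN_FOLDERS = (
    '/Plone/products/hidden',
)


def unindex_if_hidden(obj, event):
    """Event handler: remove freshly indexed data for objects inside the hidden folders."""
    path = '/'.join(obj.getPhysicalPath())
    if any(path == folder or path.startswith(folder + '/') for folder in HIDDEN_FOLDERS):
        catalog = api.portal.get_tool('portal_catalog')
        catalog.unindexObject(obj)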
[C] If the Files should not be public, you could introduce a "non-public" state which disallows public access. Downloading the file can then be done using a browser view that ignores permissions. Note that the links probably end up in Google anyhow if you're using Chrome (not validated, but not improbable).
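A minimal sketch of such a view, assuming the default plone.namedfile 'file' field and a hypothetical ?id= request parameter; it would be registered on the folder with the zope.Public permission so anonymous visitors can call it:

from Products.Five.browser import BrowserView


class UnrestrictedDownload(BrowserView):
    """Serve a non-public File's content without checking its permissions."""

    def __call__(self):
        # Hypothetical: the File id is passed as ?id=<shortname> on the folder
        file_id = self.request.form.get('id')
        obj = self.context.unrestrictedTraverse(file_id)
        namedfile = obj.file  # default field of the File content type
        response = self.request.response
        response.setHeader('Content-Type', namedfile.contentType)
        response.setHeader(
            'Content-Disposition',
            'attachment; filename="%s"' % namedfile.filename,
        )
        return namedfile.data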
Hi,
Giving the choice to index the object in the internal search engine or not, via a behavior, is great.
Is it possible to extend this use case to external search engines as well?
This could easily be achieved for normal views by registering a viewlet for that behavior that adds <meta name="robots" content="noindex"> into the HTML head slot, as in the sketch below.
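A minimal sketch of such a viewlet, assuming the hypothetical IPleaseDontIndex behavior from above; it would be registered with a browser:viewlet ZCML directive for the plone.app.layout.viewlets.interfaces.IHtmlHead manager and restricted to content providing that behavior interface:

from plone.app.layout.viewlets.common import ViewletBase


class RobotsNoindexViewlet(ViewletBase):
    """Tell external crawlers not to index pages of marked content."""

    def render(self):
        if getattr(self.context, 'please_dont_index', False):
            return '<meta name="robots" content="noindex, noarchive" />'
        return ''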
But this approach only works for HTML responses: JSON, files, images and all directly streamed content have no <head> section, so we must set a response header instead.
I think injecting an X-Robots-Tag: noindex, noarchive header into the response would do it.
Is there a traversal-based way, adapting on the interface that the behavior provides on the context, to do this?
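One possibility, sketched under the assumption that the behavior registers a marker interface (IPleaseDontIndex here, from a hypothetical my.addon package) on the context: subscribe to ZPublisher's IPubAfterTraversal event and set the header whenever the traversed content provides that marker. The handler still needs a plain subscriber line in ZCML.

from zope.component import adapter
from ZPublisher.interfaces import IPubAfterTraversal

# Hypothetical marker interface applied by the behavior registration (marker=...)
from my.addon.behaviors import IPleaseDontIndex


@adapter(IPubAfterTraversal)
def add_noindex_header(event):
    """After traversal, add X-Robots-Tag for marked content, regardless of response type."""
    request = event.request
    published = request.get('PUBLISHED', None)
    # The published object is often a view; fall back to its context
    context = getattr(published, 'context', published)
    if context is not None and IPleaseDontIndex.providedBy(context):
        if getattr(context, 'please_dont_index', False):
            request.response.setHeader('X-Robots-Tag', 'noindex, noarchive')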