I've got a use case in a project where the site managers want to have Files published in Plone, but not findable through any means: not through external search engines like Google, and also not through the internal search page. (We link to the individual PDFs through QR codes, but the information is only relevant for the product that carries the QR code on its packaging.)
That is not trivial in Plone. You can hide the folder listing in various ways so that the individual Files cannot be traversed to publicly. What I had forgotten, though, is that we almost always have the sitemap.xml(.gz) activated, and the sitemap still happily includes the individual Files for Google. Bummer.
Fortunately, robots.txt is applied on top of the sitemap.xml.gz, so we excluded the offending folders by adding them to robots.txt. Google Search Console now says that the items are found but not indexed (yay), but Files last indexed in March 2020 were still on display.
The bazooka solution: exclude the subfolders with a 'block' from within the Search Console, but this only lasts for six months.
Now for the internal search page. Luckily I found a thread on community from four years ago with the workaround trick (Hide folder from search): set an expiration date in the past on all Files/items you want to hide. OK, done. The only other option was to hide all Files from the search results, but there is a lot of information in PDFs we WANT to be found, just not the items from 2-3 folders.
So, mission accomplished, but I'm wondering whether we need some more functionality for this in core or through an add-on. The expiration trick feels like a workaround, since the content isn't actually expired, and how do I remove items from the sitemap.xml.gz? When I started this quest I had the idle hope that ticking 'exclude from navigation' would remove items from the sitemap, as the sitemap is also a means of navigation. That could be labelled as a bug.
But an 'exclude from search and sitemap' boolean on the settings tab of content items, provided through a behavior, seems like a good solution.
Once the pages include a noindex directive, you can either manually remove them via the Search Console or wait until the next time the pages are crawled.
Additionally, in your front-end server, you could prevent links to these pages from being followed by requiring that the HTTP_REFERER header passed by the browser be empty, e.g. in Apache something like the following (untested):
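A minimal, untested sketch, assuming Apache 2.4 and a hypothetical /hidden-folder/ path: requests that arrive with any Referer header (e.g. clicks from a search results page) are refused, while direct requests such as those from a scanned QR code still pass through.

<LocationMatch "^/hidden-folder/">
    # Mark requests that carry a non-empty Referer header
    SetEnvIf Referer "." has_referer
    <RequireAll>
        Require all granted
        # Refuse the request when it was reached by following a link
        Require not env has_referer
    </RequireAll>
</LocationMatch>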
If you can find a way to not index the specific objects, then they will also not show up in the sitemap.xml(.gz) [1].
[A] I guess reindexObject is still in use, so you could use a custom class for the content type (set it in the FTI as klass) and override reindexObject, something like this:
# Assumption: the default File class from plone.app.contenttypes; swap in your current class.
from plone.app.contenttypes.content import File


class MyFile(File):

    def reindexObject(self, idxs=[]):  # same signature as the stock reindexObject
        # Skip catalog (re)indexing entirely when the marker attribute is set.
        if getattr(self, 'please_dont_index', False):
            return
        super().reindexObject(idxs)
Then make please_dont_index available as an attribute (using a behavior; see the sketch below), and remove the existing entries from the catalogue.
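A minimal sketch of such a behavior schema, assuming a hypothetical add-on package; the field name please_dont_index matches the attribute checked above, and the behavior still needs the usual plone:behavior ZCML registration:

from plone.autoform.interfaces import IFormFieldProvider
from plone.supermodel import model
from zope import schema
from zope.interface import provider


@provider(IFormFieldProvider)
class IPleaseDontIndex(model.Schema):
    """Adds an 'exclude from search and sitemap' checkbox to content items."""

    # Show the field on the existing Settings tab of the edit form
    model.fieldset('settings', fields=['please_dont_index'])

    please_dont_index = schema.Bool(
        title='Exclude from search and sitemap',
        description='If checked, this item is not (re)indexed in the catalog.',
        required=False,
        default=False,
    )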
[B] Alternatively, if you have a small fixed set of folders, you could let everything be indexed as normal, and then, in ObjectModified and ObjectAdded event handlers, check whether the object is inside one of those folders and remove the indexed data again in the same transaction. This is somewhat wasteful, but easy; a sketch follows below.
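A minimal sketch of such a handler, with a hypothetical list of folder paths; it would be registered for IObjectAddedEvent and IObjectModifiedEvent with a plain ZCML subscriber directive:

from plone import api

# Hypothetical catalog paths of the folders whose content should stay hidden
HIDDEN_FOLDERS = (
    '/Plone/products/hidden',
)


def unindex_if_hidden(obj, event):
    """Event handler: remove freshly indexed data for objects inside the hidden folders."""
    path = '/'.join(obj.getPhysicalPath())
    if any(path == folder or path.startswith(folder + '/') for folder in HIDDEN_FOLDERS):
        catalog = api.portal.get_tool('portal_catalog')
        catalog.unindexObject(obj)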
[C] If the Files should not be public, you could introduce a "non-public" state which disallows public access. Downloading the file can then be done using a browser view that ignores permissions. Note that the links probably end up in Google anyhow if you're using Chrome (not validated, but not improbable).
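A minimal sketch of such a view, assuming the default plone.namedfile 'file' field and a hypothetical ?id= request parameter; it would be registered on the folder with the zope.Public permission so anonymous visitors can call it:

from Products.Five.browser import BrowserView


class UnrestrictedDownload(BrowserView):
    """Serve a non-public File's content without checking its permissions."""

    def __call__(self):
        # Hypothetical: the File id is passed as ?id=<shortname> on the folder
        file_id = self.request.form.get('id')
        obj = self.context.unrestrictedTraverse(file_id)
        namedfile = obj.file  # default field of the File content type
        response = self.request.response
        response.setHeader('Content-Type', namedfile.contentType)
        response.setHeader(
            'Content-Disposition',
            'attachment; filename="%s"' % namedfile.filename,
        )
        return namedfile.data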
Hi,
Giving the choice to index the object in the internal search engine or not, via a behavior, is great.
Is it possible to extend this use case to external search engines as well?
This could easily be achieved for normal views by registering a viewlet for that behavior that adds <meta name="robots" content="noindex"> into the HTML head slot, as in the sketch below.
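A minimal sketch of such a viewlet, assuming the hypothetical IPleaseDontIndex behavior from above; it would be registered with a browser:viewlet ZCML directive for the plone.app.layout.viewlets.interfaces.IHtmlHead manager and restricted to content providing that behavior interface:

from plone.app.layout.viewlets.common import ViewletBase


class RobotsNoindexViewlet(ViewletBase):
    """Tell external crawlers not to index pages of marked content."""

    def render(self):
        if getattr(self.context, 'please_dont_index', False):
            return '<meta name="robots" content="noindex, noarchive" />'
        return ''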
But this approach only works for HTML responses: JSON, files, images and all directly streamed content have no <head> section, so we must set a response header instead.
I think injecting an X-Robots-Tag: noindex, noarchive header into the response would do it.
Is there a traversal-based way, adapting on the interface that the behavior provides on the context, to do this?
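One possibility, sketched under the assumption that the behavior registers a marker interface (IPleaseDontIndex here, from a hypothetical my.addon package) on the context: subscribe to ZPublisher's IPubAfterTraversal event and set the header whenever the traversed content provides that marker. The handler still needs a plain subscriber line in ZCML.

from zope.component import adapter
from ZPublisher.interfaces import IPubAfterTraversal

# Hypothetical marker interface applied by the behavior registration (marker=...)
from my.addon.behaviors import IPleaseDontIndex


@adapter(IPubAfterTraversal)
def add_noindex_header(event):
    """After traversal, add X-Robots-Tag for marked content, regardless of response type."""
    request = event.request
    published = request.get('PUBLISHED', None)
    # The published object is often a view; fall back to its context
    context = getattr(published, 'context', published)
    if context is not None and IPleaseDontIndex.providedBy(context):
        if getattr(context, 'please_dont_index', False):
            request.response.setHeader('X-Robots-Tag', 'noindex, noarchive')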