Default Plone 5 robots.txt disallows all for Google. How to bulk fix to allow for all sites on same server?

merpdotcom · October 29, 2018, 11:53pm

I was recently quite concerned to see the Google search listings disappear for ALL of the sites on our newish Plone 5 server. They did not just drop in rank, they disappeared completely (we are generally #1 on Google and all other search engines for our keywords). We are still #1 on the other search engines.
Did the Plone developers make this drastic change intentionally? Or did something else happen? We generally haven't touched the default robots.txt over all these years.
It appears that robots.txt now defaults to disallow all for Google indexing. Is that intentional? Is there a way to change the default to allow all (or whatever it was with Plone 4.x over the years)? Am I now going to have to manually edit the robots.txt on all of our sites? Or is there a way to do it en masse for about 40 sites simultaneously?
Thanks!

jaroel · October 30, 2018, 6:34am

The default is to allow everything: https://github.com/plone/Products.CMFPlone/blob/master/Products/CMFPlone/interfaces/controlpanel.py#L18

You can edit the value through the control panel at /@@site-controlpanel .
If that is not the robots.txt you see, then it is not served by Plone. It might be a File or DTML document in the Zope or Plone root, or something like that.
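
To see quickly which of many sites actually serve a blanket block, something like the following could help. It is only a rough sketch in Python 3 using the standard library: the function names `blanket_disallow_agents` and `fetch_robots` are made up for this example, and it only looks for a literal "Disallow: /" line per user-agent group, nothing more subtle.

```python
from urllib.request import urlopen


def blanket_disallow_agents(robots_txt):
    """Return the user agents whose group contains a bare 'Disallow: /'."""
    blocked = []
    agents = []        # user agents of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rule lines starts a new group.
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in agents if a not in blocked)
    return blocked


def fetch_robots(site_url, timeout=10):
    """Download <site_url>/robots.txt and return it as text."""
    url = site_url.rstrip("/") + "/robots.txt"
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Looping over your own list of site URLs and printing `blanket_disallow_agents(fetch_robots(url))` for each one shows which sites block which crawlers. An empty list means no group contains a blanket "Disallow: /".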

yurj · October 30, 2018, 7:24am

You can configure your web server to serve whatever robots.txt you want, for example:

merpdotcom · October 30, 2018, 5:45pm

So what went awry with this installation that ALL of the sites (30+) default to disallow in robots.txt?
Editing robots.txt for one site through Plone showed that it was set to disallow (as it is for all the sites on this install, for some odd reason), and the served file did change to allow afterwards, so it is Plone and not the front-end web server. Is there a patch I can run to fix this in Plone so that all current and future sites will default to Allow? Or am I going to have to try the aliasing with the front-end web server that Yuri suggested?

For example, I just created a new Plone site, http://132.148.245.43:8080/testrobots/, and its default robots.txt is Disallow.

Suggestions?

Plone 5.1.2.1 (5112)
CMF 2.2.12
Zope 2.13.27
Python 2.7.14 (default, Jun 26 2018, 10:14:38) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
PIL 5.1.0 (Pillow)

Thanks.

Rotonen · October 30, 2018, 6:09pm

I'm not parsing your issue.

http://132.148.245.43:8080/testrobots/robots.txt

Currently

Sitemap: http://132.148.245.43:8080/testrobots/sitemap.xml.gz

# Define access-restrictions for robots/spiders
# http://www.robotstxt.org/wc/norobots.html



# By default we allow robots to access all areas of our site
# already accessible to anonymous users

User-agent: *
Disallow:



# Add Googlebot-specific syntax extension to exclude forms
# that are repeated for each piece of content in the site
# the wildcard is only supported by Googlebot
# http://www.google.com/support/webmasters/bin/answer.py?answer=40367&ctx=sibling

User-Agent: Googlebot
Disallow: /*?
Disallow: /*atct_album_view$
Disallow: /*folder_factories$
Disallow: /*folder_summary_view$
Disallow: /*login_form$
Disallow: /*mail_password_form$
Disallow: /@@search
Disallow: /*search_rss$
Disallow: /*sendto_form$
Disallow: /*summary_view$
Disallow: /*thumbnail_view$
Disallow: /*view$

Everything should be allowed (User-agent: * with an empty Disallow:), and Googlebot is only disallowed from accessing very specific subviews, the search forms, or anything with a query string (Disallow: /*?).

That should work.
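
To check this against concrete URLs, here is a minimal sketch of Google-style pattern matching in Python 3. It assumes the usual reading of the wildcard syntax: `*` matches any run of characters, a trailing `$` anchors the end of the URL, and anything else is a prefix match. The helper names `rule_to_regex` and `is_blocked` are invented for this example. It ignores Allow rules (the file above has none) and percent-encoding, so treat it as an illustration rather than a reimplementation of Googlebot.

```python
import re

# Googlebot group copied from the robots.txt quoted above.
GOOGLEBOT_DISALLOW = [
    "/*?",
    "/*atct_album_view$",
    "/*folder_factories$",
    "/*folder_summary_view$",
    "/*login_form$",
    "/*mail_password_form$",
    "/@@search",
    "/*search_rss$",
    "/*sendto_form$",
    "/*summary_view$",
    "/*thumbnail_view$",
    "/*view$",
]


def rule_to_regex(rule):
    """Translate one Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + (r"\Z" if anchored else ""))


def is_blocked(path, rules=GOOGLEBOT_DISALLOW):
    """Return True if any non-empty Disallow rule matches the path."""
    # An empty rule ("Disallow:") blocks nothing, so it is skipped.
    return any(rule and rule_to_regex(rule).match(path) for rule in rules)


for path in ["/", "/front-page", "/front-page?set_language=en", "/folder/view"]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

With these rules the site root and plain content URLs come out as allowed, while anything containing a query string or ending in "view" is blocked. Note that the last rule also catches any URL that merely ends in "view", such as a page with the id "overview".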

merpdotcom · October 30, 2018, 6:16pm

According to this person on the Google forum, it is disallowing all of Google:

"JoyHawkins
Gold Product Expert · yesterday
I think I found the root of your problem: http://www.rpgtherapy.com/robots.txt

Google can't crawl your website if you disallow all crawling.

Joy Hawkins, Google My Business Product Expert
Owner of Sterling Sky and the Local Search Forum"

I manually edited http://www.rpgresearch.com/norobots.txt through the Plone site control panel to allow all, and now that site is starting to show up again, while the other sites are not. But I probably have the syntax all wrong in the former, since I generally never touch robots.txt.

I see that the older Plone 4.3 server (which works) has a much simpler robots.txt than the Plone 5 default (which isn't working correctly for Google): http://www2.rpgresearch.com/robots.txt
Is it the "Disallow: /*?" line that is killing Googlebot in the Plone 5 setup?

Rotonen · October 30, 2018, 6:34pm

Unless you have something appending a query string to every page, or Google has a bug, it should not.