How to copy a Plone site to serve with nginx

tkimnguyen · December 31, 2016, 4:41pm

I've been poking around with wget to grab a static copy of the https://2015.ploneconf.org site to serve it with nginx.

Googling around, I found http://www.linuxjournal.com/content/downloading-entire-web-site-wget, so the command I've used is:

wget --recursive --no-clobber --page-requisites --html-extension --domains 2015.ploneconf.org https://2015.ploneconf.org

You can see the static copy at https://2015new.ploneconf.org/ (though any links on that page are not rewritten to use 2015new.ploneconf.org)

This does seem to grab everything correctly, but I've noticed that links to, say, Things To Do at https://2015.ploneconf.org/venue/#things-to-do are not converted to use the .html extension, i.e. you get a 403 error because the static URL should really be https://2015new.ploneconf.org/venue.html#things-to-do (and sure enough, wget did grab the venue.html page).

Plone is smart with its traversal... it just figures out that "venue#things-to-do" is not a directory but an anchor within a page.
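
To make the mapping described above concrete, here is a minimal Python sketch of how a Plone-style URL would translate to its static counterpart. The function name `static_url` is hypothetical. The rule "append .html to any path without a file extension" is an assumption based on the venue example, not something wget or Plone guarantees.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def static_url(url, static_host):
    """Map a Plone-style URL to the assumed static URL on static_host."""
    parts = urlsplit(url)
    # The "#things-to-do" anchor lives in parts.fragment, not in the path.
    path = parts.path.rstrip("/") or "/"
    # Assumption: pages have no file extension, assets (png, css, ...) do.
    if path != "/" and not posixpath.splitext(path)[1]:
        path += ".html"
    return urlunsplit(
        (parts.scheme, static_host, path, parts.query, parts.fragment)
    )


print(static_url("https://2015.ploneconf.org/venue/#things-to-do",
                 "2015new.ploneconf.org"))
```

The anchor is only ever interpreted by the browser, so the server side just needs to find venue.html for a request to /venue.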

Any suggestions on how to handle that sort of Plone URL with nginx?

tkimnguyen · December 31, 2016, 4:59pm

OK, trying as per https://linuxaria.com/pills/how-to-modify-an-url-extension-with-a-nginx-rewrite

# add .html to URI and serve file, directory, or symlink if it exists
if (-e $request_filename.html) {
  rewrite ^/(.*)$ /$1.html last;
  break;
}
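
For readers less familiar with nginx, the rule above can be modelled in a few lines of Python. This is only a sketch of the decision the `if (-e ...)` block makes, not of nginx itself. The function name `rewrite_uri` and its `root` parameter are hypothetical stand-ins for the rewrite and the server's document root.

```python
from pathlib import Path


def rewrite_uri(root, uri):
    """Return uri + '.html' if that file exists under root, else uri unchanged."""
    relative = uri.lstrip("/")
    # Same test as "-e $request_filename.html": does "<root>/<uri>.html" exist?
    if (Path(root) / (relative + ".html")).exists():
        return "/" + relative + ".html"
    return uri
```

With a venue.html file in the document root, `rewrite_uri(root, "/venue")` gives `/venue.html`. In this sketch, a URI with a trailing slash such as `/venue/` is left unchanged, because the path tested would be `venue/.html`.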

tkimnguyen · December 31, 2016, 5:04pm

Holy cr*p I think it works now https://2015.ploneconf.org

tkimnguyen · January 1, 2017, 7:51pm

To identify broken links, I used this command:

wget --spider -o wget.log -e robots=off --wait 1 -r -p https://2015.ploneconf.org

It worked great... tracked down and fixed 40+ broken links!
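
One possible way to pull the broken URLs out of wget.log is sketched below in Python. It assumes the log has the usual wget shape: each request starts with a line like `--2017-01-01 19:00:00--  <url>`, and a failed one is followed by a line containing `broken link`. The function name `broken_links` is hypothetical, and the regular expression may need adjusting for other wget versions or locales.

```python
import re

# Assumed request header line in the log: "--<date> <time>--  <url>"
REQUEST_LINE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")


def broken_links(log_text):
    """Return the URLs flagged as broken, in the order first seen."""
    current_url = None
    found = []
    for line in log_text.splitlines():
        match = REQUEST_LINE.match(line)
        if match:
            # Remember which URL the following status lines belong to.
            current_url = match.group(1)
        elif "broken link" in line and current_url and current_url not in found:
            found.append(current_url)
    return found
```

Usage would be along the lines of `broken_links(open("wget.log").read())`.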

gutow · January 5, 2017, 2:26pm

Take a look at https://github.com/jcu-eresearch/static-plone-wget for a quite generally applicable option for turning a Plone site static. He's written a script that takes care of a lot of the oddities. I used it recently to do a static backup of a site, including the private parts. Some of the recent changes are my fault because of issues I encountered.
Jonathan

pigeonflight · January 25, 2019, 6:59pm

I know I'm asking years later.
Out of curiosity why not httrack for this?

tkimnguyen · January 25, 2019, 7:16pm

Because I never thought of it? Looking now...

zopyx · January 25, 2019, 7:38pm

httrack is the tool for grabbing websites.

tkimnguyen · January 25, 2019, 7:41pm

Indeed! I don't know why I didn't think of it or run into it in my earlier searches. https://www.httrack.com/

I will shortly be static-ifying the 2016.ploneconf.org and 2017.ploneconf.org sites, so thx for your timely question, @pigeonflight!

djowett · January 28, 2019, 10:26pm

Maybe this would have been fun too?

erral · January 29, 2019, 6:47am

We are using this in a Plone 4.3.x site with a client and it works.