Relative links in index.html, migrated content

I am migrating a website from FrontPage2003.
On the old site, there are thousands of folders, each containing an index.htm file.
When migrating to Plone, these become the default views of their folders (which is good).
Unfortunately, the relative links 'miss one folder': a link in index.htm that should resolve to
/path/to/current_folder/articles/somedocument
instead points to
/path/to/articles/somedocument.
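One likely explanation, assuming the default view is served at the folder URL without a trailing slash: a relative href replaces the last path segment of its base URL. A minimal sketch with the standard library's urljoin (example.com is a placeholder host, not the real site):

```python
from urllib.parse import urljoin

href = "articles/somedocument"

# Old site: the page URL ends in index.htm, so the file name is the segment
# that gets replaced by the relative link.
old_base = "http://example.com/path/to/current_folder/index.htm"

# After migration (assumption): the default view is served at the folder URL
# itself, without a trailing slash, so the folder name gets replaced instead.
new_base = "http://example.com/path/to/current_folder"

old_target = urljoin(old_base, href)
new_target = urljoin(new_base, href)
with_slash = urljoin(new_base + "/", href)

print(old_target)
print(new_target)
print(with_slash)
```

With a trailing slash on the base, the link resolves the same way as on the old site.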

Is there any option other than renaming the index.htm files (and not making them the default view)?

Can't you fix the index.htm files with the right links?

I have 1000 index.html files and they link to 'a lot of subpages' (the links were made manually, 20 years ago).

The plan is to rename the folders on import, too (the names contain spaces and capital letters).

Just fix the imported data.
lxml is fast enough, so load the imported data and walk all the a tags to replace them with UUID-based links.
Then rename the folders and do a victory dance.
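A rough sketch of that idea, assuming the imported text is available as an HTML string and that a mapping from old site-relative paths to UIDs was built during import. The function name, the `uid_by_path` mapping and the UID value are hypothetical; `resolveuid/<UID>` is the link form Plone uses for UID-based links in rich text.

```python
from urllib.parse import urljoin, urlsplit

from lxml import html


def rewrite_to_uid_links(page, page_path, uid_by_path):
    """Replace internal <a href> values with resolveuid/<UID> links.

    page_path is the old site-relative path of the page being rewritten;
    uid_by_path maps old site-relative paths to UIDs (hypothetical mapping).
    """
    tree = html.fromstring(page)
    for a in tree.xpath("//a[@href]"):
        href = a.get("href")
        parts = urlsplit(href)
        # Skip external links, mailto: links and pure in-page anchors.
        if parts.scheme or parts.netloc or not parts.path:
            continue
        # Resolve the relative link against the page's old location.
        target = urlsplit(urljoin(page_path, href)).path
        uid = uid_by_path.get(target)
        if uid is None:
            continue  # unknown target: leave the link untouched
        new_href = f"resolveuid/{uid}"
        if parts.fragment:
            new_href += f"#{parts.fragment}"
        a.set("href", new_href)
    return html.tostring(tree, encoding="unicode")


result = rewrite_to_uid_links(
    '<p><a href="articles/somedocument">Doc</a> '
    '<a href="http://example.org/">External</a></p>',
    "/path/to/current_folder/index.htm",
    {"/path/to/current_folder/articles/somedocument": "abc123"},
)
print(result)
```

Because the links no longer depend on paths, renaming the folders afterwards should not break them.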

Maybe you can kill the index.html files and just use the default folder listing instead?

Did you consider using collective.folderishtypes and getting rid of the default_page?

Not really; getting the text from the HTML files into Plone is already 'confusing'. I am importing them into Plone 5, but I might upgrade, and then all types will be folderish anyway (?), so I don't want to use more add-ons than necessary.
(There are thousands of internal links, and links from other websites, to these index.html pages, but a rewrite rule could probably fix that.)

Yes, I will go for that (rename all 18.000 (!)), then probably write a rewrite rule for index.htm(l) to index (or just drop '.html' and '.htm'; not sure which is easiest).
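The 'just drop the extension' variant can be sketched as a plain path mapping. This is only an illustration of the rule in Python; the real rule would live in the web server configuration, and the function name is made up for this example.

```python
import re

# Matches a trailing .htm or .html, in any capitalisation.
HTML_EXTENSION = re.compile(r"\.html?$", re.IGNORECASE)


def drop_html_extension(path):
    """Map /folder/index.htm(l) to /folder/index; leave other paths alone."""
    return HTML_EXTENSION.sub("", path)


print(drop_html_extension("/path/to/current_folder/index.htm"))
print(drop_html_extension("/path/to/current_folder/index.html"))
print(drop_html_extension("/path/to/current_folder/photo.jpg"))
```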

  1. How do I do that?
  2. There will probably be broken links (with almost 20.000 pages, docs and images). Is there a smart way to deal with them at the same time (get a list, maybe)?
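For the second question, one possible approach is to build the list on the file system before importing. The sketch below uses only the standard library and assumes the old site is a directory tree of .htm/.html files in which internal links are relative or root-relative paths; the class and function names are made up for this example.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect <a href> and <img src> values from one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if (tag == "a" and name == "href") or (tag == "img" and name == "src"):
                self.links.append(value)


def find_broken_links(root):
    """Return (page, link) pairs whose internal target does not exist on disk."""
    root = Path(root)
    broken = []
    for page in sorted(root.rglob("*.htm*")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in collector.links:
            parts = urlsplit(link)
            # Skip external links, mailto: links and pure in-page anchors.
            if parts.scheme or parts.netloc or not parts.path:
                continue
            rel = unquote(parts.path)
            # Root-relative links start at the site root, others at the page's folder.
            base = root if rel.startswith("/") else page.parent
            if not (base / rel.lstrip("/")).exists():
                broken.append((page.relative_to(root).as_posix(), link))
    return broken
```

Calling `find_broken_links('/path/to/dir')` on a copy of the exported site gives a list that can be reviewed or written to a file.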

(Typically, the people who made the original site used 'illegal characters' such as æøå in some URLs. I have tried to replace them all with a regexp, but there will probably be some left.)

Could this be an idea for your task?

from pathlib import Path

from lxml import html

directory = '/path/to/dir'

for path in Path(directory).rglob('*.html'):
    print(f"{path}")
    with open(path, "r") as input_file:
        page: str = input_file.read()
    tree: html.HtmlElement = html.fromstring(page)

    # only visit <a> tags that have an href (named anchors have none)
    for a in tree.xpath("//a[@href]"):
        print(f"    {a.attrib['href']}")
        # check whether the href is valid or broken
        # possibly convert absolute paths to relative ones or vice versa
        # replace a.attrib['href'] with your modified href
        # a.attrib['href'] = "your/modified/href"

    # write the modified tree to file; html.tostring() returns bytes,
    # so the file has to be opened in binary write mode
    # with open(path, "wb") as output_file:
    #     output_file.write(html.tostring(tree))

The example above replaces the files in the file system. I'd consider doing it in the file system before you import the data into Plone.

If you write to files, run this on a copy of your data! Otherwise the originals will be overwritten.

This should work with the least amount of effort, assuming those index.html files are just the content listing.
You can use plone.api to grab all content objects called 'index.html' and set the view on the parent.

Use a normalize function, hold on a sec while I look this up :wink:
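Until then, here is a minimal stand-alone sketch of what such a function could do for names with spaces, capital letters and æøå. It is not Plone's own normalizer (Plone ships normalizers in plone.i18n, whose output may differ); the function name and the æ/ø/å replacements are assumptions chosen for this example.

```python
import re
import unicodedata

# Assumed replacements: æ and ø have no Unicode decomposition, so they need
# an explicit mapping; å is mapped explicitly as well to get 'aa'.
SPECIAL = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "ae", "Ø": "oe", "Å": "aa"}


def normalize_id(name):
    """Turn a file or folder name into a lowercase, ASCII-only, URL-safe id."""
    name = "".join(SPECIAL.get(ch, ch) for ch in name)
    # Split accented letters into base letter + accent, then drop the accents.
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii").lower()
    # Collapse every run of other characters (spaces, punctuation) into one dash.
    return re.sub(r"[^a-z0-9._]+", "-", name).strip("-")


print(normalize_id("Årsmøte og Ærlighet 2003"))
print(normalize_id("My Folder"))
```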

(Just) for later reference: it looks like FrontPage 2003/Windows used an encoding I had not even heard of. Most of it can apparently be fixed by decoding with 'Windows-1252' and then encoding as utf-8:

   'Windows-1252').encode('utf-8')
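Spelled out as a small helper, the conversion could look like the sketch below. The function name is made up, and the 'try UTF-8 first' check is an added assumption so that files that are already UTF-8 are not converted twice.

```python
def to_utf8(raw):
    """Return the bytes as UTF-8, treating non-UTF-8 input as Windows-1252."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # A few byte values are undefined in Windows-1252; replace them
        # instead of failing on the whole file.
        return raw.decode("windows-1252", errors="replace").encode("utf-8")


legacy = "æøå".encode("windows-1252")
print(to_utf8(legacy).decode("utf-8"))
```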

Thanks.

While looking for how to do this with images too, I stumbled upon

html.make_links_absolute(base_url)

Hopefully, this would make it possible to easily convert the links by using context.absolute_url or similar if I run it from within Plone. I will post back when I have tried.
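Outside Plone, the call can be tried on a small fragment. In this sketch the base URL is a placeholder standing in for whatever context.absolute_url() would return, and the helper name is made up; make_links_absolute rewrites image src attributes as well as link hrefs.

```python
from lxml import html


def absolutize(fragment, base_url):
    """Return the fragment with all relative links resolved against base_url."""
    tree = html.fromstring(fragment)
    tree.make_links_absolute(base_url)
    return html.tostring(tree, encoding="unicode")


result = absolutize(
    '<div><a href="articles/doc">x</a><img src="images/pic.jpg"></div>',
    "http://example.com/site/folder/",  # placeholder base URL
)
print(result)
```

Note the trailing slash on the base URL: without it, the last path segment would be replaced rather than extended.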