Issues with Links while Migrating with collective.exportimport

cihanandac · May 27, 2024, 8:47am

Hi Plone Community,

I am currently facing some challenges with collective.exportimport while migrating content from classic to Volto. Specifically, links in the imported content are not working correctly: they point to a "resolveuid" URL that does not lead anywhere.

Example link: href="/resolveuid/4fdd8ea41ca54a969eb21c4ef521ef28"

I found a feature called 'Fix links to content and images in richtext', but it fails with the following error:

Traceback (innermost last):
  Module ZPublisher.WSGIPublisher, line 181, in transaction_pubevents
  Module ZPublisher.WSGIPublisher, line 390, in publish_module
  Module ZPublisher.WSGIPublisher, line 285, in publish
  Module ZPublisher.mapply, line 85, in mapply
  Module ZPublisher.WSGIPublisher, line 68, in call_object
  Module collective.exportimport.fix_html, line 50, in __call__
  Module collective.exportimport.fix_html, line 315, in fix_html_in_content_fields
  Module plone.dexterity.utils, line 53, in iterSchemataForType
  Module plone.dexterity.schema, line 108, in decorator
  Module plone.dexterity.schema, line 156, in get
  Module plone.dexterity.fti, line 246, in lookupSchema
  Module plone.alterego.dynamic, line 26, in __getattr__
  Module plone.dexterity.synchronize, line 10, in synchronized_function
  Module plone.dexterity.schema, line 407, in __call__
  Module plone.dexterity.fti, line 257, in lookupModel
  Module plone.dexterity.fti, line 246, in lookupSchema
  Module plone.alterego.dynamic, line 26, in __getattr__
  Module plone.dexterity.synchronize, line 10, in synchronized_function
  Module plone.dexterity.schema, line 407, in __call__
  Module plone.dexterity.fti, line 257, in lookupModel
  Module plone.dexterity.fti, line 246, in lookupSchema
  Module plone.alterego.dynamic, line 26, in __getattr__
  Module plone.dexterity.synchronize, line 10, in synchronized_function
  Module plone.dexterity.schema, line 407, in __call__
  Module plone.dexterity.fti, line 257, in lookupModel
  Module plone.dexterity.fti, line 246, in lookupSchema
  Module plone.alterego.dynamic, line 26, in __getattr__
  Module plone.dexterity.synchronize, line 10, in synchronized_function
  Module plone.dexterity.schema, line 407, in __call__
  Module plone.dexterity.fti, line 257, in lookupModel
RecursionError: maximum recursion depth exceeded while calling a Python object

Additionally, I am not exporting/importing some of the pages but creating them manually. The URLs are the same as on the old website, but I am concerned about whether links to these manually created pages in the imported content will work correctly.

I would greatly appreciate any guidance or assistance you can provide to help resolve these issues. Has anyone encountered similar problems with the 'Fix links to content and images in richtext' feature or links in general during content migration? Any suggestions or workarounds would be immensely helpful.

Thank you in advance for your support.

espenmn · May 27, 2024, 10:34am

When you create content in Plone, each content item gets a unique ID (UID). Because of this, a link to the content does not break when you move or rename the item (that is, when its URL changes).

So the 'resolveuid' URL will find the content item that has the UID 4fd…28 and link to that.
If you create the content manually, it gets a new UID, and the link will not work anymore (since the old UID no longer exists).

I am not sure what the best approach is if you want to create pages manually, but maybe you could write a script that looks up the URL of each linked item in the content/catalog and replaces the links.
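
A minimal sketch of what such a script could look like, using only the standard library. The function name `replace_resolveuid` and the `uid_to_path` mapping are hypothetical; on a real site the mapping would have to be built from the catalog (for example from each brain's `UID` and `getPath()`), which is not shown here.

```python
import re

# Matches href="…/resolveuid/<32 hex chars>" with an optional #fragment.
RESOLVEUID_HREF = re.compile(
    r'href="(?:[^"]*/)?resolveuid/([0-9a-f]{32})(#[^"]*)?"'
)


def replace_resolveuid(html, uid_to_path):
    """Replace resolveuid links with plain paths from uid_to_path.

    uid_to_path is an assumed dict of UID -> site-relative path.
    Links whose UID is not in the mapping are left untouched.
    """

    def swap(match):
        uid = match.group(1)
        fragment = match.group(2) or ""
        path = uid_to_path.get(uid)
        if path is None:
            return match.group(0)
        return 'href="{}{}"'.format(path, fragment)

    return RESOLVEUID_HREF.sub(swap, html)
```

Leaving unknown UIDs unchanged makes it easy to search for leftover "resolveuid" links afterwards and handle them by hand.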

Alternatively something like: How to retain UUID when migrating content type?

You could also try searching Google or this forum for 'migrate uid'.

cihanandac · May 27, 2024, 11:50am

Thank you @espenmn for your answer.

As you said, for the manually created content, it would be hard to match it with the old UIDs, but I was wondering if there is any way to change this behavior so that it gives URLs as the href for the links.

I will check the link that you shared and also google "migrate uid".

yurj · May 27, 2024, 2:10pm

You have to run the fix on the site before migrating, and then export the content as is. On the new site, once the content is available, you can edit and save the pages; this should let the portal transform translate the URLs to resolveuid/<uid>.

espenmn · May 27, 2024, 4:44pm

Not 100% sure about this, but if you still have the old site running, you could run a script from your new site that takes each URL (of the new site) and checks which UUID it corresponds to on the old site (maybe plone.api.content.get(path=path) followed by plone.api.content.get_uuid(obj)), and then just change the UUID.

Some more brainstorming: export a list of paths and UIDs from the old site and a list of paths and UIDs from the new site, then search the links and replace each entry OLD123 with NEW123?
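
Roughly, that idea could be sketched like this in plain Python. It assumes both exports are available as dicts of path -> UID and that a page keeps the same path on both sites; the names `build_uid_map` and `swap_uids` are made up for illustration.

```python
import re

# A resolveuid reference followed by a 32-character hex UID.
UID_LINK = re.compile(r"resolveuid/([0-9a-f]{32})")


def build_uid_map(old_site, new_site):
    """Map old UIDs to new UIDs for every path present on both sites.

    old_site and new_site are assumed dicts of path -> UID.
    """
    return {
        old_uid: new_site[path]
        for path, old_uid in old_site.items()
        if path in new_site
    }


def swap_uids(html, uid_map):
    """Rewrite resolveuid links in one pass; unknown UIDs stay as they are."""
    return UID_LINK.sub(
        lambda m: "resolveuid/" + uid_map.get(m.group(1), m.group(1)),
        html,
    )
```

Doing the replacement in a single regex pass avoids rewriting a link twice if a new UID happened to equal some other old UID.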

cihanandac · May 27, 2024, 9:22pm

I ended up customizing collective.exportimport to swap UIDs with URLs in the links. I took this idea from "fix_html.py", and I am sharing my solution at the bottom of this message for people who might need it in the future.

Thanks to @espenmn and @yurj for sharing your knowledge and ideas.

Customized file: collective.exportimport/src/collective/exportimport/export_content.py at main · cihanandac/collective.exportimport · GitHub

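    # Module-level imports these methods rely on (assumed; adjust to your module):
    #   from bs4 import BeautifulSoup
    #   from plone import api
    #   from urllib.parse import urlparse
    #   import logging; logger = logging.getLogger(__name__)
    def fix_links(self, item, obj):
        """Fix links in rich text fields to convert resolveuid to normal URLs."""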
        for field_name, field_value in item.items():
            if isinstance(field_value, dict) and field_value.get(
                "content-type", ""
            ).startswith("text/html"):
                html = field_value.get("data", "")
                if html:
                    soup = BeautifulSoup(html, "html.parser")
                    old_portal_url = api.portal.get().absolute_url()

                    for tag, attr in [
                        (tag, attr)
                        for attr, tags in [
                            ("href", ["a"]),
                        ]
                        for tag in tags
                    ]:
                        self.fix_link_attr(soup, tag, attr, old_portal_url, obj=obj)

                    # Update the field with fixed links
                    field_value["data"] = str(soup)
                    logger.debug("Updated HTML content for field: %s", field_name)
                else:
                    logger.debug("No HTML content found for field: %s", field_name)
            else:
                logger.debug("Skipping non-HTML field: %s", field_name)
        return item

    def fix_link_attr(self, soup, tag, attr, old_portal_url, obj=None):
        """Fix the attribute of every matching tag passed within the soup."""
        logger.debug("Fixing %s[%s]", tag, attr)
        for content_link in soup.find_all(tag):
            origlink = content_link.get(attr)
            if not origlink:
                # Ignore tags without attr
                continue

            parsed_link = urlparse(origlink)
            if parsed_link.scheme in ["mailto", "file"]:
                continue

            if parsed_link.netloc and parsed_link.netloc not in old_portal_url:
                # Skip external URLs
                continue

            path = parsed_link.path
            if not path:
                # link to anchor only?
                continue

            # Get the UID from a resolveuid link and swap in the target's URL
            components = path.split("/")
            if "resolveuid" in components:
                uid = components[components.index("resolveuid") + 1]
                target_obj = api.content.get(UID=uid)
                if target_obj:
                    new_href = target_obj.absolute_url()
                    if parsed_link.fragment:
                        new_href += "#" + parsed_link.fragment

                    # Make the URL site-relative by stripping the portal URL prefix
                    if new_href.startswith(old_portal_url):
                        new_href = new_href[len(old_portal_url):]
                    content_link[attr] = new_href
                    logger.debug(
                        "Changed %s %s from %s to %s", tag, attr, origlink, new_href
                    )
                else:
                    logger.debug("No content found for UID %s", uid)
            else:
                logger.debug("No resolveuid found in %s", origlink)

        return soup
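
For anyone who wants to try the UID-extraction step outside Plone, here is a small standalone sketch of that part of fix_link_attr. The helper name `extract_uid` is hypothetical and not part of collective.exportimport; it also guards against a path that ends right after "resolveuid", which the method above does not check for.

```python
from urllib.parse import urlparse


def extract_uid(link):
    """Return (uid, fragment) for a link; uid is None if there is no resolveuid."""
    parsed = urlparse(link)
    components = parsed.path.split("/")
    if "resolveuid" not in components:
        return None, parsed.fragment
    idx = components.index("resolveuid") + 1
    # Guard against paths such as "/resolveuid" or "/resolveuid/"
    uid = components[idx] if idx < len(components) and components[idx] else None
    return uid, parsed.fragment
```

Calling `extract_uid("/resolveuid/4fdd8ea41ca54a969eb21c4ef521ef28#section")` gives the UID and the fragment separately, so the fragment can be appended again once the UID has been turned into a URL.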