Updating markdown documents with new text from Word

Sorry if this is off topic, but I am getting tired of manually updating documents with new text from Word.

I have a website (marfag.no) which has 'digital books' for maritime schools.

For this, I (usually) convert the text from Word files, then style it and fix typos etc. in Markdown.

Unfortunately, the authors often do their proofreading AFTER everything is finished, so I have to compare the two Word files and manually copy / paste all the changes (or redo all the 'styling' etc.).

Would it be possible to do this another way? Basically, I want 'something' that compares doc1.docx with doc2.docx and makes the same changes in chapter1.markdown.

This workflow does not make sense: it is broken by design and, as you explained yourself, does not work. If DOCX is your primary source of truth, then ensure that you convert from DOCX to Markdown without touching the generated Markdown. All changes must be done in Word, and all authors must work on the most current DOCX version and comply with formatting guidelines, style guides etc. Every other approach is kaputt. Numerous DOCX diff tools exist, from free to commercial, from useless to helpful. Again, you don't want to, and you should not, apply any changes manually to the target format.

The problem is that I add admonitions and other things that are not easy to do in Word (there is also a 'make PDF' function, so I have to change things a little to make them look good).

Since it is possible to 'tell a human how to do it', one day an AI will be able to do it.

Until then, I am thinking about something like this:

A transform (or similar) that removes all Markdown codes from both converted files, leaving only 'plain text'.
Compare the two 'plain text' documents and find the lines that differ. Replace those lines in the 'original Markdown' (but keep the 'Markdown codes'). Maybe wrap them in a <span class="changed"> or something to make it easy to see what has been changed.

Or maybe it is possible to remove 'all markdown codes' from both documents, then compare them and make a list of changes and have 'something automatic' search / replace those lines. (and keep a log of changes that were not done)
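
A minimal Python sketch of that idea, under some loud assumptions: the styled Markdown, the old plain text and the new plain text are aligned line by line, and only a handful of Markdown codes matter. The regular expressions, the function name `update_markdown` and the log messages are made up for illustration; real Markdown has many more cases.

```python
import re

# Hypothetical, deliberately small patterns: a block prefix (heading, list
# item, blockquote) and a few inline codes (bold, italic, code spans).
BLOCK_PREFIX = re.compile(r"^(\s*(?:#{1,6}\s+|[-*+]\s+|\d+\.\s+|>\s*)?)(.*)$")
INLINE_CODES = re.compile(r"(\*\*|__|\*|_|`)")


def update_markdown(md_lines, old_plain, new_plain):
    """Replace changed lines in md_lines while keeping block prefixes.

    Assumes the three lists are aligned line by line and that lines carry
    no trailing newline. Returns (updated lines, log of (line number, note)).
    """
    updated, log = [], []
    rows = zip(md_lines, old_plain, new_plain)
    for number, (md, old, new) in enumerate(rows, start=1):
        if old == new:
            updated.append(md)  # nothing changed in Word
            continue
        prefix, body = BLOCK_PREFIX.match(md).groups()
        plain = INLINE_CODES.sub("", body)
        if plain != old:
            # The Markdown was hand-edited after conversion: do not guess.
            updated.append(md)
            log.append((number, "no match, left as is"))
            continue
        updated.append(f'{prefix}<span class="changed">{new}</span>')
        if body != plain:
            log.append((number, "inline codes dropped"))
    return updated, log
```

Lines whose stripped text no longer matches the old Word text are left untouched and end up in the log, which is the 'changes that were not done' list. Inline emphasis inside a changed line is lost in this sketch and has to be redone by hand.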

Did you try using AI? I don't think it would be a problem. Start with a few pages from the beginning. It took me 10 minutes of chat to explain to the AI how to change a page/keywords map from:
1,b:c:
2,c:d
3,a:b
4,d,e
to a map keyword/page
a,3
b,1,3
c,1,2
d,2,4
e,4
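
For comparison, the same inversion is a few lines of Python. This is only a sketch: it assumes each input line is a page number, a comma, then keywords separated by either `:` or `,`, and the function name `invert_index` is made up. Read that way, the sample input puts `c` on pages 1 and 2.

```python
import re
from collections import defaultdict


def invert_index(lines):
    """Turn 'page,kw:kw' lines into sorted 'kw,page,page' lines."""
    keyword_pages = defaultdict(list)
    for line in lines:
        page, _, rest = line.partition(",")
        # Accept ':' or ',' between keywords; skip empty pieces.
        for keyword in re.split(r"[:,]", rest):
            if keyword:
                keyword_pages[keyword].append(int(page))
    return [
        ",".join([keyword, *map(str, sorted(pages))])
        for keyword, pages in sorted(keyword_pages.items())
    ]
```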

No, I thought it would be 'too much text'.

An approach like you described here doesn't seem very difficult to implement in general and might cover a big chunk of your use cases. It gets more difficult in case you have to edit documents that do not conform to the "normal" structure, and whenever your editors start changing the way they edit the documents in the future (without telling you of course). -- Still I agree with Andreas that your basic workflow is broken.

What you describe reminds me of a chaotic way to edit many technical datasheets using Word, that eventually became a Plone-based system to update thousands of separate documents. My suggestion would be to rethink the way your books are being generated, since layout (including "admonitions" or "export to PDF links") is just the sauce that you add at the end of your production process.

AI might be the way to go "soon" but for now I judge it as an excellent way to break your already chaotic system because it moves any logic from your update process to the AI black box. And this black box tends to hallucinate nonsense a lot in my experience (even the newer commercial models). Maybe you could train your own custom LLM and get better results?

Maybe you could get some temporary relief from changing this

Stop accepting junk from your editors, force them to review their changes before publishing...

Unfortunately, my editors are teachers :grinning:

And? Is your role the one of a solution provider or are you an internal editor? If yes, deal with it.

All medical guidelines on our portal onkopedia.com are written in DOCX by professors.
Clearly a question of education and enforcement of rules.
If you can't say NO to your teachers, deal with it or refuse to accept improper content.

Unfortunately, those look like they were made in the 1980s; my site requires a completely different level of design.

My teachers have usually written the book before they contact me, so it is a bit too late to tell them how to do things.

Anyway: this is the situation I am in, I am currently doing the changes manually, but it would be nice to build something that does it faster for the next time.

The authors will use the tools with which they are most familiar, and I assume that that will never change. They won't switch to Markdown or MyST. If that is true, then you must edit the Word files in their source format to avoid the tedium.

I would insert replaceable slugs in the Word doc, then, when it is exported or saved to another format, replace each slug with the desired content.
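
A rough sketch of the slug idea in Python. The slug format (`SLUG` followed by letters and digits, chosen because a converter is unlikely to escape plain alphanumerics), the function name `replace_slugs` and the admonition text are all assumptions for illustration, not something from an existing tool.

```python
import re

# Hypothetical slug format typed by the author in Word, e.g. SLUGknots1.
SLUG = re.compile(r"\bSLUG([A-Za-z0-9]+)\b")


def replace_slugs(text, snippets):
    """Swap each SLUG<name> for snippets[name].

    Unknown slugs are left in place and reported, so nothing vanishes silently.
    Returns (new text, list of slug names that had no snippet).
    """
    missing = []

    def substitute(match):
        name = match.group(1)
        if name in snippets:
            return snippets[name]
        missing.append(name)
        return match.group(0)

    return SLUG.sub(substitute, text), missing
```

Usage would be something like `replace_slugs(converted_markdown, {"knots1": '!!! note "Knots"\n    Practise the bowline.'})`, run as the last step after every conversion, so the Word file stays the only place where the text is edited.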

I can't tell if you want to include your admonitions in the PDF, or what is the ultimate published format for consumption.

Anyway, there are tools that allow conversion of documentation on the command line, including Sphinx, pandoc, and libreoffice or soffice depending on your operating system.

For one of my clients, they receive Word documents with embedded images, and they wanted those images to be displayed inline on the web. I wrote a script that, when such a file is uploaded to the website, calls soffice to save it as HTML with each image as a separate file. It then loops over all the images and generates a thumbnail for each, then puts it all into storage where it can be displayed.
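
Not the actual script, but a minimal sketch of the conversion step in Python. It assumes `soffice` is on the PATH; the function names are made up, and whether images come out as separate files depends on the LibreOffice version and export filter. Thumbnail generation is left out because it needs an imaging library.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def soffice_html_command(docx_path, out_dir):
    """Build the headless LibreOffice command that converts a file to HTML."""
    return [
        "soffice", "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_html(docx_path, out_dir):
    """Run the conversion; return the HTML path and any image files found."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(soffice_html_command(docx_path, out_dir), check=True)
    html_file = out_dir / (Path(docx_path).stem + ".html")
    images = sorted(
        path for path in out_dir.iterdir()
        if path.suffix.lower() in IMAGE_SUFFIXES
    )
    return html_file, images
```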

Off topic, but for documentation:

Since the changes to this book were almost entirely typos, I did the following:

I search/replaced/AppleScript-ed the changes in QuarkXPress, since I am more familiar with doing things like that there.

After that, I was left with 'all paragraphs that had been changed marked with yellow background'.

After that, I (manually) replaced all the changed paragraphs (but it still took more than a day).

I assume that a script which looped through each paragraph of 'my marked text' and searched for 'what is probably the place to replace the text' would find 'almost everything' (at least if I inserted line breaks after each '.' before comparing them).
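
Such a 'probably the right place' search could be sketched with Python's `difflib`, assuming the changes are small (typos) so the corrected paragraph is still very similar to the old one. The 0.8 threshold and the function names are guesses for illustration and would need tuning on real text.

```python
import difflib
import re


def split_sentences(text):
    """Naive split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def apply_changes(md_paragraphs, changed_paragraphs, threshold=0.8):
    """Replace each Markdown paragraph by its most similar changed paragraph.

    Returns (updated paragraphs, changed paragraphs that found no home).
    """
    updated = list(md_paragraphs)
    unmatched = []
    for changed in changed_paragraphs:
        best_index, best_ratio = None, threshold
        for index, paragraph in enumerate(md_paragraphs):
            ratio = difflib.SequenceMatcher(
                None, paragraph, changed, autojunk=False
            ).ratio()
            if ratio >= best_ratio:
                best_index, best_ratio = index, ratio
        if best_index is None:
            unmatched.append(changed)  # leave for manual handling
        else:
            updated[best_index] = changed
    return updated, unmatched
```

Running `split_sentences` on both sides first gives smaller units to compare, which is the 'line break after each full stop' trick. The comparison here is on raw text, so Markdown codes in the old paragraph lower the similarity and are overwritten on replacement.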

A bit off topic: In my case, I do a lot of 'design', and the result can also be exported to PDF and ebook as well as being 'readable' on PCs, tablets and phones. So I add, for example, stacktable if tables are 'wide', and 'if clauses'. For example, I can use

 !!!pdf "something"

or

  {.hidein-pdf}  
  {.pdf-600}

to show something only in PDFs, or to show it differently there, and similarly for mobile / PC.

Mostly: it is extremely boring to do these changes, so 'programming it' would be 'more fun'.

The way an output layout looks is irrelevant. It just depends on the standards that you define for your end product. The important part is that your input data is predictable, so if your teachers can't spell or don't have access to a spellchecker, you would add that part in your pipeline before even getting to the layout part.

PS: Between everyone in this thread, including yourself, there’s close to 100 years of hardcore development experience. You have received some well thought out alternatives on how to solve your problem. In my view, the responses deserve more consideration rather than being dismissed so quickly.

I think it should be possible to do this.

If the line numbers are exactly the same, something like this should work, I think.
Now, I just need to figure out how to do the same thing if/when they don't (do a 'proper diff', not just a loop).
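
For the case where the line numbers drift, one option is `difflib.SequenceMatcher.get_opcodes()` from the Python standard library, which reports runs of equal, replaced, inserted and deleted lines. A minimal sketch, assuming the styled Markdown still has exactly one line per line of the old plain text; the function name `merge_changes` is made up.

```python
import difflib


def merge_changes(old_lines, new_lines, md_lines):
    """Carry old -> new edits over to md_lines.

    Unchanged lines keep their styled Markdown version; replaced or inserted
    lines come in as plain text and need restyling; deleted lines are dropped.
    """
    if len(md_lines) != len(old_lines):
        raise ValueError("md_lines must align with old_lines")
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(md_lines[i1:i2])
        elif tag in ("replace", "insert"):
            merged.extend(new_lines[j1:j2])
        # tag == "delete": the lines were removed in Word, so skip them
    return merged
```

Because the opcodes describe whole runs, an added or removed paragraph only shifts the lines after it instead of breaking every later comparison, which is what a plain index loop cannot handle.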

[UPDATE 10 Oct: I have gotten a lot further and am updating the gist.]

… with grep search replace, that could work