Scraping / extracting data from html

espenmn · March 16, 2023, 5:47pm

What would be the fastest way (I have a lot of files) to get the "Off.no" value (in this case 1058787)? I plan to do it from Plone (with a script) after I have added the HTML. Should I use Beautiful Soup, or is it better to convert the HTML to plain text (somehow) and look for the word after "Off.no:"?

Update: Cleaned up pasted code

 <td ="" valign="top" class="eeece1"> 
     <p> <span lang="EN-US" xml:lang="EN-US"> Off.no:</span></p>
 </td> 
 <td ="" valign="top" class="eeece1"> 
       <p> <span> 1058787</span></p>
</td> </tr>

zopyx · March 16, 2023, 6:07pm

Seriously, why do HTML parsers exist? For exactly such purposes.

tiberiuichim · March 16, 2023, 8:19pm

Only yesterday one of my colleagues had to do a similar task. Check his code if you need some inspiration: freshwater.content/measures.py at 8bdf687b5e40d772d434111157440ccab18bd887 · eea/freshwater.content · GitHub

espenmn · March 17, 2023, 9:51am

tiberiuichim wrote: "Only yesterday one of my colleagues had to do a similar task"

PS: I cleaned up the code I pasted; I'm not sure why the HTML got 'doubled'.

What I am a bit confused about is 'picking' the right value. I would have to find a td that contains (or is exactly) 'Off.no:' and then get the value of the next td (?).
I don't have any classes or ids that are specific here (all the HTML was produced 'manually', in FrontPage < 2003).

tiberiuichim · March 17, 2023, 9:52am

Yes, that's usually how it's done.
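
A minimal sketch of that "find the label cell, take the next cell" approach, in case it helps. It uses only Python's standard-library html.parser, so it has no extra dependency; Beautiful Soup can express the same idea with find() and find_next_sibling(). The names LabelValueParser, extract_value and SAMPLE are made up for illustration, and the sample markup is the snippet from the first post with the stray ="" attribute removed, so real FrontPage output may need more tolerance than this.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []        # cleaned text of each <td> seen so far
        self._in_td = False    # True while we are inside a <td>
        self._buffer = []      # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            # Collapse whitespace so "  Off.no: " becomes "Off.no:"
            self.cells.append(" ".join("".join(self._buffer).split()))
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buffer.append(data)


def extract_value(html, label="Off.no:"):
    """Return the text of the cell right after the cell equal to `label`."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    for i, text in enumerate(parser.cells[:-1]):
        if text == label:
            return parser.cells[i + 1]
    return None  # label not found, or it was the last cell


# Hypothetical sample, modelled on the snippet in the first post
SAMPLE = """
<table><tr>
 <td valign="top" class="eeece1">
     <p> <span lang="EN-US" xml:lang="EN-US"> Off.no:</span></p>
 </td>
 <td valign="top" class="eeece1">
       <p> <span> 1058787</span></p>
 </td> </tr></table>
"""

print(extract_value(SAMPLE))
```

This sketch assumes flat tables: a table nested inside a cell would confuse the simple in-cell flag. Matching the whole cell text against the label, rather than searching the plain text of the page, avoids picking up "Off.no:" if it also appears somewhere in running prose.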