In fact, that is not the problem. It looks like some control characters (\t, \n, or similar; I am not sure exactly which yet) are messing up some of the pages. There is so much hard-coded HTML (thank you, Microsoft) and so many different encodings. Maybe it is better to do some kind of lxml clean. I will try that and report back: loop over the pages, read the body text, run bodytext = lxml.html.clean.clean_html(bodytext), and save bodytext.
Will report back.
I got rid of some by replacing \t and \n before the import, but it seems like there are more.
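For anyone hitting the same thing, here is a rough sketch of how the remaining characters could be found and stripped in one pass instead of replacing \t and \n one at a time. It uses only the standard library. The function names and the choice of character ranges (C0 controls, DEL, and C1 controls) are my own assumptions, not something confirmed against the actual pages, and replacing newlines with spaces would change the layout of any preformatted text.

```python
import re

# Assumption: "control characters" means the C0 range (which includes \t, \n, \r),
# DEL (\x7f), and the C1 range (\x80-\x9f). Adjust the class if some must be kept.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def find_control_chars(text):
    """Return the distinct control characters still present, sorted by code point."""
    return sorted(set(CONTROL_CHARS.findall(text)))


def strip_control_chars(text, replacement=" "):
    """Replace each control character, then collapse repeated spaces."""
    cleaned = CONTROL_CHARS.sub(replacement, text)
    return re.sub(r" {2,}", " ", cleaned).strip()


if __name__ == "__main__":
    bodytext = "<p>Hello\t\tworld\x0b</p>\r\n"
    # repr() makes otherwise invisible characters visible
    print([repr(c) for c in find_control_chars(bodytext)])
    print(strip_control_chars(bodytext))
```

Running find_control_chars on a broken page first should at least show which characters are left after the \t and \n replacement.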