Prevent BeautifulSoup's renderContents() From Changing To Â
I'm using bs4 to process some text, but in some cases it converts characters to Â. As best I can tell, this is an encoding mismatch: UTF-8 bytes being read as Latin-1.
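A quick way to check that hypothesis, using only the standard library. This is a minimal sketch that assumes the affected character is a non-breaking space (U+00A0), which the question does not confirm; the variable names are illustrative.

```python
# Assumption: the character being mangled is a non-breaking space (U+00A0).
nbsp = "\u00a0"

# In UTF-8 a non-breaking space is two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to its own character,
# so 0xC2 becomes "Â" and 0xA0 becomes a non-breaking space again.
misread = utf8_bytes.decode("latin-1")

# Decoding with the matching codec gives the original character back.
correct = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(repr(correct))
```

If the stray Â always appears directly before whitespace, that is consistent with this kind of mismatch.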
Solution 1:
OK, apparently I hadn't read deep enough into the docs. Here's where the answer can be found:
From https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings:
The problem is that the snippet of code provided to BS is so short that BeautifulSoup's sub-library, Unicode, Dammit, doesn't have enough info to properly guess the encoding.
"Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. ...you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding."
So the key is to add from_encoding="UTF-8" each time the BS is constructed:
soup = BeautifulSoup(soup.renderContents(), "html.parser", from_encoding="UTF-8")
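For anyone who wants to try this in isolation, here is a minimal sketch of the same idea. The markup in `raw` is a made-up example (it assumes the troublesome character is a UTF-8 non-breaking space), and it uses `encode_contents()`, the bs4 name for what `renderContents()` does.

```python
from bs4 import BeautifulSoup

# Hypothetical input: "a", a UTF-8 encoded non-breaking space (0xC2 0xA0), "b".
raw = b"<p>a\xc2\xa0b</p>"

# State the encoding explicitly instead of letting Unicode, Dammit guess it.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
text = soup.p.get_text()

# Round trip: serialize back to bytes (UTF-8 by default), then re-parse,
# again naming the encoding so the short snippet is not misdetected.
rendered = soup.encode_contents()
reparsed = BeautifulSoup(rendered, "html.parser", from_encoding="utf-8")
reparsed_text = reparsed.p.get_text()

print(repr(text))
print(repr(reparsed_text))
```

Note that from_encoding only applies when the markup is passed as bytes; if the input is already a str, it is already decoded and there is nothing to guess.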