
Prevent BeautifulSoup's renderContents() From Changing &nbsp; To Â

I'm using bs4 to do some work on some text, but in some cases it converts &nbsp; characters to Â. The best I can tell is that this is an encoding mismatch from UTF-8 to latin

Solution 1:

OK, apparently I hadn't read deeply enough into the docs. Here's where the answer can be found:

From https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings:

The problem is that the snippet passed to BeautifulSoup is so short that its sub-library, Unicode, Dammit, doesn't have enough information to guess the encoding correctly.

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. ...you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.
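To see where the Â comes from, here is a minimal sketch using only the Python standard library. It assumes the stray character appears because UTF-8 bytes are decoded as Latin-1, which is the mismatch suspected in the question; the helper name misread_as_latin1 is made up for illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as Latin-1 (the wrong codec)."""
    return text.encode("utf-8").decode("latin-1")


nbsp = "\u00a0"  # NO-BREAK SPACE, the character that &nbsp; stands for

# In UTF-8 the no-break space is two bytes: 0xC2 0xA0.
print(nbsp.encode("utf-8"))  # b'\xc2\xa0'

# Latin-1 maps every byte to exactly one character, so 0xC2 becomes "Â"
# and 0xA0 becomes a no-break space again.
garbled = misread_as_latin1("a" + nbsp + "b")
print(garbled == "a\u00c2\u00a0b")  # True
```

A wrong guess of this kind turns one invisible character into a visible Â followed by the original no-break space.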

So the key is to pass from_encoding="UTF-8" each time the BeautifulSoup object is constructed:

soup = BeautifulSoup(soup.renderContents(), "html.parser", from_encoding="UTF-8")
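A fuller round trip might look like the sketch below. The sample markup and variable names are invented for illustration, and it assumes the input really is UTF-8. It uses encode_contents(), which is the bs4 name for the older renderContents().

```python
from bs4 import BeautifulSoup

# Hypothetical input: a short UTF-8 byte string containing a no-break space.
raw = "<p>price:\u00a0100</p>".encode("utf-8")

# Name the encoding explicitly instead of letting Unicode, Dammit guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# encode_contents() returns bytes, so the second parse needs the hint as well.
rendered = soup.encode_contents(encoding="utf-8")
reparsed = BeautifulSoup(rendered, "html.parser", from_encoding="utf-8")

text = reparsed.get_text()
print("\u00c2" in text)  # False
```

Note that from_encoding only matters when the markup is bytes. If a str is passed instead, there is nothing to decode and bs4 ignores the argument with a warning.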
