[Solved] Python Error: UnicodeDecodeError: ‘gb2312’ codec can’t decode byte 0xa4 in position… : illegal multibyte sequence

1. Error scenarios

Python raises “UnicodeDecodeError: ‘gb2312’ codec can’t decode byte 0xa4 in position…: illegal multibyte sequence” in two common scenarios:

1. Crawl Chinese website content

html = requests.get(url).content.decode("gb2312")

2. Read GBK encoded files

with open(filename, 'r', encoding='gb2312') as f:
    result = f.read()

2. Cause of the error

GB2312 covers only 6,763 commonly used simplified Chinese characters. Decoding fails when the data contains characters outside that set, such as traditional characters, which usually means the content was actually encoded in GBK or GB18030.

Character set coverage: GB2312 < GBK < GB18030
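The coverage gap can be reproduced without a network request or a file. This is a minimal sketch, assuming the sample string 繁體, whose second character is traditional and outside GB2312. The variable names are illustrative only.

```python
# "體" is a traditional character: present in GBK and GB18030, absent from GB2312.
sample = "繁體"
raw = sample.encode("gbk")

try:
    raw.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError as exc:
    gb2312_ok = False
    print("gb2312 failed:", exc.reason)

# The wider encodings decode the same bytes without error.
decoded_gbk = raw.decode("gbk")
decoded_gb18030 = raw.decode("gb18030")
print(decoded_gbk == sample, decoded_gb18030 == sample)
```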

3. Solutions

1. Ignore decoding errors

html = requests.get(url).content.decode('gb2312', errors='ignore')

The errors parameter of the decode function defaults to 'strict': decode(encoding, errors='strict'). You can use this second parameter to control the error handling strategy. 'strict' means that an exception is raised when an illegal byte sequence is encountered
if it is set to 'ignore', illegal bytes are skipped
if it is set to 'replace', illegal bytes are replaced with the Unicode replacement character (U+FFFD)
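The three strategies can be compared side by side. This is a minimal sketch, assuming GBK-encoded sample bytes that contain one character GB2312 cannot represent. The exact strings printed depend on how the codec resynchronizes after the bad byte, so no expected output is shown.

```python
raw = "繁體字".encode("gbk")  # "體" is outside GB2312

# strict (the default): raises UnicodeDecodeError
try:
    raw.decode("gb2312")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# ignore: undecodable bytes are dropped, so "體" cannot be recovered
ignored = raw.decode("gb2312", errors="ignore")

# replace: undecodable bytes become U+FFFD, which marks where data was lost
replaced = raw.decode("gb2312", errors="replace")

print(strict_raised)
print(repr(ignored))
print(repr(replaced))
```

Both 'ignore' and 'replace' lose information, so they suit cases where an approximate result is acceptable.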

2. Replace GB2312 with GBK, which has a more comprehensive Chinese character set

with open(filename, 'r', encoding='gbk') as f:
    result = f.read()
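When the exact encoding of a file is unknown, one option is to try the GB-family codecs from narrowest to widest. The helper `read_text_with_fallback` below is hypothetical (it is not part of any library), and the temporary file only stands in for a real GBK file.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("gb2312", "gbk", "gb18030")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path!r}")


# Demo: write a GBK file containing a traditional character, then read it back.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("繁體中文".encode("gbk"))
    demo_path = tmp.name

text, used = read_text_with_fallback(demo_path)
os.remove(demo_path)
print(used, text)
```

Since GB18030 is a superset of the other two, passing only `("gb18030",)` is a simpler choice when it does not matter which codec matched.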

Note: if 'ignore' is used to suppress the error, the illegal bytes are dropped and the Chinese text that is read may be garbled or incomplete. To read the Chinese content accurately, first convert the GB2312/GBK-encoded content to UTF-8 and then read it.

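import requests

res = requests.get(url)
# Decode the raw response bytes with GBK (a superset of GB2312), then re-encode them as UTF-8
utf8_bytes = res.content.decode("gbk").encode("utf-8")
html = utf8_bytes.decode("utf-8")
print(html)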

The Chinese characters are now output correctly.
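The conversion step can also be tried offline. This is a minimal sketch in which `transcode_to_utf8` is a hypothetical helper and `page_bytes` stands in for the raw body of a GBK-encoded page (the bytes that requests exposes as `res.content`). GB18030 is assumed as the source encoding because it is a superset of GB2312 and GBK.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str = "gb18030") -> bytes:
    """Decode bytes from a GB-family encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Stand-in for the body of a GBK-encoded page.
page_bytes = "<p>繁體中文</p>".encode("gbk")
utf8_bytes = transcode_to_utf8(page_bytes)
html = utf8_bytes.decode("utf-8")
print(html)
```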