Preface: text written to a CSV file shows up as garbled characters
Solution: change the encoding from UTF-8 to utf-8-sig
The difference between the two is as follows:
1. "UTF-8" uses the byte as its encoding unit, so its byte order is the same on every system. There is no byte order problem, and therefore no need for a BOM. As a result, when a file that starts with a BOM is read with the "utf-8" codec, the BOM is treated as part of the file content and shows up as a stray character at the start of the text
2. In "utf-8-sig", "sig" stands for "signature", that is, "UTF-8 with signature". When "utf-8-sig" reads a UTF-8 file that has a BOM, it separates the BOM from the text content, which is the result we expect. When writing, it puts the BOM at the start of the file
with open('data.csv', 'w', encoding='utf-8-sig') as fp:
    pass  # write the CSV content to fp here
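The difference between the two codecs can be checked directly on bytes, without touching a file. The snippet below is a minimal sketch; the sample text "name,city" is made up for illustration.

```python
import codecs

# A UTF-8 byte stream that starts with the BOM (EF BB BF).
raw = codecs.BOM_UTF8 + "name,city\n".encode("utf-8")

# "utf-8" keeps the BOM as a U+FEFF character at the start of the text.
as_utf8 = raw.decode("utf-8")

# "utf-8-sig" strips the BOM while decoding.
as_sig = raw.decode("utf-8-sig")

# "utf-8-sig" also accepts input that has no BOM at all.
no_bom = "name,city\n".encode("utf-8").decode("utf-8-sig")

# When encoding, "utf-8-sig" prepends the BOM.
encoded = "x".encode("utf-8-sig")

print(repr(as_utf8))
print(repr(as_sig))
```

Because "utf-8-sig" reads files both with and without a BOM, it is a reasonable default for reading CSV files of unknown origin.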
If you want Excel to open a CSV file saved as UTF-8 correctly, you need to add the BOM (byte order mark) at the start of the file. When the receiver sees a byte stream starting with EF BB BF, it knows that the content is UTF-8 encoded
So write the BOM first, before writing the file's actual content. See the Java code below
FileOutputStream fos = new FileOutputStream(new File(this.csvFileAbsolutePath));
byte[] bs = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }; // the UTF-8 BOM
fos.write(bs);
fos.write(…);
fos.close();
With the BOM in place, the CSV file can be opened directly in Excel without garbled characters
The problem I had was this: I downloaded a CSV file and opened it with Excel, and the Chinese text was garbled, yet Atom, Notepad++ and Notepad all displayed it correctly. After some research I found that Excel does not recognize a Unicode file without a BOM header; that is, Excel opens a CSV file as ANSI by default. So a BOM header has to be added
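For a CSV file that has already been downloaded, the BOM can be added after the fact. The helper below is a sketch, not part of the original fix: the name `ensure_utf8_bom` is hypothetical, and it assumes the file is already valid UTF-8 and small enough to read into memory.

```python
import codecs
from pathlib import Path


def ensure_utf8_bom(path):
    """Prepend the UTF-8 BOM to the file unless it already starts with one.

    Assumes the file content is already UTF-8 encoded.
    Returns True if the file was modified, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False  # already has a BOM, leave the file untouched
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Working on raw bytes avoids decoding and re-encoding the content, so the rest of the file is left exactly as it was.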
Meaning of BOM
BOM stands for byte order mark. The BOM was designed for UTF-16 and UTF-32, where it is used to mark the byte order. Take UTF-16 as an example: it uses two bytes as its encoding unit, so before interpreting UTF-16 text we must know the byte order of each encoding unit. For example, the Unicode code of "奎" (kui) is 594E, and that of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?
The method the Unicode specification recommends for marking byte order is the BOM. UCS contains a character called "zero width no-break space", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification suggests sending the character "zero width no-break space" before sending the byte stream. If the receiver gets FEFF, the byte stream is big endian; if it gets FFFE, the byte stream is little endian. This is why the character "zero width no-break space" is also called the BOM
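The byte order ambiguity can be reproduced with the two characters from the example, 奎 (U+594E) and 乙 (U+4E59). This is a small sketch using only the standard codecs:

```python
import codecs

# Big-endian UTF-16 for U+594E is the byte pair 59 4E.
big_endian = "奎".encode("utf-16-be")

# The same two bytes read as little endian give U+4E59 instead.
misread = big_endian.decode("utf-16-le")

# With a BOM in front, the plain "utf-16" codec picks the byte order itself.
with_be_bom = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")  # FE FF prefix
with_le_bom = (codecs.BOM_UTF16_LE + big_endian).decode("utf-16")  # FF FE prefix

print(big_endian.hex(), misread, with_be_bom, with_le_bom)
```

The payload bytes are identical in both cases; only the two-byte mark in front decides how they are interpreted.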
UTF-8 uses the byte as its encoding unit, so it has no byte order
To expand on this:
UTF-8 is processed one byte at a time, so it is not affected by the endianness of the CPU; to read the next byte, the address is simply incremented by 1
UTF-16 and UTF-32 are processed in units of two bytes and four bytes, that is, two or four bytes are read at a time. The order of the two or four bytes within a unit therefore has to be taken into account when storing the data or transmitting it over a network
UTF-8 BOM
The UTF-8 BOM is also called the UTF-8 signature. UTF-8 does not need a BOM to indicate byte order, but it can use one to indicate the encoding. When a text program reads a byte stream starting with EF BB BF, it knows that the content is UTF-8. Windows uses the BOM to mark the encoding of text files
Supplement:
The UCS code of the "zero width no-break space" character is FEFF, and its UTF-8 encoding is EF BB BF
That is, a byte stream starting with EF BB BF identifies itself as UTF-8. But if the file is already known to be UTF-8, the three bytes EF BB BF carry no extra information. In that sense, the BOM has no effect on UTF-8 itself
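The relationship between FEFF and EF BB BF is easy to confirm: encoding the character U+FEFF as UTF-8 yields exactly those three bytes. A short check:

```python
import codecs

# U+FEFF is the "zero width no-break space" character used as the BOM.
bom_char = "\ufeff"
utf8_bom = bom_char.encode("utf-8")

# Show the bytes as space-separated hex pairs.
hex_bytes = " ".join(f"{b:02x}" for b in utf8_bom)
print(hex_bytes)

# The standard library exposes the same bytes as a constant.
same_as_constant = utf8_bom == codecs.BOM_UTF8
```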
Disadvantages of including a BOM in a UTF-8 file
1. Impact on PHP
PHP was not designed with the BOM in mind. It does not skip the three bytes EF BB BF at the start of a UTF-8 encoded file but treats them as ordinary text, which results in parsing errors
2. Errors when executing an SQL script on Linux
Recently, during development, SQL files written on Windows kept failing when they were executed on Linux
Whether the comment at the start of the file was in Chinese or English, or was removed entirely, an error like SP2-0734: unknown command beginning "declare…" - rest of line ignored was reported
Here is the beginning of the file
--create tablespace
declare
v_tbs_name varchar2(200):='hytpdtsmsshistorydb';
begin
The error is as follows:
SP2-0734: unknown command beginning "?--create ..." - rest of line ignored.


PL/SQL procedure successfully completed.
I could not find a solution to this problem online, and the file encoding had already been changed to UTF-8, so it puzzled me for a long time
Finally I looked into the difference between UTF-8 with and without a BOM, converted the file to UTF-8 without a BOM, and the error went away
After the change, the script runs normally whether the comment is in Chinese or English, or is removed
Hard-won advice: it is best to save UTF-8 files without a BOM
The only difference between "UTF-8" and "UTF-8 with BOM" is whether there is a BOM, that is, whether U+FEFF is present at the beginning of the file
1. How to view the BOM on Linux: use the less command (other commands may not show it):
An extra <U+FEFF> is displayed at the start of the file.
2. How to remove the BOM from a UTF-8 file
Under Linux:
(1)
1) Open the file in Vim
2) Run :set nobomb
3) Save and quit with :wq
(2)
dos2unix filename
This converts a Windows-format file to a UNIX/Linux-format file. The command not only converts Windows line endings to UNIX/Linux line endings, it also converts UTF-8 Unicode (with BOM) to UTF-8 Unicode
PS:
In one case I compared, a UTF-8 Unicode (with BOM) file contained two <U+FEFF> characters. Whether method (1) or method (2) was used, it had to be run twice before the <U+FEFF> was completely removed
Under Windows:
Open the file with Notepad++, select the "Encoding" menu, choose the UTF-8 without BOM option, and then save the file again
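The same clean-up can be scripted. The sketch below removes every BOM at the start of a file in one pass, which also covers the case noted in the PS where a file carries two of them. The function names are hypothetical, and the code assumes the file fits in memory.

```python
import codecs
from pathlib import Path


def strip_utf8_boms(data: bytes) -> bytes:
    """Return data with all leading UTF-8 BOMs (EF BB BF) removed."""
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path) -> int:
    """Remove leading BOMs from the file in place.

    Returns the number of BOMs that were removed.
    """
    p = Path(path)
    original = p.read_bytes()
    cleaned = strip_utf8_boms(original)
    if cleaned != original:
        p.write_bytes(cleaned)
    return (len(original) - len(cleaned)) // len(codecs.BOM_UTF8)
```

Only BOMs at the very start of the file are touched; a U+FEFF that appears later in the text is left alone.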
Reference sources: https://www.cnblogs.com/Allen-rg/p/10536081.html