
The difference between UTF-8 and utf-8-sig

Preface: text comes out garbled when writing to a CSV file

Solution: change the encoding from UTF-8 to utf-8-sig

The differences are as follows:

1. “UTF-8” uses the byte as its code unit, so its byte sequence is the same on every system. There is no byte order problem, so it does not need a BOM. As a result, when a file that starts with a BOM is read with the “utf-8” codec, the BOM is treated as part of the file content (it shows up as a leading U+FEFF character), which causes problems like the garbling described above

2. In “utf-8-sig”, sig stands for “signature”, that is, “UTF-8 with signature”. When a UTF-8 file with a BOM is read with “utf-8-sig”, the BOM is separated from the text content and dropped, which is the result we expect
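To make the two behaviours concrete, here is a minimal Python sketch that decodes the same bytes with both codecs. The sample text is made up for illustration.

```python
import codecs

# Hypothetical sample: the UTF-8 BOM followed by one CSV header line.
raw = codecs.BOM_UTF8 + "name,city\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # BOM kept as a leading U+FEFF character
as_sig = raw.decode("utf-8-sig")   # BOM recognised and dropped

print(repr(as_utf8))
print(repr(as_sig))
```

The “utf-8-sig” codec also decodes input that has no BOM at all, so it is a reasonable choice when you do not know whether a file carries one.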

with open('data.csv', 'w', encoding='utf_8_sig') as fp:
    fp.write('name,city\n')  # placeholder row; the codec writes the BOM (EF BB BF) first

If you want Excel to open a CSV file saved in UTF-8 correctly, you need to add the BOM (byte order mark) at the front of the file. If the receiver gets a byte stream starting with EF BB BF, it knows that the stream is UTF-8 encoded

So before writing the file's content, write the BOM first. See the code below

FileOutputStream fos = new FileOutputStream(new File(this.csvFileAbsolutePath));
byte[] bs = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }; // UTF-8 BOM
fos.write(bs); // write the BOM before any content
fos.write(…);
fos.close();

In this way, a CSV file with a BOM can be opened directly in Excel without garbled text
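The same approach can be sketched in Python, where the “utf-8-sig” codec writes the BOM for you. The helper name `write_csv_with_bom`, the file name, and the sample rows are assumptions made for this example.

```python
import csv
import os
import tempfile

def write_csv_with_bom(path, rows):
    """Write rows as CSV, letting the utf-8-sig codec emit the BOM."""
    # newline="" follows the csv module's documented advice for csv.writer.
    with open(path, "w", encoding="utf-8-sig", newline="") as fp:
        csv.writer(fp).writerows(rows)

# Hypothetical file location and sample rows.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
write_csv_with_bom(path, [["name", "city"], ["张三", "北京"]])

# Check the first three bytes on disk.
with open(path, "rb") as fp:
    head = fp.read(3)

print(head.hex())  # efbbbf
```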

The problem I had was this: I downloaded a CSV file and opened it in Excel, where the Chinese text was garbled, yet in Atom, Notepad++ and Notepad it displayed normally. After some research, I found that Excel does not recognize a Unicode file without a BOM header; that is, Excel opens CSV files as ANSI by default. So you need to add a BOM header

Meaning of BOM

BOM stands for byte order mark. The BOM was designed for UTF-16 and UTF-32, where it marks the byte order. Take UTF-16 as an example: its code unit is two bytes, so before interpreting UTF-16 text we must first know the byte order of each code unit. For example, the Unicode code point of “奎” is 594E, and that of “乙” is 4E59. If we receive the UTF-16 byte stream “594E”, is it “奎” or “乙”?
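The ambiguity can be reproduced in a few lines of Python. This sketch decodes the two bytes 59 4E from the example under both byte orders:

```python
data = bytes([0x59, 0x4E])

big = data.decode("utf-16-be")     # reads the code unit as 0x594E
little = data.decode("utf-16-le")  # reads the code unit as 0x4E59

print(big, hex(ord(big)))
print(little, hex(ord(little)))
```

The same two bytes give two different characters, which is why the byte order has to be known in advance or signalled in the stream.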

The Unicode specification recommends marking byte order with a BOM. UCS contains a character called “zero width no-break space”, whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification suggests transmitting the character “zero width no-break space” before the byte stream itself. If the receiver then sees FEFF, the byte stream is big-endian; if it sees FFFE, the byte stream is little-endian. This is why the character “zero width no-break space” is also called the BOM

UTF-8 uses the byte as its code unit, so it has no byte order problem
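As a small illustration of this rule, the sketch below prefixes U+FEFF to a one-character UTF-16 stream in each byte order, then lets Python's generic “utf-16” codec use that mark to pick the order. The sample character is arbitrary.

```python
text = "\u594e"  # sample character, code point 594E

# Put U+FEFF in front of the text, in each byte order.
be_stream = "\ufeff".encode("utf-16-be") + text.encode("utf-16-be")
le_stream = "\ufeff".encode("utf-16-le") + text.encode("utf-16-le")

print(be_stream.hex())  # feff594e
print(le_stream.hex())  # fffe4e59

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops it.
decoded_be = be_stream.decode("utf-16")
decoded_le = le_stream.decode("utf-16")
print(decoded_be == decoded_le == text)
```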


To expand on this:

UTF-8 is processed one byte at a time, so it is not affected by the CPU's endianness; to move to the next byte, the address is simply incremented by 1

UTF-16 and UTF-32 are processed in units of two bytes and four bytes respectively, that is, two or four bytes are read at a time. The order of the two or four bytes within a unit therefore has to be taken into account when storing the data or transmitting it over a network
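A quick way to see the unit sizes is to encode one character under each scheme. In this sketch the sample character is arbitrary; UTF-8 has a single byte sequence, while UTF-16 and UTF-32 each have a big-endian and a little-endian form.

```python
ch = "\u594e"  # sample character, code point 594E

names = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
encodings = {name: ch.encode(name) for name in names}

for name, data in encodings.items():
    print(f"{name:10} {data.hex()}")
```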


UTF-8 BOM

The UTF-8 BOM is also called the UTF-8 signature. UTF-8 does not need a BOM to indicate byte order, but it can use one to indicate the encoding. When a text program reads a byte stream starting with EF BB BF, it knows that the stream is UTF-8 encoded. Windows uses the BOM in this way to mark the encoding of text files


Supplement:

The UCS code of the “zero width no-break space” character is FEFF (written FE FF in big-endian byte order), and its UTF-8 encoding is EF BB BF
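The EF BB BF value follows from the UTF-8 three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. A short sketch of the arithmetic, checked against Python's own encoder:

```python
import codecs

cp = 0xFEFF  # code point of "zero width no-break space"

# Three-byte UTF-8 form: 1110xxxx 10xxxxxx 10xxxxxx
b1 = 0xE0 | (cp >> 12)           # top 4 bits
b2 = 0x80 | ((cp >> 6) & 0x3F)   # middle 6 bits
b3 = 0x80 | (cp & 0x3F)          # low 6 bits
manual = bytes([b1, b2, b3])

print(manual.hex())  # efbbbf
print(manual == "\ufeff".encode("utf-8") == codecs.BOM_UTF8)
```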


That is, a byte stream starting with EF BB BF indicates that it is UTF-8 encoded. But if the file is already known to be UTF-8, the three bytes EF BB BF serve no purpose. Therefore, it can be said that the presence of a BOM has no effect on UTF-8 itself

Disadvantages of including a BOM in a UTF-8 file

1. Impact on PHP

PHP was not designed with the BOM in mind. It does not ignore the three bytes EF BB BF at the beginning of a UTF-8 encoded file and instead handles them as ordinary text, which results in parsing errors

2. Errors when executing SQL scripts on Linux


Recently, during development, SQL files written on Windows kept reporting errors when executed on Linux

Whether the comment at the beginning of the file was in Chinese or English, or even removed entirely, the error SP2-0734: unknown command beginning "declare…" - rest of line ignored was reported
Here is the beginning of the file

--create tablespace
declare
v_tbs_name varchar2(200):='hytpdtsmsshistorydb';
begin

The error is as follows:

SP2-0734: unknown command beginning "?--create ..." - rest of line ignored.

PL/SQL procedure successfully completed.

I could not find a solution to this problem online, and the file encoding had already been changed to UTF-8, which puzzled me for a long time
Finally, I looked into the difference between UTF-8 with BOM and without BOM, tried converting the file to UTF-8 without BOM, and the error went away

After the change, the script ran normally whether the comment was in Chinese or English, or removed
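One plausible reading of the "?--create" in the error message is that the three BOM bytes sit in front of the comment marker, so the first line no longer starts with "--". The sketch below only illustrates that byte-level effect in Python; it does not reproduce the SQL client's behaviour, and the script text is a shortened stand-in for the file above.

```python
import codecs

# Shortened stand-in for the beginning of the SQL file.
script = "--create tablespace\ndeclare\n"

with_bom = codecs.BOM_UTF8 + script.encode("utf-8")
without_bom = script.encode("utf-8")

first_with = with_bom.splitlines()[0]
first_without = without_bom.splitlines()[0]

print(first_with.startswith(b"--"))     # False
print(first_without.startswith(b"--"))  # True
print(first_with[:5])
```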


Hard-won advice: it is best to save UTF-8 files without a BOM

The difference between “UTF-8” and “UTF-8 with BOM” is whether there is a BOM, that is, whether the file begins with U+FEFF
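Since the only difference is the first three bytes, a check is easy to script. A minimal sketch, where `has_utf8_bom` and the temporary file names are hypothetical:

```python
import codecs
import os
import tempfile

def has_utf8_bom(path):
    """Return True if the file starts with the bytes EF BB BF."""
    with open(path, "rb") as fp:
        return fp.read(3) == codecs.BOM_UTF8

# Create one file with a BOM and one without, in a temporary directory.
tmp = tempfile.mkdtemp()
with_bom = os.path.join(tmp, "with_bom.sql")
no_bom = os.path.join(tmp, "no_bom.sql")

with open(with_bom, "w", encoding="utf-8-sig") as fp:
    fp.write("--create tablespace\n")
with open(no_bom, "w", encoding="utf-8") as fp:
    fp.write("--create tablespace\n")

print(has_utf8_bom(with_bom), has_utf8_bom(no_bom))  # True False
```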

1. How to view the BOM on Linux: use the less command (other commands may not show it):

An extra <U+FEFF> appears at the beginning of the file.

2. How to remove BOM in UTF-8

Under Linux:

  (1)

1) Open the file in Vim

2) Run :set nobomb

3) Save and quit with :wq

  (2)

    dos2unix filename

This converts a Windows-format file to a UNIX/Linux-format file. The command not only converts Windows line endings to UNIX/Linux line endings, but also converts UTF-8 Unicode (with BOM) to UTF-8 Unicode

  PS:

In one awkward case, a UTF-8 Unicode (with BOM) file contained two <U+FEFF> marks. Whether method (1) or method (2) was used, it had to be run twice before the <U+FEFF> was completely removed

Under Windows: open the file with Notepad++, open the “Encoding” menu, select “Encode in UTF-8 without BOM”, and then save the file again
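If neither tool is at hand, the removal can also be scripted. This is a minimal Python sketch, not a replacement for the tools above; `strip_utf8_boms` and `remove_bom_in_place` are hypothetical helper names. It loops so that a file with two leading BOMs is cleaned in one pass.

```python
import codecs

def strip_utf8_boms(data: bytes) -> bytes:
    """Remove every UTF-8 BOM at the start of data (there may be more than one)."""
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data

def remove_bom_in_place(path):
    """Rewrite the file without leading BOMs; return the number of bytes removed."""
    with open(path, "rb") as fp:
        data = fp.read()
    cleaned = strip_utf8_boms(data)
    if cleaned != data:
        with open(path, "wb") as fp:
            fp.write(cleaned)
    return len(data) - len(cleaned)

# A doubled BOM is removed in a single call.
doubled = codecs.BOM_UTF8 * 2 + b"declare\n"
print(strip_utf8_boms(doubled))  # b'declare\n'
```

The file is read and written in binary mode so that line endings and the rest of the content are left untouched.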

Reference sources: https://www.cnblogs.com/Allen-rg/p/10536081.html