Unicode, UTF-8, UTF-16, UTF-32
- UTF: Unicode Transformation Format.
Difference between Unicode and UTF-8 (see W3Schools: HTML UTF-8).
- Unicode is a character set. UTF-8 is an encoding.
- Unicode translates between characters and their unique ID numbers (UCS code points). For example, ω corresponds to U+03C9.
- An encoding defines how such code points are stored in binary. For example, U+03C9 (ω) is represented by the two bytes CF 89 in UTF-8, and by 03 C9 in big-endian UTF-16 (FE FF 03 C9 when a BOM is prepended).
Starting from Unicode 2.0, characters have unique numbers in the range [U+0000, U+10FFFF]; that is, all characters fit in a 21-bit code space.
Therefore, every character is encoded as a sequence of 1-4 bytes (UTF-8), 1-2 16-bit code units (UTF-16), or a single 32-bit code unit (UTF-32).
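A quick way to see these code-unit counts is to encode a few characters and inspect the lengths; a minimal Python sketch (the character choices are illustrative):

```python
# Compare the encoded size of characters from different code-point ranges.
for ch in ["A", "ω", "€", "𝄞"]:           # U+0041, U+03C9, U+20AC, U+1D11E
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")         # big-endian, no BOM prepended
    utf32 = ch.encode("utf-32-be")         # always one 32-bit unit per character
    print(f"U+{ord(ch):04X}: UTF-8={len(utf8)} bytes, "
          f"UTF-16={len(utf16)//2} code unit(s), UTF-32={len(utf32)//4} code unit")
```

U+1D11E (a supplementary character) takes 4 bytes in UTF-8 and 2 code units (a surrogate pair) in UTF-16, matching the limits above.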
Unrecognized byte sequence
In UTF-8, every byte of the form 110xxxxx$_2$ must be followed by a byte of the form 10xxxxxx$_2$. A sequence such as 110xxxxx$_2$ 0xxxxxxx$_2$ is illegal and must never be generated.
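A strict decoder rejects such sequences. For instance, in Python (the byte values are illustrative):

```python
# 0xC5 is 11000101: a lead byte that promises one continuation byte.
# 0x41 is 01000001 ('A'): not of the form 10xxxxxx, so the sequence is invalid.
try:
    b"\xc5\x41".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xc5 in position 0: invalid continuation byte
```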
Representing Unicode with ASCII
Method 1: UTF-8
UTF-8 keeps ASCII codes as-is, but transforms characters above 127 (e.g. Latin-1) into multi-byte sequences.
ś (U+015B) is represented by C5 9B. C5 equals $12 \times 16 + 5 = 197 \gt 127$. UTF-8 reserves byte values above 127 to signal that the byte is not ASCII but part of a multi-byte sequence for a non-ASCII character.
127 in hex is 7F.
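A short Python check of this behavior:

```python
# ASCII characters encode to themselves; non-ASCII becomes multi-byte.
print("abc".encode("utf-8"))        # b'abc' (bytes 61 62 63, all <= 0x7F)
print("\u015b".encode("utf-8"))     # b'\xc5\x9b' (ś -> C5 9B, both bytes > 0x7F)
print(b"\xc5\x9b".decode("utf-8"))  # 'ś'
```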
Method 2: Java- or C-style escapes
For example, ś (U+015B) is represented by: \u015B.
Method 3: HTML or XML entities
For example, ś (U+015B) is represented by &#x015B;. "&#x" here denotes hex, and the entity ends with ";".
So ś becomes: &#x015B;.
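Both escape styles can be produced programmatically; a small Python illustration:

```python
# Method 2: a \uXXXX escape denotes the code point directly in source code.
assert "\u015b" == "ś"

# Method 3: the xmlcharrefreplace error handler emits XML character
# references in decimal form.
print("ś".encode("ascii", "xmlcharrefreplace"))  # b'&#347;'  (347 == 0x15B)

# Hex form, built by hand:
print(f"&#x{ord('ś'):04X};")  # &#x015B;
```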
UTF-8 is the most common encoding on the web. UTF-16 is used internally by Java and Windows.
BOM
- Byte Order Mark, character code U+FEFF. It is used to indicate the byte order and the encoding form.
In UTF-16 and UTF-32, little-endian (LE) or big-endian (BE), the BOM is written in that specific encoding, so the receiver can infer the encoding from the leading bytes.
- UTF-8: EF BB BF. See below for the derivation of why it is defined this way.
- UTF-16 Big Endian: FE FF
- UTF-16 Little Endian: FF FE
- UTF-32 Big Endian: 00 00 FE FF
- UTF-32 Little Endian: FF FE 00 00
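A receiver can sniff these signatures from the first bytes of a stream. A minimal sketch using Python's codecs constants (the function name is illustrative):

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the longer signatures must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default

print(sniff_encoding(b"\xef\xbb\xbfhello"))         # utf-8
print(sniff_encoding(b"\xff\xfe\x00\x00\x41\x00"))  # utf-32-le, not utf-16-le
```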
The sequence to be used depends on the UCS code number of the character:
- 0x00000000 - 0x0000007F: 0xxxxxxx
- 0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
- 0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
- 0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- 0x00200000 - 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
- 0x04000000 - 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character code number in binary representation.
Take the BOM (U+FEFF) as an example. It falls in the range 0x0800-0xFFFF, so it uses the three-byte pattern:
- F in hex is 15 in decimal; in binary, $15 = 2^3 + 2^2 + 2^1 + 2^0$, so 1111.
- FEFF in binary: 1111 1110 1111 1111
- Fit into UTF-8: 1110(1111) 10(111011) 10(111111)
- Convert to hex: EF BB BF
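The table above can be implemented directly. A minimal sketch (function name is illustrative) restricted to the 1-3 byte cases used in this derivation; the 4-byte case follows the same pattern:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point by hand, following the bit-pattern table."""
    if cp <= 0x7F:                        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point above U+FFFF: extend with the 4-byte pattern")

print(utf8_encode(0xFEFF).hex(" "))           # ef bb bf
assert utf8_encode(0xFEFF) == "\ufeff".encode("utf-8")
```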
The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four. (Source: wikipedia.org.)
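To make the six-byte behavior concrete, here is a minimal CESU-8 encoder sketch (function name is illustrative, not a production implementation); it relies on Python's "surrogatepass" error handler to UTF-8-encode each surrogate code point:

```python
def cesu8_encode(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # BMP characters: identical to UTF-8.
            out += ch.encode("utf-8")
        else:
            # Supplementary characters: split into a UTF-16 surrogate pair,
            # then UTF-8-encode each surrogate (3 bytes each -> 6 bytes total).
            cp -= 0x10000
            hi = 0xD800 | (cp >> 10)
            lo = 0xDC00 | (cp & 0x3FF)
            out += chr(hi).encode("utf-8", "surrogatepass")
            out += chr(lo).encode("utf-8", "surrogatepass")
    return bytes(out)

g_clef = "\U0001D11E"                    # U+1D11E MUSICAL SYMBOL G CLEF
print(g_clef.encode("utf-8").hex(" "))   # f0 9d 84 9e        (4 bytes, UTF-8)
print(cesu8_encode(g_clef).hex(" "))     # ed a0 b4 ed b4 9e  (6 bytes, CESU-8)
```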