

Unicode transformation format.

Unicode, UTF-8, UTF-16, UTF-32

Difference between Unicode and UTF-8

W3School: HTML UTF8. Starting from Unicode 2.0, characters are assigned unique numbers (code points) in the range [U+0000, U+10FFFF]; that is, every code point fits in a 21-bit code space.
Therefore, every character is encoded by a sequence of 1-4 bytes (UTF-8), 1-2 16-bit code units (UTF-16), or a single 32-bit code unit (UTF-32).
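As a quick check of these code-unit counts, here is a minimal Python sketch. The helper name `code_units` and the sample characters are illustrative choices, not part of the notes; only the built-in `str.encode` is used.

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for one character."""
    utf8 = len(ch.encode("utf-8"))
    # The "-be" variants are used so that no BOM is prepended to the output.
    utf16 = len(ch.encode("utf-16-be")) // 2
    utf32 = len(ch.encode("utf-32-be")) // 4
    return utf8, utf16, utf32

# Sample characters chosen for illustration: ASCII, Latin Extended, BMP, supplementary.
for ch in ["A", "ś", "€", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The supplementary character is the only one of the four that needs two UTF-16 code units (a surrogate pair) and four UTF-8 bytes.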

Unrecognized byte sequence

In UTF-8, every byte of the form 110xxxxx$_2$ must be followed by a byte of the form 10xxxxxx$_2$. A sequence such as 110xxxxx$_2$ 0xxxxxxx$_2$ is illegal and must never be generated.
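A strict decoder rejects such a sequence. The sketch below uses Python's built-in UTF-8 decoder as the validator; the helper name `is_valid_utf8` and the two sample byte pairs are assumptions made for illustration.

```python
def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xC5 = 11000101 is a lead byte of the form 110xxxxx.
print(is_valid_utf8(bytes([0xC5, 0x9B])))  # followed by 10011011 (10xxxxxx)
print(is_valid_utf8(bytes([0xC5, 0x41])))  # followed by 01000001 (0xxxxxxx)
```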

Packing Unicode into ASCII

Method 1: UTF-8

UTF-8 keeps ASCII codes as is, but it transforms the Latin-1 characters above 127 (like every other non-ASCII character) into multi-byte sequences.

ś (U+015B) is represented by C5 9B. C5 equals $12 \times 16 + 5 = 197 \gt 127$. Every byte of a multi-byte UTF-8 sequence is larger than 127, which signals that the byte is not an ASCII character but part of a sequence encoding a non-ASCII one.

127 is represented as 7F.
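The byte values can be inspected directly. This is a small sketch; the helper name `byte_report` and the test string "aś" are illustrative assumptions.

```python
def byte_report(text):
    """Return (hex, decimal, is_ascii) for every UTF-8 byte of text."""
    return [(f"{b:02X}", b, b <= 0x7F) for b in text.encode("utf-8")]

# "a" stays a single ASCII byte; "ś" becomes two bytes, both above 127.
for row in byte_report("aś"):
    print(row)
```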

Method 2: Java or C style escapes

For example, ś (U+015B) is represented by: \u015B.
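Python can produce the same style of escape with the standard `backslashreplace` error handler, shown here as a sketch. Note that Python emits the hex digits in lowercase, and the round trip through the `unicode_escape` codec is only safe because the escaped text is pure ASCII.

```python
text = "ś"

# Encode to ASCII, replacing anything unencodable with a backslash escape.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")
print(escaped)

# Decode the escape back to the original character.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored == text)
```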

Method 3: HTML or XML entity

For example, ś (U+015B) is represented by: &#x015B;. The "&#x" prefix introduces a hexadecimal character reference, and the semicolon ends it.

So ś becomes: &#x015B;.
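A sketch of the conversion in both directions. The helper `to_hex_entities` is a hypothetical name written for this note; `html.unescape` is the standard-library function that resolves character references.

```python
import html

def to_hex_entities(text):
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)

print(to_hex_entities("ś"))       # ASCII-only markup for the character
print(html.unescape("&#x015B;"))  # back to the original character
```

Python's built-in `xmlcharrefreplace` error handler does the same job but emits decimal references (`&#347;` for ś), which are equivalent in HTML and XML.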

Other notes

UTF-8 is the most common encoding on the web. UTF-16 is used internally by Java and Windows.


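Byte Order Mark (BOM): character code U+FEFF. It is used to signal the byte order and the encoding form.

In little-endian (LE) or big-endian (BE) UTF-16 or UTF-32, the BOM is serialized in the stream's own encoding and byte order, so a receiver can infer both from the first bytes: FE FF for UTF-16BE, FF FE for UTF-16LE, 00 00 FE FF for UTF-32BE, and FF FE 00 00 for UTF-32LE.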
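A receiver could sniff the encoding along these lines. This is a minimal sketch: the function name `sniff_bom` and the returned codec names are choices made for illustration, while the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 marks are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))
print(sniff_bom(b"hi"))
```

One caveat of this approach: UTF-16 LE text whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by the BOM alone.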


StackOverflow: manually converting Unicode to UTF-8.

The sequence to be used depends on the UCS code number of the character:

0x00000000 - 0x0000007F:
    0xxxxxxx

0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

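The xxx bit positions are filled with the bits of the character code number in binary representation. The five- and six-byte forms come from the original UTF-8 definition; because Unicode stops at U+10FFFF, current UTF-8 (RFC 3629) uses at most four bytes.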

Take the BOM (U+FEFF) as an example. It falls in the range 0x0800 - 0xFFFF, so it needs three bytes:

F in decimal is 15; in binary, $15 = 2^3 + 2^2 + 2^1 + 2^0$, so 1111.
FEFF in binary: 1111 1110 1111 1111
Fit into UTF-8: 1110(1111) 10(111011) 10(111111)
Convert to hex: EF BB BF
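The same bit-filling can be written as a short function. This is a sketch, not a reference implementation: the name `utf8_encode` is hypothetical, and it covers only the one- to four-byte forms, rejecting surrogates and anything above U+10FFFF.

```python
def utf8_encode(cp):
    """Encode a single code point to UTF-8 by filling the x bit positions."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0xFEFF).hex(" ").upper())  # the BOM worked through by hand above
```

Comparing the result with Python's built-in `chr(cp).encode("utf-8")` is an easy way to cross-check the hand calculation.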


The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four.
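The difference shows up as soon as a supplementary character is encoded. Python has no built-in CESU-8 codec, so the sketch below assembles one from the description above: the function name `cesu8_encode` is hypothetical, and the standard `surrogatepass` error handler is used to let the UTF-8 codec emit the three-byte form of each surrogate. U+10400 is an arbitrary sample character.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: BMP as in UTF-8, supplementary via surrogate pairs."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: identical to UTF-8.
            out += ch.encode("utf-8")
        else:
            # Split into a UTF-16 surrogate pair, then encode each half in 3 bytes.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)
            low = 0xDC00 | (v & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)

sample = "\U00010400"
print(cesu8_encode(sample).hex(" ").upper())     # six bytes
print(sample.encode("utf-8").hex(" ").upper())   # four bytes
```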