Unicode

# Basics

UTF stands for Unicode Transformation Format.

## Unicode, UTF-8, UTF-16, UTF-32

### Difference between Unicode and UTF-8

Source: W3Schools, HTML UTF-8.
• Unicode is a character set; UTF-8 is an encoding.
• Unicode maps characters to unique ID numbers (UCS code points). For example, ω corresponds to U+03C9.
• An encoding defines how those ID numbers are stored in binary. For example, U+03C9 (ω) is represented by the two bytes CF 89 in UTF-8, and by the 16-bit code unit 03 C9 in big-endian UTF-16 (FE FF 03 C9 when a BOM is prepended).
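In Python (used here only for illustration), both sides of this distinction can be inspected directly:

```python
# Character <-> code point (the Unicode / character-set side).
assert ord("ω") == 0x03C9          # ω has code point U+03C9
assert chr(0x03C9) == "ω"

# Code point -> bytes (the encoding side).
assert "ω".encode("utf-8") == b"\xcf\x89"       # two bytes: CF 89
assert "ω".encode("utf-16-be") == b"\x03\xc9"   # one 16-bit code unit: 03 C9
# The generic "utf-16" codec additionally prepends a BOM.
```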

Unicode.org: starting from Unicode 2.0, characters are assigned unique numbers in the range [U+0000, U+10FFFF]; that is, all characters fit in a 21-bit code space.
Therefore, every character is encoded as a sequence of 1-4 bytes (UTF-8), 1-2 16-bit code units (UTF-16), or a single 32-bit code unit (UTF-32).

### Unrecognized byte sequence

In UTF-8, every byte of the form 110xxxxx$_2$ must be followed by a byte of the form 10xxxxxx$_2$. A sequence such as 110xxxxx$_2$ 0xxxxxxx$_2$ is illegal and must never be generated.
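A decoder must reject such a sequence. A Python sketch:

```python
# C5 (110xxxxx) promises one continuation byte,
# but 41 ('A', 0xxxxxxx) is not a continuation byte.
bad = b"\xc5\x41"
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("illegal sequence:", e.reason)

# With errors="replace", the offending byte becomes U+FFFD.
assert bad.decode("utf-8", errors="replace") == "\ufffdA"
```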

### Package Unicode to ASCII

#### Method 1: UTF-8

UTF-8 keeps ASCII codes as is, although it transforms Latin-1 characters (those with code points above 127) into multi-byte sequences.

ś (U+015B) is represented by C5 9B. C5 in decimal is $12 \times 16 + 5 = 197 \gt 127$. UTF-8 uses byte values above 127 (7F in hex) only inside multi-byte sequences, so a byte above 127 is an indication that the current byte is not ASCII but part of a combo for a non-ASCII character.
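This ASCII transparency is easy to check in Python:

```python
# Pure ASCII text encodes to the identical bytes.
assert "Hello".encode("utf-8") == b"Hello"

# ś (U+015B) becomes two bytes, both above 127 (0x7F).
encoded = "ś".encode("utf-8")
assert encoded == b"\xc5\x9b"
assert all(byte > 0x7F for byte in encoded)
assert 0xC5 == 12 * 16 + 5 == 197   # C5 in decimal
```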

#### Method 2: Java or C style escapes

For example, ś (U+015B) is represented by: \u015B.
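Python supports the same `\u` escape syntax, and its `unicode_escape` codec produces it:

```python
# The \u escape denotes the code point directly in source text.
assert "\u015b" == "ś"
assert chr(0x015B) == "ś"

# Going the other way: escape non-ASCII characters into ASCII.
assert "ś".encode("unicode_escape") == b"\\u015b"
```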

#### Method 3: HTML or XML entity

For example, ś (U+015B) is represented by: &#x015B;. The "x" after "&#" denotes hex.

So &#x015B; becomes: ś.
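Python's `html` module can expand such numeric character references (it also handles named entities):

```python
import html

# Numeric character references: hex (&#x...;) and decimal (&#...;).
assert html.unescape("&#x015B;") == "ś"
assert html.unescape("&#347;") == "ś"      # 0x15B == 347

# Building the hex entity for an arbitrary character.
assert f"&#x{ord('ś'):04X};" == "&#x015B;"
```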

### Other notes

UTF-8 is most common on the web. UTF-16 is used by Java and Windows.

## BOM

BOM stands for Byte Order Mark. Its character code is U+FEFF. It is used to indicate the byte order and the encoding form.

In little-endian or big-endian UTF-16 or UTF-32, the BOM is written out in that specific encoding, so the receiver can infer the encoding and byte order from the BOM bytes.

• UTF-8: EF BB BF. See below for a derivation of why it is these bytes.
• UTF-16 Big Endian: FE FF
• UTF-16 Little Endian: FF FE
• UTF-32 Big Endian: 00 00 FE FF
• UTF-32 Little Endian: FF FE 00 00
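Python's `codecs` module exposes these BOMs as constants, and the generic `utf-16`/`utf-32` codecs prepend one automatically:

```python
import codecs

assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF32_BE == b"\x00\x00\xfe\xff"
assert codecs.BOM_UTF32_LE == b"\xff\xfe\x00\x00"

# The generic codec writes a BOM; the endian-specific ones do not.
assert "A".encode("utf-16")[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert "A".encode("utf-16-be") == b"\x00A"
```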

## UTF-8

The sequence to be used depends on the UCS code number of the character:

0x00000000 - 0x0000007F:
0xxxxxxx

0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x bit positions are filled with the bits of the character's code number in binary representation. (The 5- and 6-byte forms come from the original UTF-8 definition; since RFC 3629, UTF-8 is restricted to at most 4 bytes, covering U+0000 to U+10FFFF.)


Take the BOM (U+FEFF) as an example. It falls in the range 0x0800-0xFFFF, so the three-byte form applies:

F is 15 in decimal; in binary $15 = 2^3 + 2^2 + 2^1 + 2^0$, so 1111.
FEFF in binary: 1111 1110 1111 1111
Fit into UTF-8: 1110(1111) 10(111011) 10(111111)
Convert to hex: EF BB BF
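The bit-layout table and this worked example can be turned into a small encoder. This is a sketch restricted to the modern 4-byte limit; Python's built-in codec is the reference:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point following the UTF-8 bit-layout table."""
    if cp <= 0x7F:          # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond U+10FFFF (RFC 3629 limit)")

assert utf8_encode(0xFEFF) == b"\xef\xbb\xbf"       # the BOM, as derived above
assert utf8_encode(0x015B) == "ś".encode("utf-8")
assert utf8_encode(0x1F600) == "😀".encode("utf-8")
```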


## CESU-8

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four. wikipedia.org.
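Python has no built-in CESU-8 codec, but the scheme described above can be sketched directly, using `errors="surrogatepass"` to UTF-8-encode the individual surrogate code points:

```python
def cesu8_encode(text: str) -> bytes:
    """Sketch of CESU-8: BMP chars as UTF-8, supplementary chars as
    UTF-8-encoded UTF-16 surrogate pairs."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            out += ch.encode("utf-8")        # identical to UTF-8 in the BMP
        else:
            cp -= 0x10000                    # form the UTF-16 surrogate pair
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)

# U+10400: 6 bytes in CESU-8 (3 per surrogate), but only 4 in UTF-8.
assert cesu8_encode("\U00010400") == b"\xed\xa0\x81\xed\xb0\x80"
assert len("\U00010400".encode("utf-8")) == 4
```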