UTF-8 Character Encoding

UTF-8 uses a variable-length encoding of 1 to 4 bytes per character, as follows:

#bytes  #bits  Byte 1    Byte 2    Byte 3    Byte 4
1       7      0xxxxxxx  -         -         -
2       11     110xxxxx  10xxxxxx  -         -
3       16     1110xxxx  10xxxxxx  10xxxxxx  -
4       21     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
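
The bit layout in the table can be sketched as a small encoder. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is hypothetical, and the range checks (the upper limit of U+10FFFF and the excluded surrogate range) come from the Unicode standard rather than from this slide.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the bit layout in the table."""
    # Assumption (not from the slide): Unicode stops at U+10FFFF and
    # reserves U+D800..U+DFFF for UTF-16 surrogates.
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")

    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

Each branch picks the shortest form that can hold the value, so the result should agree with Python's built-in `chr(code_point).encode("utf-8")` for any valid code point.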

The 128 1-byte codes (0 to 127) are compatible with ASCII

The 2048 2-byte codes include most Latin-script alphabets

The 65536 3-byte codes include most Asian languages

The 2097152 4-byte codes include symbols and emojis and ... (Unicode only uses those up to U+10FFFF)
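
The counts quoted above are the number of bit patterns available in each form, that is, 2 raised to the number of payload bits (2^7 = 128 for the 1-byte form). The short check below recomputes them and confirms the encoded length of one sample character from each class, using Python's built-in encoder. The sample characters are chosen only for illustration.

```python
# Payload bits per encoded length, taken from the table.
payload_bits = {1: 7, 2: 11, 3: 16, 4: 21}

# Number of distinct bit patterns each form can hold.
pattern_counts = {n: 2 ** bits for n, bits in payload_bits.items()}

# One sample character per class (illustrative choices):
samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001F600",  # U+1F600, an emoji
]
encoded_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(pattern_counts)
print(encoded_lengths)
```

Not every pattern is a valid encoding: a value that fits in a shorter form must use the shorter form, so the number of usable codes in each multi-byte class is somewhat smaller than the raw count.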