Character encodings, from ASCII to UTF-8
May 15, 2026
An é that turns into Ã© in a terminal, a "Latin-1" file that
actually contains Windows-1252, a hidden BOM at the start of a CSV that
breaks a script. It all comes from the same place: fitting letters into
bytes is a simple problem that got complicated the moment we tried to
leave English behind. ASCII held the line as long as it could, Latin-1
and Windows-1252 tried to push the walls, UTF-16 tried 16-bit units
before having to extend them with surrogate pairs, and UTF-8 ended up
solving what the others had left behind. Here's the rundown, with the
details that matter when you bump into them.
1. 8-bit encodings
ASCII
- 7 bits, 128 characters
- 0x00–0x1F: control characters (NUL, LF, CR, TAB, BEL...)
- 0x20–0x7E: printable (space included)
- 0x7F: DEL
- Universal base: every 8-bit encoding starts by extending ASCII
Latin-1 (ISO 8859-1)
- 8 bits, 256 characters = ASCII + 128
- 0x00–0x7F: identical to ASCII
- 0x80–0x9F: C1 controls (effectively empty)
- 0xA0–0xFF: printable Western European characters (é, ç, ñ, ü...)
Windows-1252 (CP1252)
- Identical to Latin-1 except 0x80–0x9F, which holds useful characters (€ at 0x80, typographic quotes, …, en and em dashes, •...)
- 0xA0–0xFF: strictly identical to Latin-1
- The reality of most text files created on Windows, even when the header claims "Latin-1"
2. Detecting Latin-1 vs Windows-1252
To tell them apart on an 8-bit buffer:
- All bytes < 0x80 → pure ASCII, both encodings agree
- High bytes only in 0xA0–0xFF → indistinguishable, but it doesn't matter: Latin-1 and Windows-1252 render identically on this range (treat as Latin-1)
- At least one byte in 0x80–0x9F → almost certainly Windows-1252: Latin-1 only has C1 control codes there, which real text files practically never contain

Unreliable on small buffers: a short sample can easily contain no byte
in 0x80–0x9F even though the rest of the file does, and the guess comes
out wrong.
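The three rules above fit in a single pass over the buffer. Here's a minimal sketch in C. The enum and the name guess_8bit_encoding are made up for illustration, and it assumes you already know the buffer holds some single-byte encoding (it makes no attempt to rule out UTF-8).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum { ENC_ASCII, ENC_LATIN1, ENC_CP1252 } enc8_t;

/* Applies the three rules: a byte in 0x80-0x9F decides for Windows-1252,
   otherwise any byte >= 0xA0 means "treat as Latin-1", otherwise ASCII. */
enc8_t guess_8bit_encoding(const uint8_t *buf, size_t len) {
    int saw_high = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F) {
            return ENC_CP1252;   /* C1 range in use: Windows-1252 */
        }
        if (buf[i] >= 0xA0) {
            saw_high = 1;        /* same glyphs in both encodings */
        }
    }
    return saw_high ? ENC_LATIN1 : ENC_ASCII;
}
```

The early return is why small samples are risky: the answer flips to Windows-1252 only if the sample happens to include one of those 32 byte values.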
3. UTF-16 and UTF-32
256 characters obviously isn't enough to cover every writing system on earth. Unicode first tried to fit everything into a fixed unit wider than a byte: 16 bits, then 32.
UTF-16
- Variable length: 2 or 4 bytes per character
- U+0000–U+FFFF (BMP, Basic Multilingual Plane): 2 bytes directly
- U+10000–U+10FFFF: 4 bytes via surrogate pair
- Endianness matters → a BOM is needed unless the byte order is declared out of band (e.g. a UTF-16BE or UTF-16LE label)
UCS-2 vs UTF-16
UCS-2 (1991), the ancestor of UTF-16, was fixed-width: 2 bytes per character, period. But the BMP - the first 65,536 Unicode code points - turned out to be insufficient to cover every writing system. UTF-16 (1996) inherited from UCS-2 and introduced surrogate pairs to extend coverage without breaking backward compatibility with existing UCS-2 implementations.
The surrogate zone (U+D800–U+DFFF)
These 2048 code points are permanently forbidden as Unicode characters - not just reserved for UTF-16's mechanics, but unusable in any encoding. A file containing a lone character in this range is by definition malformed.
Surrogate pair algorithm
For a non-BMP code point (U+10000–U+10FFFF), build two 16-bit units:
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
Why this works:
- Subtracting 0x10000 rebases the range so it starts at zero. The non-BMP range spans U+10000 to U+10FFFF, which is 0x100000 values = exactly 20 bits of information.
- 20 bits = 10 + 10: split into two 16-bit units, each with a fixed 6-bit prefix (0xD800 for the high, 0xDC00 for the low), leaving 10 useful bits per unit.
- High surrogate (U+D800–U+DBFF): 1024 values (top 10 bits of the offset).
- Low surrogate (U+DC00–U+DFFF): 1024 values (bottom 10 bits of the offset).
- 1024 × 1024 = 1,048,576 combinations → covers exactly the non-BMP range.
Example: 😀 (U+1F600)
Code point: U+1F600 = 128512
Range : non-BMP → surrogate pair required
Offset:
offset = 0x1F600 - 0x10000 = 0xF600 = 62976
Split into 20 bits (10 high + 10 low):
0xF600 on 20 bits: 0000 1111 0110 0000 0000
Split : 0000 1111 01 | 10 0000 0000
= 0x03D | 0x200
Compute the two 16-bit units:
high = 0xD800 + 0x03D = 0xD83D
low = 0xDC00 + 0x200 = 0xDE00
Result (bytes, big-endian): 0xD8 0x3D 0xDE 0x00
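The same arithmetic as a C sketch, with the inverse operation added so the round trip is visible. The function names are illustrative, and both functions assume their inputs are already in range (a non-BMP code point for encoding, a valid high/low pair for decoding).

```c
#include <stdint.h>

/* Splits a code point in U+10000..U+10FFFF into a surrogate pair.
   Assumption: the caller has checked the range. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low) {
    uint32_t offset = cp - 0x10000;                 /* 20 significant bits */
    *high = (uint16_t)(0xD800 + (offset >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (offset & 0x3FF));  /* bottom 10 bits */
}

/* Rebuilds the code point from a valid high/low surrogate pair. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low) {
    uint32_t top    = (uint32_t)(high - 0xD800);    /* 10 bits */
    uint32_t bottom = (uint32_t)(low - 0xDC00);     /* 10 bits */
    return 0x10000 + ((top << 10) | bottom);
}
```

Calling utf16_encode_pair(0x1F600, &high, &low) gives 0xD83D and 0xDE00, matching the worked example.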
Practical pitfalls
- "😀".length in JS and String.length() in Java (String.length in Kotlin) return 2, not 1 - the length counts 16-bit units, not code points. To get the code point count: [...s].length in JS, s.codePointCount(0, s.length()) in Java.
- An isolated surrogate (high without a following low, or low without a preceding high) is malformed text: the pair only makes sense together.
- The JVM accepts isolated surrogates in memory (Java Strings are sequences of 16-bit units, not real Unicode strings). But such a String has no valid UTF-8 form: s.getBytes("UTF-8") silently replaces each lone surrogate with ?, while a strict CharsetEncoder reports a CharacterCodingException - classic source of bugs in I/O pipelines.
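Counting code points and spotting lone surrogates is the same walk over the 16-bit units. A rough C sketch follows; the name utf16_count_code_points and the -1 error convention are choices made for this example, not anything standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts code points in an array of UTF-16 units.
   Returns -1 as soon as an isolated surrogate is found. */
long utf16_count_code_points(const uint16_t *units, size_t n) {
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 1 >= n || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF) {
                return -1;
            }
            i++;  /* the pair counts as one code point */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return -1;  /* low surrogate with no preceding high */
        }
        count++;
    }
    return count;
}
```

For the units 0x0041 0xD83D 0xDE00 it reports 2 code points where a plain length would say 3.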
Why UTF-16 still exists
UTF-8 (1992, Ken Thompson and Rob Pike) predates UTF-16 with surrogates
(~1996) by 4 years. UTF-16 survives today through lock-in: Windows
NT (1993) and Java (1995) had already bet on UCS-2 before the BMP
turned out to be insufficient. When Unicode 2.0 opened up the code space
beyond U+FFFF in 1996, these platforms couldn't widen their 16-bit
character type without breaking existing code - hence retrofitting
surrogate pairs onto UCS-2 to produce UTF-16. Without this legacy,
UTF-8 would likely be the norm everywhere.
UTF-32
- Always 4 bytes per character, fixed length
- Easy to index, but verbose; rare on disk, common in memory for processing
- Endianness matters → a BOM is needed unless the byte order is declared out of band (e.g. a UTF-32BE or UTF-32LE label)
4. Endianness
Byte storage order for multi-byte units (UTF-16, UTF-32).
Example with U+4E2D (中) in UTF-16:
- Big-endian: 4E 2D (most significant byte first)
- Little-endian: 2D 4E (least significant byte first)
Contexts:
- Big-endian: TCP/IP networking (network byte order), PowerPC, IBM mainframes
- Little-endian: x86, x64, ARM (as configured on virtually all current systems)
- Bi-endian: ARM (configurable), MIPS, PowerPC
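In code, the portable way to deal with byte order is to never rely on the host's layout and to assemble the bytes with shifts instead. A small C sketch, with helper names invented for the example:

```c
#include <stdint.h>

/* Writes one 16-bit unit most significant byte first. */
void put_u16_be(uint16_t unit, uint8_t out[2]) {
    out[0] = (uint8_t)(unit >> 8);
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Writes one 16-bit unit least significant byte first. */
void put_u16_le(uint16_t unit, uint8_t out[2]) {
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)(unit >> 8);
}

/* Reads one 16-bit unit; big_endian selects the byte order of the input. */
uint16_t get_u16(const uint8_t in[2], int big_endian) {
    return big_endian ? (uint16_t)((in[0] << 8) | in[1])
                      : (uint16_t)((in[1] << 8) | in[0]);
}
```

Because the shifts operate on values rather than on memory, these helpers give the same result on a little-endian and a big-endian machine.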
5. BOM (Byte Order Mark) - U+FEFF
The BOM is the character U+FEFF placed at the start of a file. Its byte
representation reveals the encoding and endianness.
| Encoding | BOM | Notes |
|---|---|---|
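| UTF-8 | EF BB BF | Optional, no endianness. Often written by Windows tools such as Notepad. Strip on read. |
| UTF-16 big-endian | FE FF | Needed unless the byte order is declared elsewhere |
| UTF-16 little-endian | FF FE | Needed unless the byte order is declared elsewhere |
| UTF-32 big-endian | 00 00 FE FF | Needed unless the byte order is declared elsewhere |
| UTF-32 little-endian | FF FE 00 00 | Needed unless the byte order is declared elsewhere |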
Classic trap: FF FE looks like a UTF-16 LE BOM, but if the next two
bytes are 00 00, it's actually a UTF-32 LE BOM. Read 4 bytes before
deciding.
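A BOM sniffer that respects this ordering might look like the C sketch below. The enum, the function name detect_bom and the bom_len out-parameter are all assumptions made for the example. The key point is that the 4-byte signatures are tested before the 2-byte ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
} bom_t;

/* Identifies a BOM at the start of b and stores its size in *bom_len,
   so the caller can skip it. */
bom_t detect_bom(const uint8_t *b, size_t len, size_t *bom_len) {
    /* 4-byte signatures first: FF FE 00 00 starts with the UTF-16 LE BOM. */
    if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) {
        *bom_len = 4; return BOM_UTF32_BE;
    }
    if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) {
        *bom_len = 4; return BOM_UTF32_LE;
    }
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        *bom_len = 3; return BOM_UTF8;
    }
    if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        *bom_len = 2; return BOM_UTF16_BE;
    }
    if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        *bom_len = 2; return BOM_UTF16_LE;
    }
    *bom_len = 0;
    return BOM_NONE;
}
```

Skipping bom_len bytes after the call also takes care of the "strip on read" advice for UTF-8 files.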
6. UTF-8
UTF-8 sidesteps all of the above: no wasted space on ASCII text, no endianness, no mandatory BOM.
- Base unit: 8 bits (1 byte)
- Variable length: 1 to 4 bytes per character
- The "8" refers to the base unit size, not the max character size
- ASCII-compatible: any ASCII file is also valid UTF-8
- No endianness issue since the unit is 1 byte
Byte patterns
Notation: the 0s and 1s are the fixed marker bits (they identify the
format); the xs are the code point's data bits that get placed into
those positions.
0xxxxxxx                            → 1 byte  (U+0000–U+007F, ASCII)
110xxxxx 10xxxxxx                   → 2 bytes (U+0080–U+07FF)
1110xxxx 10xxxxxx 10xxxxxx          → 3 bytes (U+0800–U+FFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → 4 bytes (U+10000–U+10FFFF)
10xxxxxx                            → continuation byte (never a leading byte)
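Those prefixes are all a decoder needs to know how many bytes to read. A minimal C sketch of the classification, where the name utf8_sequence_length and the 0 return value for "not a leading byte" are conventions picked for this example:

```c
#include <stdint.h>

/* Returns the sequence length announced by a leading byte,
   or 0 if the byte is a continuation byte or matches no pattern. */
int utf8_sequence_length(uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or 11111xxx */
}
```

This only matches bit patterns. A few bytes (0xC0, 0xC1, 0xF5 to 0xF7) fit a leading pattern yet never occur in valid UTF-8, so a strict validator needs extra checks on top.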
Bit manipulation - quick refresher
UTF-8 encoding and decoding both rely on two bitwise operations: masking
with AND (&) to isolate specific bits within a byte, and shifting
(>> right for encoding, << left for decoding) to reposition them.
Masking with AND (&)
A mask is a value you combine with AND to isolate bits. Rules:
1 AND x = x, 0 AND x = 0. The 1 bits of the mask "let through",
the 0 bits "erase".
Example - isolate the low 6 bits of 0xE9 using the mask 0x3F:
  1110 1001   (0xE9)
& 0011 1111   (0x3F = low 6-bit mask)
-----------
  0010 1001   → low 6 bits isolated
The mask 0x3F shows up everywhere in UTF-8 because each continuation
byte carries exactly 6 bits.
Right shift (>>)
x >> n pushes every bit n positions to the right. The low bits fall
off, the remaining bits move down into their place, and (for unsigned
values) zeros fill the vacated high positions.
Example - get what's left above the low 6 bits of 0xE9:
0xE9      = 1110 1001
0xE9 >> 6 = 0000 0011 → the 2 remaining high bits
In UTF-8, once you've extracted the low 6 bits for the continuation byte,
>> 6 lets you grab the bits above to place them in the leading byte.
Left shift (<<)
x << n pushes every bit n positions to the left. The high bits fall
off (past the type's width), the low bits move up.
Example - shift 0x03 back six places up:
0x03      = 0000 0011
0x03 << 6 = 1100 0000 → bits repositioned
In UTF-8, it's the inverse of >>: when decoding, you use it to lift
a leading byte's data bits back into the high position before merging
them with a continuation byte's bits.
Combining the two
To build the two UTF-8 bytes for é (U+00E9) in C:
uint8_t c = 0xE9;
uint8_t byte1 = 0xC0 | (c >> 6); // marker 110 + high bits
uint8_t byte2 = 0x80 | (c & 0x3F); // marker 10 + low 6 bits
- c & 0x3F isolates the low 6 bits
- c >> 6 isolates the bits above
- | 0xC0 prepends the marker 110 to the leading byte
- | 0x80 prepends the marker 10 to the continuation byte
It's the direct translation of the "high bits / low 6 bits" split into code.
Useful masks for decoding
Format detection (on the leading byte):
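- 0xC0 = marker 11000000 (start of a 2-byte sequence; test with (b & 0xE0) == 0xC0)
- 0xE0 = marker 11100000 (start of a 3-byte sequence; test with (b & 0xF0) == 0xE0)
- 0xF0 = marker 11110000 (start of a 4-byte sequence; test with (b & 0xF8) == 0xF0)
- 0x80 = marker 10000000 (continuation byte; test with (b & 0xC0) == 0x80)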
Data bit extraction:
- 0x1F = mask for the low 5 bits (00011111) - data bits of a leading byte in a 2-byte sequence
- 0x0F = mask for the low 4 bits (00001111) - data bits of a leading byte in a 3-byte sequence
- 0x07 = mask for the low 3 bits (00000111) - data bits of a leading byte in a 4-byte sequence
- 0x3F = mask for the low 6 bits (00111111) - data bits of a continuation byte
Decoding example
To decode the UTF-8 sequence 0xC3 0xA9:
Input bytes: 0xC3 0xA9
Step 1 - identify the format from the leading byte
0xC3 = 1100 0011
Prefix `110` → 2-byte sequence
Step 2 - extract the data bits from each byte
Leading byte : 0xC3 & 0x1F = 0000 0011 (5 bits)
Continuation byte : 0xA9 & 0x3F = 0010 1001 (6 bits)
Step 3 - reassemble by shifting the leading bits 6 places up
(0x03 << 6) | 0x29 = 1100 0000 | 0010 1001 = 1110 1001 = 0x00E9
Result: U+00E9 = é
In C:
uint8_t byte1 = 0xC3;
uint8_t byte2 = 0xA9;
uint32_t code_point = ((byte1 & 0x1F) << 6) | (byte2 & 0x3F); // 0xE9
For 3 bytes, the leading byte uses 0x0F (4 data bits) and the shifts
become << 12, << 6. For 4 bytes, 0x07 (3 data bits) with << 18,
<< 12, << 6. Same idea every time: mask each byte to extract its
data bits, then position them with << before merging with |.
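Generalized to all four lengths, the decoder can look like the C sketch below. Instead of spelling out << 18, << 12, << 6, it shifts an accumulator by 6 for every continuation byte, which places the bits in the same positions. The name utf8_decode and the "return 0 on error" convention are choices made for this example, and the sketch deliberately leaves out the stricter checks a production decoder needs (overlong forms, surrogates, values above U+10FFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is
   truncated or structurally malformed. */
size_t utf8_decode(const uint8_t *s, size_t n, uint32_t *cp) {
    if (n == 0) return 0;

    size_t len;
    uint32_t value;
    uint8_t lead = s[0];

    /* Step 1: identify the format and keep the leading byte's data bits. */
    if ((lead & 0x80) == 0x00)      { len = 1; value = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07; }
    else return 0;                  /* continuation byte or invalid lead */

    if (n < len) return 0;          /* truncated input */

    /* Steps 2 and 3: append 6 data bits per continuation byte. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not 10xxxxxx */
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *cp = value;
    return len;
}
```

Fed 0xC3 0xA9, it consumes 2 bytes and produces 0xE9, as in the worked example.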
Encoding a code point in UTF-8
To encode a Unicode code point as UTF-8 bytes, pick the format based on
the code point's value, then spread its bits into the x slots of the
format.
| Code point range | Significant bits | UTF-8 format | Bytes |
|---|---|---|---|
| U+0000–U+007F | 7 bits | 0xxxxxxx | 1 |
| U+0080–U+07FF | 11 bits | 110xxxxx 10xxxxxx | 2 |
| U+0800–U+FFFF | 16 bits | 1110xxxx 10xxxxxx 10xxxxxx | 3 |
| U+10000–U+10FFFF | 21 bits | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4 |
Every continuation byte always carries 6 bits; only the leading byte varies.
General algorithm:
- Find the code point's range → derive the byte count and format
- Write the code point in binary on the required number of significant bits (pad with leading zeros)
- Split into chunks and fill the x slots of the format (6 bits per continuation byte)
- Prepend the markers (110, 1110, 11110, 10) to each chunk
- Convert each binary byte to hexadecimal (1 hex digit per 4-bit group)
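In code, the steps above collapse into a range test followed by shifts and masks. Here's one possible C sketch. The name utf8_encode and the choice to return 0 for values that can't be encoded (surrogates, anything above U+10FFFF) are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate zone */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuations */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

Each branch is one row of the table: the leading byte takes whatever bits are left after the 6-bit chunks have been peeled off from the right.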
Encoding examples
1 byte: A (U+0041)
Code point: U+0041 = 65
Range : U+0000–U+007F → 1 byte, format 0xxxxxxx
Binary on 7 bits: 1000001
Fill the format:
Format: 0xxxxxxx
Bits : 1000001
Byte : 01000001
Convert to hex (1 hex digit = 4 bits):
01000001 = 0100 0001 = 0x41
Result: 0x41
ASCII and UTF-8 are identical on this range - that's the backwards compatibility.
2 bytes: é (U+00E9)
Code point: U+00E9 = 233
Range : U+0080–U+07FF → 2 bytes, format 110xxxxx 10xxxxxx
Binary on 11 bits: 000 1110 1001
Split : 00011 | 101001 (5 high bits, 6 low bits)
Fill the format (`110`/`10` come from the format, the bits fill the `x`s):
Format: 110xxxxx 10xxxxxx
Bits : 00011 101001
Bytes : 11000011 10101001
Convert to hex:
11000011 = 1100 0011 = 0xC3
10101001 = 1010 1001 = 0xA9
Result: 0xC3 0xA9
In Latin-1, é = 0xE9 (1 byte). When Ã© shows up in output, that's
UTF-8 (0xC3 0xA9) being read as if it were Latin-1: 0xC3 → Ã,
0xA9 → ©.
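The misreading is easy to reproduce. Latin-1 byte values are identical to the first 256 code points, so "treating bytes as Latin-1 and saving as UTF-8" is just the 2-byte encoding applied to every byte at or above 0x80. The C sketch below uses an illustrative function name and assumes the output buffer holds at least 2 × n bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Re-encodes n Latin-1 bytes as UTF-8.
   out must have room for 2 * n bytes. Returns the bytes written. */
size_t latin1_to_utf8(const uint8_t *in, size_t n, uint8_t *out) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[j++] = in[i];                           /* ASCII: unchanged */
        } else {
            out[j++] = (uint8_t)(0xC0 | (in[i] >> 6));  /* 110xxxxx */
            out[j++] = (uint8_t)(0x80 | (in[i] & 0x3F)); /* 10xxxxxx */
        }
    }
    return j;
}
```

Give it the already-UTF-8 bytes 0xC3 0xA9 and out comes 0xC3 0x83 0xC2 0xA9, which is the UTF-8 for Ã©: the double-encoding bug in four bytes.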
3 bytes: 中 (U+4E2D)
Code point: U+4E2D = 20013
Range : U+0800–U+FFFF → 3 bytes, format 1110xxxx 10xxxxxx 10xxxxxx
Binary on 16 bits: 0100 1110 0010 1101
Split : 0100 | 111000 | 101101 (4 bits, 6 bits, 6 bits)
Fill the format:
Format: 1110xxxx 10xxxxxx 10xxxxxx
Bits : 0100 111000 101101
Bytes : 11100100 10111000 10101101
Convert to hex:
11100100 = 1110 0100 = 0xE4
10111000 = 1011 1000 = 0xB8
10101101 = 1010 1101 = 0xAD
Result: 0xE4 0xB8 0xAD
Common CJK characters (Chinese, Japanese, Korean) all live in 3 bytes.
4 bytes: 😀 (U+1F600)
Code point: U+1F600 = 128512
Range : U+10000–U+10FFFF → 4 bytes, format 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Binary on 21 bits: 0 0001 1111 0110 0000 0000
Split : 000 | 011111 | 011000 | 000000 (3 bits, 6 bits, 6 bits, 6 bits)
Fill the format:
Format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Bits : 000 011111 011000 000000
Bytes : 11110000 10011111 10011000 10000000
Convert to hex:
11110000 = 1111 0000 = 0xF0
10011111 = 1001 1111 = 0x9F
10011000 = 1001 1000 = 0x98
10000000 = 1000 0000 = 0x80
Result: 0xF0 0x9F 0x98 0x80
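Most emoji code points take 4 bytes (a few older ones, such as U+263A and U+2764, sit in the BMP and take 3) - same goes for everything else in the supplementary Unicode planes (historic scripts like cuneiform, mathematical alphanumeric symbols, etc.).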