Character encodings, from ASCII to UTF-8

May 15, 2026

encoding, unicode, utf-8

An é that turns into Ã© in a terminal, a "Latin-1" file that actually contains Windows-1252, a hidden BOM at the start of a CSV that breaks a script. It all comes from the same place: fitting letters into bytes is a simple problem that got complicated the moment we tried to leave English behind. ASCII held the line as long as it could, Latin-1 and Windows-1252 tried to push the walls, UTF-16 tried 16-bit units before having to extend them with surrogate pairs, and UTF-8 ended up solving what the others had left behind. Here's the rundown, with the details that matter when you bump into them.

1. 8-bit encodings

ASCII

7 bits, 128 characters
0x00–0x1F: control characters (NUL, LF, CR, TAB, BEL...)
0x20–0x7E: printable (space included)
0x7F: DEL
Near-universal base: almost every 8-bit encoding in common use starts by extending ASCII (EBCDIC is the notable exception)

Latin-1 (ISO 8859-1)

8 bits, 256 characters = ASCII + 128
0x00–0x7F: identical to ASCII
0x80–0x9F: C1 control characters (defined, but almost never present in real text)
0xA0–0xFF: printable Western European characters (é, à, ç, ñ, ü...)

Windows-1252 (CP1252)

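Identical to Latin-1 except for 0x80–0x9F, which holds printable characters instead of C1 controls (€ at 0x80, typographic quotes, …, –, —, ™...); five positions in that range (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned
0xA0–0xFF: strictly identical to Latin-1
In practice, most 8-bit text files created on Windows are Windows-1252, even when the header claims "Latin-1"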

2. Detecting Latin-1 vs Windows-1252

To tell them apart on an 8-bit buffer:

Every byte < 0x80 → pure ASCII, both encodings agree
High bytes in 0xA0–0xFF only → indistinguishable, but it doesn't matter: Latin-1 and Windows-1252 render identically on this range (treat as Latin-1)
At least one byte in 0x80–0x9F → almost certainly Windows-1252: Latin-1 only has C1 control characters there, and real text practically never contains them

Unreliable on small buffers: it's easy to miss the 0x80–0x9F range entirely and guess wrong.
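The three rules above can be sketched as a single pass over the buffer. This is a minimal sketch, not a full charset detector: the names guess_8bit and guess_8bit_encoding are made up for the example, and it assumes the buffer is already known to be in a single-byte encoding (it makes no attempt to rule out UTF-8).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    GUESS_ASCII,        /* every byte is < 0x80 */
    GUESS_LATIN1,       /* high bytes present, all in 0xA0-0xFF */
    GUESS_WINDOWS_1252  /* at least one byte in 0x80-0x9F */
} guess_8bit;

/* Applies the three rules to a buffer assumed to hold single-byte text. */
guess_8bit guess_8bit_encoding(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    guess_8bit guess = GUESS_ASCII;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return GUESS_WINDOWS_1252;  /* decisive, no need to read on */
        if (buf[i] >= 0xA0)
            guess = GUESS_LATIN1;       /* may still be upgraded later */
    }
    return guess;
}
```

A buffer holding "café" in Latin-1 (0x63 0x61 0x66 0xE9) comes back as GUESS_LATIN1; put a € (0x80) anywhere in it and the answer flips to GUESS_WINDOWS_1252. The early return is also why short samples mislead: the decisive byte may simply sit further down the file.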

3. UTF-16 and UTF-32

256 characters obviously isn't enough to cover every writing system on earth. Unicode first tried to fit everything into a fixed unit wider than a byte: 16 bits, then 32.

UTF-16

Variable length: 2 or 4 bytes per character
U+0000–U+FFFF (BMP, Basic Multilingual Plane): 2 bytes directly
U+10000–U+10FFFF: 4 bytes via surrogate pair
Endianness matters → needs a BOM, or an explicit UTF-16BE / UTF-16LE label

UCS-2 vs UTF-16

UCS-2 (1991), the ancestor of UTF-16, was fixed-width: 2 bytes per character, period. But the BMP - the first 65,536 Unicode code points - turned out to be insufficient to cover every writing system. UTF-16 (1996) inherited from UCS-2 and introduced surrogate pairs to extend coverage without breaking backward compatibility with existing UCS-2 implementations.

The surrogate zone (U+D800–U+DFFF)

These 2048 code points are permanently reserved: they will never be assigned to characters, and they are not just set aside for UTF-16's mechanics, since UTF-8 and UTF-32 must refuse to encode them too. Text containing a lone code unit in this range is by definition malformed.

Surrogate pair algorithm

For a non-BMP code point (U+10000–U+10FFFF), build two 16-bit units:

offset = code_point - 0x10000
high   = 0xD800 + (offset >> 10)
low    = 0xDC00 + (offset & 0x3FF)

Why this works:

Subtracting 0x10000 rebases the range so that it starts at zero. The non-BMP range spans U+10000 to U+10FFFF, which is 0x100000 values = exactly 20 bits of information.
20 bits = 10 + 10: split into two 16-bit units, each with a fixed 6-bit prefix (0xD800 for the high, 0xDC00 for the low), leaving 10 useful bits per unit.
High surrogate (U+D800–U+DBFF): 1024 values (top 10 bits of the offset).
Low surrogate (U+DC00–U+DFFF): 1024 values (bottom 10 bits of the offset).
1024 × 1024 = 1,048,576 combinations → covers exactly the non-BMP range.
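The formula and its bit budget translate almost line for line into C. A minimal sketch follows; the function name utf16_surrogates is an assumption made for the example, and the caller is expected to pass a non-BMP code point only.

```c
#include <assert.h>
#include <stdint.h>

/* Splits a non-BMP code point (U+10000..U+10FFFF) into a surrogate pair. */
void utf16_surrogates(uint32_t code_point, uint16_t *high, uint16_t *low)
{
    assert(code_point >= 0x10000 && code_point <= 0x10FFFF);
    uint32_t offset = code_point - 0x10000;         /* 20-bit value */
    *high = (uint16_t)(0xD800 + (offset >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (offset & 0x3FF));  /* bottom 10 bits */
}
```

The two ends of the range are a quick sanity check: U+10000 gives D800 DC00 and U+10FFFF gives DBFF DFFF, the first and last possible pairs.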

Example: `😀` (U+1F600)

Code point: U+1F600 = 128512
Range     : non-BMP → surrogate pair required

Offset:
  offset = 0x1F600 - 0x10000 = 0xF600 = 62976

Split into 20 bits (10 high + 10 low):
  0xF600 on 20 bits: 0000 1111 0110 0000 0000
  Split            : 0000 1111 01 | 10 0000 0000
                   = 0x03D         | 0x200

Compute the two 16-bit units:
  high = 0xD800 + 0x03D = 0xD83D
  low  = 0xDC00 + 0x200 = 0xDE00

Result (bytes, big-endian): 0xD8 0x3D 0xDE 0x00

Practical pitfalls

"😀".length in JS and "😀".length() in Java ("😀".length in Kotlin) return 2, not 1 - the length counts 16-bit units, not code points. To get the code point count: [...s].length in JS, s.codePointCount(0, s.length()) in Java.
An isolated surrogate (high without a following low, or low without a preceding high) is malformed text: the pair only makes sense together.
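The JVM accepts isolated surrogates in memory (Java Strings are sequences of 16-bit units, not real Unicode strings). But converting such a String to UTF-8 goes wrong one way or another: s.getBytes("UTF-8") silently substitutes ? for the lone surrogate, a CharsetEncoder configured to report errors throws, and encoders that don't check (Java's modified UTF-8 in DataOutputStream.writeUTF, for one) emit bytes that are not valid UTF-8 - classic source of bugs in I/O pipelines.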
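Both pitfalls, the unit count versus the code point count and the isolated surrogate, come down to one walk over the 16-bit units. The sketch below is illustrative only: utf16_code_point_count is a hypothetical helper, and returning -1 for malformed input is a choice made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts code points in an array of UTF-16 code units.
   Returns -1 if the array contains an isolated surrogate. */
long utf16_code_point_count(const uint16_t *units, size_t len)
{
    assert(units != NULL || len == 0);
    long count = 0;
    size_t i = 0;
    while (i < len) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {           /* high surrogate */
            if (i + 1 >= len ||
                units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return -1;                          /* no low follows */
            i += 2;                                 /* one pair, one code point */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return -1;                              /* low without a high */
        } else {
            i += 1;                                 /* BMP code point */
        }
        count++;
    }
    return count;
}
```

For the units 0041 D83D DE00 4E2D the array length is 4 but the function returns 3, which is the same gap as "😀".length versus the code point count.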

Why UTF-16 still exists

UTF-8 (1992, Ken Thompson and Rob Pike) predates UTF-16 with surrogates (~1996) by 4 years. UTF-16 survives today through lock-in: Windows NT (1993) and Java (1995) had already bet on UCS-2 before the BMP turned out to be insufficient. When Unicode 2.0 extended the code space beyond U+FFFF in 1996, these platforms couldn't break their APIs and ABIs - hence retrofitting surrogate pairs onto UCS-2 to produce UTF-16. Without this legacy, UTF-8 would likely be the norm everywhere.

UTF-32

Always 4 bytes per character, fixed length
Easy to index, but verbose; rare on disk, common in memory for processing
Endianness matters → needs a BOM, or an explicit UTF-32BE / UTF-32LE label

4. Endianness

Byte storage order for multi-byte units (UTF-16, UTF-32).

Example with U+4E2D (中) in UTF-16:

Big-endian: 4E 2D (most significant byte first)
Little-endian: 2D 4E (least significant byte first)

Contexts:

Big-endian: TCP/IP networking (network byte order), IBM mainframes
Little-endian: x86, x86-64, ARM as it is configured in practice
Bi-endian (configurable): ARM, MIPS, PowerPC
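Writing the bytes out with shifts, rather than copying a uint16_t from memory, gives the same result whatever the endianness of the machine running the code. A minimal sketch, with put_u16_be and put_u16_le as names invented for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes one 16-bit unit as two bytes, most significant byte first. */
void put_u16_be(uint16_t unit, uint8_t out[2])
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit >> 8);
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Writes one 16-bit unit as two bytes, least significant byte first. */
void put_u16_le(uint16_t unit, uint8_t out[2])
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)(unit >> 8);
}
```

With 0x4E2D as input, put_u16_be fills the buffer with 4E 2D and put_u16_le with 2D 4E, matching the 中 example.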

5. BOM (Byte Order Mark) - U+FEFF

The BOM is the character U+FEFF placed at the start of a file. Its byte representation reveals the encoding and endianness.

Encoding	BOM	Notes
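UTF-8	`EF BB BF`	Optional, no endianness. Often written by Windows tools such as Notepad. Strip on read.
UTF-16 big-endian	`FE FF`	Needed unless the encoding is declared as UTF-16BE
UTF-16 little-endian	`FF FE`	Needed unless the encoding is declared as UTF-16LE
UTF-32 big-endian	`00 00 FE FF`	Needed unless the encoding is declared as UTF-32BE
UTF-32 little-endian	`FF FE 00 00`	Needed unless the encoding is declared as UTF-32LE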

Classic trap: FF FE looks like a UTF-16 LE BOM, but if the next two bytes are 00 00, it's actually a UTF-32 LE BOM. Read 4 bytes before deciding.
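One way to avoid the trap is to test the 4-byte marks before the 2-byte ones. The sketch below assumes hypothetical names (bom_kind, detect_bom) and follows the rule stated above; strictly speaking FF FE 00 00 could also be a UTF-16 LE BOM followed by U+0000, which the sketch treats as too unlikely to matter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
} bom_kind;

/* Identifies a BOM at the start of b and stores its size in *bom_len,
   so the caller knows how many bytes to skip. */
bom_kind detect_bom(const uint8_t *b, size_t len, size_t *bom_len)
{
    assert(b != NULL || len == 0);
    assert(bom_len != NULL);
    /* 4-byte marks first: FF FE 00 00 also starts with FF FE. */
    if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) {
        *bom_len = 4; return BOM_UTF32_BE;
    }
    if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) {
        *bom_len = 4; return BOM_UTF32_LE;
    }
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        *bom_len = 3; return BOM_UTF8;
    }
    if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        *bom_len = 2; return BOM_UTF16_BE;
    }
    if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        *bom_len = 2; return BOM_UTF16_LE;
    }
    *bom_len = 0;
    return BOM_NONE;
}
```

Returning the BOM length alongside the kind makes "strip on read" a one-liner for the caller: skip *bom_len bytes and decode the rest.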

6. UTF-8

UTF-8 sidesteps all of the above: no wasted space on ASCII text, no endianness, no mandatory BOM.

Base unit: 8 bits (1 byte)
Variable length: 1 to 4 bytes per character
The "8" refers to the base unit size, not the max character size
ASCII-compatible: any ASCII file is also valid UTF-8
No endianness issue since the unit is 1 byte

Byte patterns

Notation: the 0s and 1s are the fixed marker bits (they identify the format); the xs are the code point's data bits that get placed into those positions.

0xxxxxxx                              → 1 byte  (U+0000–007F, ASCII)
110xxxxx 10xxxxxx                     → 2 bytes (U+0080–07FF)
1110xxxx 10xxxxxx 10xxxxxx            → 3 bytes (U+0800–FFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   → 4 bytes (U+10000–10FFFF)
10xxxxxx                              → continuation byte
                                        (never a leading byte)
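The patterns can be read straight off a byte by masking its top bits. A minimal sketch, with utf8_sequence_length as a name chosen for the example; it only matches the bit patterns listed above and does not reject lead bytes that well-formed UTF-8 never uses (0xC0, 0xC1, 0xF5 and above).

```c
#include <assert.h>
#include <stdint.h>

/* Returns the sequence length a byte announces (1 to 4),
   or 0 if the byte cannot start a sequence. */
int utf8_sequence_length(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* 10xxxxxx (continuation) or 11111xxx */
}
```

Because a continuation byte is always recognisable as such, a reader dropped into the middle of a stream can skip bytes that return 0 until it finds the start of the next character.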

Bit manipulation - quick refresher

UTF-8 encoding and decoding both rely on two bitwise operations: masking with AND (&) to isolate specific bits within a byte, and shifting (>> right for encoding, << left for decoding) to reposition them.

Masking with `AND` (`&`)

A mask is a value you combine with AND to isolate bits. Rules: 1 AND x = x, 0 AND x = 0. The 1 bits of the mask "let through", the 0 bits "erase".

Example - isolate the low 6 bits of 0xE9 using the mask 0x3F:

  1110 1001   (0xE9)
& 0011 1111   (0x3F = low 6-bit mask)
-----------
  0010 1001   → low 6 bits isolated

The mask 0x3F shows up everywhere in UTF-8 because each continuation byte carries exactly 6 bits.

Right shift (`>>`)

x >> n pushes every bit n positions to the right. The low bits fall off and, for unsigned values, zeros fill in from the left.

Example - get what's left above the low 6 bits of 0xE9:

0xE9       = 1110 1001
0xE9 >> 6  = 0000 0011   → the 2 remaining high bits

In UTF-8, once you've extracted the low 6 bits for the continuation byte, >> 6 lets you grab the bits above to place them in the leading byte.

Left shift (`<<`)

x << n pushes every bit n positions to the left. The high bits fall off (past the type's width) and zeros fill in from the right.

Example - shift 0x03 six places up:

0x03       = 0000 0011
0x03 << 6  = 1100 0000   → bits repositioned

In UTF-8, it's the inverse of >>: when decoding, you use it to lift a leading byte's data bits back into the high position before merging them with a continuation byte's bits.

Combining the two

To build the two UTF-8 bytes for é (U+00E9) in C:

#include <stdint.h>

uint8_t c = 0xE9;                    // é fits in one byte; a general encoder would take a uint32_t

uint8_t byte1 = 0xC0 | (c >> 6);     // marker 110 + high bits    → 0xC3
uint8_t byte2 = 0x80 | (c & 0x3F);   // marker 10  + low 6 bits   → 0xA9

c & 0x3F isolates the low 6 bits
c >> 6 isolates the bits above
| 0xC0 prepends the marker 110 to the leading byte
| 0x80 prepends the marker 10 to the continuation byte

It's the direct translation of the "high bits / low 6 bits" split into code.

Useful masks for decoding

Format detection (on the leading byte):

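0xC0 = marker 11000000 (start of a 2-byte sequence), tested with (b & 0xE0) == 0xC0
0xE0 = marker 11100000 (start of a 3-byte sequence), tested with (b & 0xF0) == 0xE0
0xF0 = marker 11110000 (start of a 4-byte sequence), tested with (b & 0xF8) == 0xF0
0x80 = marker 10000000 (continuation byte), tested with (b & 0xC0) == 0x80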

Data bit extraction:

0x1F = mask for the low 5 bits (00011111) - data bits of a leading byte in a 2-byte sequence
0x0F = mask for the low 4 bits (00001111) - data bits of a leading byte in a 3-byte sequence
0x07 = mask for the low 3 bits (00000111) - data bits of a leading byte in a 4-byte sequence
0x3F = mask for the low 6 bits (00111111) - data bits of a continuation byte

Decoding example

To decode the UTF-8 sequence 0xC3 0xA9:

Input bytes: 0xC3 0xA9

Step 1 - identify the format from the leading byte
  0xC3 = 1100 0011
  Prefix `110` → 2-byte sequence

Step 2 - extract the data bits from each byte
  Leading byte      : 0xC3 & 0x1F = 0000 0011   (5 bits)
  Continuation byte : 0xA9 & 0x3F = 0010 1001   (6 bits)

Step 3 - reassemble by shifting the leading bits 6 places up
  (0x03 << 6) | 0x29 = 1100 0000 | 0010 1001 = 1110 1001 = 0x00E9

Result: U+00E9 = é

In C:

#include <stdint.h>

uint8_t byte1 = 0xC3;   // leading byte, 110xxxxx
uint8_t byte2 = 0xA9;   // continuation byte, 10xxxxxx
uint32_t code_point = ((byte1 & 0x1F) << 6) | (byte2 & 0x3F);   // 0xE9

For 3 bytes, the leading byte uses 0x0F (4 data bits) and the shifts become << 12, << 6. For 4 bytes, 0x07 (3 data bits) with << 18, << 12, << 6. Same idea every time: mask each byte to extract its data bits, then position them with << before merging with |.
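Put together, the mask-and-shift steps give a decoder for all four lengths. The sketch below is one possible shape, not a reference implementation: utf8_decode and its return convention (bytes consumed, 0 on error) are assumptions for the example. It accumulates with value << 6 in a loop, which is equivalent to the explicit << 18, << 12, << 6 shifts, and it adds the checks a strict decoder needs beyond the bit patterns: overlong forms, surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from s (len bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is truncated or malformed. */
size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    size_t n;        /* sequence length announced by the leading byte */
    uint32_t value;  /* data bits collected so far */
    uint32_t min;    /* smallest code point allowed for this length */

    if ((s[0] & 0x80) == 0x00)      { n = 1; value = s[0];        min = 0x0;     }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; value = s[0] & 0x1F; min = 0x80;    }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; value = s[0] & 0x0F; min = 0x800;   }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; value = s[0] & 0x07; min = 0x10000; }
    else return 0;                           /* continuation or invalid lead byte */

    if (len < n) return 0;                   /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }

    if (value < min) return 0;                         /* overlong encoding */
    if (value > 0x10FFFF) return 0;                    /* beyond Unicode */
    if (value >= 0xD800 && value <= 0xDFFF) return 0;  /* surrogate zone */

    *cp = value;
    return n;
}
```

Called on 0xC3 0xA9 it stores 0xE9 and returns 2. Called on 0xC0 0x80, an overlong spelling of U+0000, it returns 0 even though both bytes match the 2-byte pattern.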

Encoding a code point in UTF-8

To encode a Unicode code point as UTF-8 bytes, pick the format based on the code point's value, then spread its bits into the x slots of the format.

Code point range	Significant bits	UTF-8 format	Bytes
`U+0000`–`U+007F`	7 bits	`0xxxxxxx`	1
`U+0080`–`U+07FF`	11 bits	`110xxxxx 10xxxxxx`	2
`U+0800`–`U+FFFF`	16 bits	`1110xxxx 10xxxxxx 10xxxxxx`	3
`U+10000`–`U+10FFFF`	21 bits	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	4

Every continuation byte always carries 6 bits; only the leading byte varies.

General algorithm:

Find the code point's range → derive the byte count and format
Write the code point in binary on the required number of significant bits (pad with leading zeros)
Split into chunks and fill the x slots of the format (6 bits per continuation byte)
Prepend the markers (110, 1110, 11110, 10) to each chunk
Convert each binary byte to hexadecimal (1 hex digit per 4-bit group)
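In code, the binary and hexadecimal steps disappear: picking the range selects the format, and shifts plus the 0x3F mask do the splitting. A minimal sketch, assuming a hypothetical utf8_encode that returns the number of bytes written and 0 for values that are not encodable (surrogates, anything above U+10FFFF):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {                        /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate zone */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond U+10FFFF */
}
```

Testing the ranges from smallest to largest means each code point always gets the shortest form, which is the only one UTF-8 allows.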

Encoding examples

1 byte: `A` (U+0041)

Code point: U+0041 = 65
Range     : U+0000–U+007F → 1 byte, format 0xxxxxxx

Binary on 7 bits: 1000001

Fill the format:

  Format:  0xxxxxxx
  Bits  :   1000001
  Byte  :  01000001

Convert to hex (1 hex digit = 4 bits):

  01000001 = 0100 0001 = 0x41

Result: 0x41

ASCII and UTF-8 are identical on this range - that's the backwards compatibility.

2 bytes: `é` (U+00E9)

Code point: U+00E9 = 233
Range     : U+0080–U+07FF → 2 bytes, format 110xxxxx 10xxxxxx

Binary on 11 bits: 000 1110 1001
Split            : 00011 | 101001  (5 high bits, 6 low bits)

Fill the format (`110`/`10` come from the format, the bits fill the `x`s):

  Format:  110xxxxx   10xxxxxx
  Bits  :     00011     101001
  Bytes :  11000011   10101001

Convert to hex:

  11000011 = 1100 0011 = 0xC3
  10101001 = 1010 1001 = 0xA9

Result: 0xC3 0xA9

In Latin-1, é = 0xE9 (1 byte). When Ã© shows up in output, that's UTF-8 (0xC3 0xA9) being read as if it were Latin-1: 0xC3 → Ã, 0xA9 → ©.
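The same misreading becomes permanent when a program converts "Latin-1" input to UTF-8 and the input was UTF-8 all along. The sketch below reproduces that double encoding; latin1_to_utf8 is a hypothetical helper that treats each byte as a Latin-1 character, where the code point equals the byte value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Re-encodes Latin-1 bytes as UTF-8. out must have room for 2 * len bytes.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const uint8_t *in, size_t len, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                             /* ASCII, unchanged */
        } else {
            out[n++] = (uint8_t)(0xC0 | (in[i] >> 6));    /* 110xxxxx */
            out[n++] = (uint8_t)(0x80 | (in[i] & 0x3F));  /* 10xxxxxx */
        }
    }
    return n;
}
```

Fed the UTF-8 bytes of é (0xC3 0xA9), it writes 0xC3 0x83 0xC2 0xA9: valid UTF-8, but for the two characters Ã and ©. From that point on the file really does contain Ã©, and no viewer setting will bring the é back.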

3 bytes: `中` (U+4E2D)

Code point: U+4E2D = 20013
Range     : U+0800–U+FFFF → 3 bytes, format 1110xxxx 10xxxxxx 10xxxxxx

Binary on 16 bits: 0100 1110 0010 1101
Split            : 0100 | 111000 | 101101  (4 bits, 6 bits, 6 bits)

Fill the format:

  Format:  1110xxxx   10xxxxxx   10xxxxxx
  Bits  :      0100     111000     101101
  Bytes :  11100100   10111000   10101101

Convert to hex:

  11100100 = 1110 0100 = 0xE4
  10111000 = 1011 1000 = 0xB8
  10101101 = 1010 1101 = 0xAD

Result: 0xE4 0xB8 0xAD

Common CJK characters (Chinese, Japanese, Korean) sit in the BMP and take 3 bytes; the rarer ideographs in the supplementary planes take 4.

4 bytes: `😀` (U+1F600)

Code point: U+1F600 = 128512
Range     : U+10000–U+10FFFF → 4 bytes, format 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Binary on 21 bits: 0 0001 1111 0110 0000 0000
Split            : 000 | 011111 | 011000 | 000000  (3 bits, 6 bits, 6 bits, 6 bits)

Fill the format:

  Format:  11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
  Bits  :       000     011111     011000     000000
  Bytes :  11110000   10011111   10011000   10000000

Convert to hex:

  11110000 = 1111 0000 = 0xF0
  10011111 = 1001 1111 = 0x9F
  10011000 = 1001 1000 = 0x98
  10000000 = 1000 0000 = 0x80

Result: 0xF0 0x9F 0x98 0x80

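Most emoji are 4 bytes (a few older ones, such as ☺ U+263A or ❤ U+2764, sit in the BMP and take 3) - same goes for everything else in the supplementary Unicode planes (historic scripts like cuneiform, mathematical alphanumeric symbols, etc.).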
