Character encodings, from ASCII to UTF-8

May 15, 2026

Tags: encoding, unicode, utf-8

An é that turns into Ã© in a terminal, a "Latin-1" file that actually contains Windows-1252, a hidden BOM at the start of a CSV that breaks a script: it all comes from the same place. Fitting letters into bytes is a simple problem that got complicated the moment we tried to leave English behind. ASCII held the line as long as it could, Latin-1 and Windows-1252 tried to push the walls, UTF-16 started with fixed 16-bit units before having to extend them with surrogate pairs, and UTF-8 ended up solving what the others had left behind. Here's the rundown, with the details that matter when you bump into them.

1. 8-bit encodings

ASCII

  • 7 bits, 128 characters
  • 0x00–0x1F: control characters (NUL, LF, CR, TAB, BEL...)
  • 0x20–0x7E: printable (space included)
  • 0x7F: DEL
  • Universal base: nearly every 8-bit encoding starts by extending ASCII (EBCDIC is the notable exception)

Latin-1 (ISO 8859-1)

  • 8 bits, 256 characters = ASCII + 128
  • 0x00–0x7F: identical to ASCII
  • 0x80–0x9F: C1 controls (effectively empty)
  • 0xA0–0xFF: printable Western European characters (é, à, ç, ñ, ü...)

Windows-1252 (CP1252)

  • Identical to Latin-1 except for 0x80–0x9F, which holds useful printable characters (€ at 0x80, typographic quotes, …, –, —, ™...)
  • 0xA0–0xFF: strictly identical to Latin-1
  • The reality of most text files created on Windows, even when the header claims "Latin-1"
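
The difference between the two encodings fits in a 32-entry lookup table. The sketch below is one possible way to map a Windows-1252 byte to its Unicode code point; the names cp1252_to_unicode and CP1252_HIGH are hypothetical, and returning U+FFFD for the five unassigned bytes is an assumption (some decoders map them to C1 controls instead).

```c
#include <stdint.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   0 marks the five bytes that Windows-1252 leaves unassigned. */
static const uint16_t CP1252_HIGH[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, /* 0x80..0x87 */
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,      /* 0x88..0x8F */
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, /* 0x90..0x97 */
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178  /* 0x98..0x9F */
};

/* Returns the code point for one Windows-1252 byte.
   Assumption: unassigned bytes become U+FFFD (REPLACEMENT CHARACTER). */
uint32_t cp1252_to_unicode(uint8_t b)
{
    if (b < 0x80 || b >= 0xA0)
        return b;                      /* identical to ASCII / Latin-1 */
    uint16_t cp = CP1252_HIGH[b - 0x80];
    return cp ? cp : 0xFFFD;
}
```

Decoding the same byte as Latin-1 would simply return the byte value, which is why 0x80 comes out as an invisible control character instead of €.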

2. Detecting Latin-1 vs Windows-1252

To tell them apart on an 8-bit buffer:

  • All bytes < 0x80 → pure ASCII, both encodings agree
  • High bytes in 0xA0–0xFF only → indistinguishable, but it doesn't matter: Latin-1 and Windows-1252 render identically on this range (treat as Latin-1)
  • At least one byte in 0x80–0x9F → almost certainly Windows-1252: Latin-1 only has the invisible C1 controls there, which real text practically never contains

Unreliable on small buffers: it's easy to miss the 0x80–0x9F range entirely and guess wrong.
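
The three rules above translate into a single pass over the buffer. This is a minimal sketch, not a full charset detector: guess8_t and guess_8bit_encoding are hypothetical names, and it assumes the input is already known to be some single-byte Western encoding (it does not try to rule out UTF-8).

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { GUESS_ASCII, GUESS_LATIN1, GUESS_CP1252 } guess8_t;

/* Applies the three rules to buf[0..len). */
guess8_t guess_8bit_encoding(const uint8_t *buf, size_t len)
{
    guess8_t guess = GUESS_ASCII;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return GUESS_CP1252;       /* byte in 0x80..0x9F settles it */
        if (buf[i] >= 0xA0)
            guess = GUESS_LATIN1;      /* high byte seen, keep scanning */
    }
    return guess;
}
```

The early return is safe because a single byte in 0x80–0x9F is enough to decide; everything else needs the whole buffer.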

3. UTF-16 and UTF-32

256 characters obviously isn't enough to cover every writing system on earth. Unicode first tried to fit everything into a fixed unit wider than a byte: 16 bits, then 32.

UTF-16

  • Variable length: 2 or 4 bytes per character
  • U+0000–U+FFFF (BMP, Basic Multilingual Plane): 2 bytes directly
  • U+10000–U+10FFFF: 4 bytes via surrogate pair
  • Endianness matters → BOM required (unless the byte order is declared out of band, as with the UTF-16BE / UTF-16LE labels)

UCS-2 vs UTF-16

UCS-2 (1991), the ancestor of UTF-16, was fixed-width: 2 bytes per character, period. But the BMP - the first 65,536 Unicode code points - turned out to be insufficient to cover every writing system. UTF-16 (1996) inherited from UCS-2 and introduced surrogate pairs to extend coverage without breaking backward compatibility with existing UCS-2 implementations.

The surrogate zone (U+D800–U+DFFF)

These 2048 code points are permanently forbidden as Unicode characters - not just reserved for UTF-16's mechanics, but unusable in any encoding. A file containing a lone character in this range is by definition malformed.

Surrogate pair algorithm

For a non-BMP code point (U+10000–U+10FFFF), build two 16-bit units:

offset = code_point - 0x10000
high   = 0xD800 + (offset >> 10)
low    = 0xDC00 + (offset & 0x3FF)

Why this works:

  • Subtracting 0x10000 rebases the range at zero. The non-BMP range spans U+10000 to U+10FFFF, which is 0x100000 values = exactly 20 bits of information.
  • 20 bits = 10 + 10: split into two 16-bit units, each with a fixed 6-bit prefix (0xD800 for the high, 0xDC00 for the low), leaving 10 useful bits per unit.
  • High surrogate (U+D800–U+DBFF): 1024 values (top 10 bits of the offset).
  • Low surrogate (U+DC00–U+DFFF): 1024 values (bottom 10 bits of the offset).
  • 1024 × 1024 = 1,048,576 combinations → covers exactly the non-BMP range.

Example: 😀 (U+1F600)

Code point: U+1F600 = 128512
Range     : non-BMP → surrogate pair required

Offset:
  offset = 0x1F600 - 0x10000 = 0xF600 = 62976

Split into 20 bits (10 high + 10 low):
  0xF600 on 20 bits: 0000 1111 0110 0000 0000
  Split            : 0000 1111 01 | 10 0000 0000
                   = 0x03D         | 0x200

Compute the two 16-bit units:
  high = 0xD800 + 0x03D = 0xD83D
  low  = 0xDC00 + 0x200 = 0xDE00

Result (bytes, big-endian): 0xD8 0x3D 0xDE 0x00
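
The same computation as a small C sketch, together with its inverse. The function names utf16_encode_pair and utf16_decode_pair are hypothetical; the decoder assumes it is handed a valid high/low pair and does no checking of its own.

```c
#include <stdint.h>

/* Splits a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
   Returns 0 on success, -1 if cp is outside that range. */
int utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;
    uint32_t offset = cp - 0x10000;                 /* 20-bit value */
    *high = (uint16_t)(0xD800 + (offset >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (offset & 0x3FF));  /* bottom 10 bits */
    return 0;
}

/* Inverse: recombines a valid high/low pair into a code point. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    uint32_t top    = (uint32_t)(high - 0xD800);    /* 10 bits */
    uint32_t bottom = (uint32_t)(low - 0xDC00);     /* 10 bits */
    return 0x10000 + ((top << 10) | bottom);
}
```

For U+1F600 this yields 0xD83D and 0xDE00, matching the hand computation above.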

Practical pitfalls

  • "😀".length in JS, s.length() in Java and s.length in Kotlin return 2, not 1 - the length counts 16-bit units, not code points. To get the code point count: [...s].length in JS, s.codePointCount(0, s.length()) in Java.
  • An isolated surrogate (high without a following low, or low without a preceding high) is malformed text: the pair only makes sense together.
  • The JVM accepts isolated surrogates in memory (Java Strings are sequences of 16-bit units, not real Unicode strings). But such a String does not survive conversion to UTF-8: s.getBytes(StandardCharsets.UTF_8) silently replaces the lone surrogate with ? (0x3F), while a CharsetEncoder left in its default error-reporting mode throws a CharacterCodingException - classic source of bugs in I/O pipelines.
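
To make the unit count vs code point count distinction concrete, here is a sketch in C that walks a buffer of 16-bit units, counts code points, and rejects isolated surrogates. The function name utf16_count_code_points and the -1 error convention are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Counts code points in units[0..n).
   Returns -1 if the buffer contains an isolated surrogate. */
long utf16_count_code_points(const uint16_t *units, size_t n)
{
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_high(units[i])) {
            if (i + 1 >= n || !is_low(units[i + 1]))
                return -1;             /* high without a following low */
            i++;                       /* consume the low half of the pair */
        } else if (is_low(units[i])) {
            return -1;                 /* low without a preceding high */
        }
        count++;
    }
    return count;
}
```

The buffer { 0x0041, 0xD83D, 0xDE00 } holds 3 units but 2 code points (A followed by 😀).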

Why UTF-16 still exists

UTF-8 (1992, Ken Thompson and Rob Pike) predates UTF-16 with surrogates (~1996) by 4 years. UTF-16 survives today through lock-in: Windows NT (1993) and Java (1995) had already bet on UCS-2 before the BMP turned out to be insufficient. When Unicode overflowed U+FFFF in 1996, these platforms couldn't break their ABI - hence retrofitting surrogate pairs onto UCS-2 to produce UTF-16. Without this legacy, UTF-8 would likely be the norm everywhere.

UTF-32

  • Always 4 bytes per character, fixed length
  • Easy to index, but verbose; rare on disk, common in memory for processing
  • Endianness matters → BOM required (unless the byte order is declared out of band)

4. Endianness

Byte storage order for multi-byte units (UTF-16, UTF-32).

Example with U+4E2D (中) in UTF-16:

  • Big-endian: 4E 2D (most significant byte first)
  • Little-endian: 2D 4E (least significant byte first)
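
In code, the portable way to produce either order is to pick the bytes out with shifts rather than copying the integer's memory, so the result does not depend on the machine running it. A small sketch (put_u16_be and put_u16_le are hypothetical helper names):

```c
#include <stdint.h>

/* Writes one 16-bit unit, most significant byte first. */
void put_u16_be(uint16_t unit, uint8_t out[2])
{
    out[0] = (uint8_t)(unit >> 8);
    out[1] = (uint8_t)(unit & 0xFF);
}

/* Writes one 16-bit unit, least significant byte first. */
void put_u16_le(uint16_t unit, uint8_t out[2])
{
    out[0] = (uint8_t)(unit & 0xFF);
    out[1] = (uint8_t)(unit >> 8);
}
```

With unit = 0x4E2D, the first helper writes 4E 2D and the second writes 2D 4E.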

Contexts:

  • Big-endian: TCP/IP networking (network byte order), IBM mainframes, classic PowerPC
  • Little-endian: x86, x64, ARM (as configured on virtually all current systems)
  • Bi-endian (switchable by configuration): ARM, MIPS, PowerPC

5. BOM (Byte Order Mark) - U+FEFF

The BOM is the character U+FEFF placed at the start of a file. Its byte representation reveals the encoding and endianness.

Encoding               BOM            Notes
UTF-8                  EF BB BF       Optional, no endianness. Created by Windows/Notepad. Strip on read.
UTF-16 big-endian      FE FF          Required
UTF-16 little-endian   FF FE          Required
UTF-32 big-endian      00 00 FE FF    Required
UTF-32 little-endian   FF FE 00 00    Required

Classic trap: FF FE looks like a UTF-16 LE BOM, but if the next two bytes are 00 00, it's actually a UTF-32 LE BOM. Read 4 bytes before deciding.
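
A BOM sniffer that avoids this trap only needs to test the 4-byte patterns before the 2-byte ones. The sketch below assumes the caller passes the first bytes of the file; bom_t and detect_bom are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
} bom_t;

/* Inspects up to the first 4 bytes of b[0..n). */
bom_t detect_bom(const uint8_t *b, size_t n)
{
    /* 4-byte patterns first, so FF FE 00 00 is not taken for UTF-16 LE. */
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return BOM_UTF32_BE;
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return BOM_UTF32_LE;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return BOM_UTF16_BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

One residual ambiguity remains: a UTF-16 LE file whose first character is U+0000 also starts with FF FE 00 00. That is rare enough in text files that this sketch sides with UTF-32 LE.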

6. UTF-8

UTF-8 sidesteps all of the above: no wasted space on ASCII text, no endianness, no mandatory BOM.

  • Base unit: 8 bits (1 byte)
  • Variable length: 1 to 4 bytes per character
  • The "8" refers to the base unit size, not the max character size
  • ASCII-compatible: any ASCII file is also valid UTF-8
  • No endianness issue since the unit is 1 byte

Byte patterns

Notation: the 0s and 1s are the fixed marker bits (they identify the format); the xs are the code point's data bits that get placed into those positions.

0xxxxxxx                              → 1 byte  (U+0000–007F, ASCII)
110xxxxx 10xxxxxx                     → 2 bytes (U+0080–07FF)
1110xxxx 10xxxxxx 10xxxxxx            → 3 bytes (U+0800–FFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   → 4 bytes (U+10000–10FFFF)
10xxxxxx                              → continuation byte
                                        (never a leading byte)
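
These patterns are all a reader needs to find out how long a sequence is from its first byte. A minimal sketch, where utf8_seq_len is a hypothetical helper that checks the bit pattern only:

```c
#include <stdint.h>

/* Returns the sequence length announced by a leading byte (1..4),
   or 0 if the byte is a continuation byte or matches no pattern.
   Pattern check only: a strict decoder would also reject
   0xC0, 0xC1 and 0xF5..0xF7, which can never appear in valid UTF-8. */
int utf8_seq_len(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx or 11111xxx */
}
```

Because continuation bytes are recognizable on their own, a reader dropped in the middle of a stream can skip bytes until this function returns non-zero and resynchronize on the next character.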

Bit manipulation - quick refresher

UTF-8 encoding and decoding both rely on two bitwise operations: masking with AND (&) to isolate specific bits within a byte, and shifting (>> right for encoding, << left for decoding) to reposition them.

Masking with AND (&)

A mask is a value you combine with AND to isolate bits. Rules: 1 AND x = x, 0 AND x = 0. The 1 bits of the mask "let through", the 0 bits "erase".

Example - isolate the low 6 bits of 0xE9 using the mask 0x3F:

  1110 1001   (0xE9)
& 0011 1111   (0x3F = low 6-bit mask)
-----------
  0010 1001   → low 6 bits isolated

The mask 0x3F shows up everywhere in UTF-8 because each continuation byte carries exactly 6 bits.

Right shift (>>)

x >> n pushes every bit n positions to the right. The low bits fall off, and zeros fill the vacated high positions (for unsigned values).

Example - get what's left above the low 6 bits of 0xE9:

0xE9       = 1110 1001
0xE9 >> 6  = 0000 0011   → the 2 remaining high bits

In UTF-8, once you've extracted the low 6 bits for the continuation byte, >> 6 lets you grab the bits above to place them in the leading byte.

Left shift (<<)

x << n pushes every bit n positions to the left. The high bits fall off (past the type's width), the low bits move up.

Example - shift 0x03 back six places up:

0x03       = 0000 0011
0x03 << 6  = 1100 0000   → bits repositioned

In UTF-8, it's the inverse of >>: when decoding, you use it to lift a leading byte's data bits back into the high position before merging them with a continuation byte's bits.

Combining the two

To build the two UTF-8 bytes for é (U+00E9) in C:

#include <stdint.h>

uint8_t c = 0xE9;                    // code point U+00E9 (fits in 8 bits)

uint8_t byte1 = 0xC0 | (c >> 6);     // marker 110 + high bits   -> 0xC3
uint8_t byte2 = 0x80 | (c & 0x3F);   // marker 10  + low 6 bits  -> 0xA9

  • c & 0x3F isolates the low 6 bits
  • c >> 6 isolates the bits above
  • | 0xC0 prepends the marker 110 to the leading byte
  • | 0x80 prepends the marker 10 to the continuation byte

It's the direct translation of the "high bits / low 6 bits" split into code.

Useful masks for decoding

Format detection (on the leading byte):

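  • 0xC0 = marker 110xxxxx (start of a 2-byte sequence): test with (b & 0xE0) == 0xC0
  • 0xE0 = marker 1110xxxx (start of a 3-byte sequence): test with (b & 0xF0) == 0xE0
  • 0xF0 = marker 11110xxx (start of a 4-byte sequence): test with (b & 0xF8) == 0xF0
  • 0x80 = marker 10xxxxxx (continuation byte): test with (b & 0xC0) == 0x80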

Data bit extraction:

  • 0x1F = mask for the low 5 bits (00011111) - data bits of a leading byte in a 2-byte sequence
  • 0x0F = mask for the low 4 bits (00001111) - data bits of a leading byte in a 3-byte sequence
  • 0x07 = mask for the low 3 bits (00000111) - data bits of a leading byte in a 4-byte sequence
  • 0x3F = mask for the low 6 bits (00111111) - data bits of a continuation byte

Decoding example

To decode the UTF-8 sequence 0xC3 0xA9:

Input bytes: 0xC3 0xA9

Step 1 - identify the format from the leading byte
  0xC3 = 1100 0011
  Prefix `110` → 2-byte sequence

Step 2 - extract the data bits from each byte
  Leading byte      : 0xC3 & 0x1F = 0000 0011   (5 bits)
  Continuation byte : 0xA9 & 0x3F = 0010 1001   (6 bits)

Step 3 - reassemble by shifting the leading bits 6 places up
  (0x03 << 6) | 0x29 = 1100 0000 | 0010 1001 = 1110 1001 = 0x00E9

Result: U+00E9 = é

In C:

uint8_t byte1 = 0xC3;
uint8_t byte2 = 0xA9;
uint32_t code_point = ((byte1 & 0x1F) << 6) | (byte2 & 0x3F);   // 0xE9

For 3 bytes, the leading byte uses 0x0F (4 data bits) and the shifts become << 12, << 6. For 4 bytes, 0x07 (3 data bits) with << 18, << 12, << 6. Same idea every time: mask each byte to extract its data bits, then position them with << before merging with |.
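
Generalized to all four lengths, the mask-and-shift recipe could look like the sketch below. The name utf8_decode and its return convention (bytes consumed, 0 on error) are assumptions for illustration; the extra checks reject truncated input, overlong forms such as C0 80, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from s[0..n). Stores it in *cp and returns the
   number of bytes consumed (1..4), or 0 if the sequence is invalid. */
size_t utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len;
    uint32_t v;

    if (n == 0) return 0;
    if      (s[0] < 0x80)           { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                          /* continuation or invalid lead */

    if (n < len) return 0;                  /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not 10xxxxxx */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F); /* append 6 data bits */
    }
    if (v < min_cp[len]) return 0;              /* overlong encoding */
    if (v > 0x10FFFF) return 0;                 /* beyond Unicode */
    if (v >= 0xD800 && v <= 0xDFFF) return 0;   /* surrogate zone */
    *cp = v;
    return len;
}
```

Shifting the accumulator left by 6 before each continuation byte is equivalent to the << 12, << 6 (or << 18, << 12, << 6) placements described above, just applied incrementally.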

Encoding a code point in UTF-8

To encode a Unicode code point as UTF-8 bytes, pick the format based on the code point's value, then spread its bits into the x slots of the format.

Code point range    Significant bits   UTF-8 format                          Bytes
U+0000–U+007F       7 bits             0xxxxxxx                              1
U+0080–U+07FF       11 bits            110xxxxx 10xxxxxx                     2
U+0800–U+FFFF       16 bits            1110xxxx 10xxxxxx 10xxxxxx            3
U+10000–U+10FFFF    21 bits            11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   4

Every continuation byte always carries 6 bits; only the leading byte varies.

General algorithm:

  1. Find the code point's range → derive the byte count and format
  2. Write the code point in binary on the required number of significant bits (pad with leading zeros)
  3. Split into chunks and fill the x slots of the format (6 bits per continuation byte)
  4. Prepend the markers (110, 1110, 11110, 10) to each chunk
  5. Convert each binary byte to hexadecimal (1 hex digit per 4-bit group)
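
In code, steps 2 to 4 collapse into shifts, masks and ORs. The sketch below is one way to write it; utf8_encode is a hypothetical name, and refusing surrogates and values above U+10FFFF (by returning 0) is a choice of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes cp into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp <= 0x7F) {                               /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                              /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                             /* 1110xxxx + 2 continuations */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                               /* surrogate zone */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                           /* 11110xxx + 3 continuations */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                       /* beyond U+10FFFF */
}
```

Each branch is the same pattern: shift the code point so the wanted chunk sits in the low bits, mask to 6 bits for continuation bytes, and OR in the marker.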

Encoding examples

1 byte: A (U+0041)

Code point: U+0041 = 65
Range     : U+0000–U+007F → 1 byte, format 0xxxxxxx

Binary on 7 bits: 1000001

Fill the format:

  Format:  0xxxxxxx
  Bits  :   1000001
  Byte  :  01000001

Convert to hex (1 hex digit = 4 bits):

  01000001 = 0100 0001 = 0x41

Result: 0x41

ASCII and UTF-8 are identical on this range - that's the backwards compatibility.

2 bytes: é (U+00E9)

Code point: U+00E9 = 233
Range     : U+0080–U+07FF → 2 bytes, format 110xxxxx 10xxxxxx

Binary on 11 bits: 000 1110 1001
Split            : 00011 | 101001  (5 high bits, 6 low bits)

Fill the format (`110`/`10` come from the format, the bits fill the `x`s):

  Format:  110xxxxx   10xxxxxx
  Bits  :     00011     101001
  Bytes :  11000011   10101001

Convert to hex:

  11000011 = 1100 0011 = 0xC3
  10101001 = 1010 1001 = 0xA9

Result: 0xC3 0xA9

In Latin-1, é = 0xE9 (1 byte). When Ã© shows up in output, that's UTF-8 (0xC3 0xA9) being read as if it were Latin-1: 0xC3 → Ã, 0xA9 → ©.
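
The accident is easy to reproduce: take bytes that are already UTF-8, treat each one as a Latin-1 character, and encode the result as UTF-8 again. The helper below (latin1_to_utf8, a hypothetical name) is a correct Latin-1 to UTF-8 converter; the bug only appears when it is fed input that was not Latin-1 in the first place.

```c
#include <stddef.h>
#include <stdint.h>

/* Treats every input byte as a Latin-1 character (code point == byte value)
   and writes its UTF-8 encoding. out must have room for 2 * n bytes.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[w++] = in[i];                             /* ASCII: unchanged */
        } else {
            out[w++] = (uint8_t)(0xC0 | (in[i] >> 6));    /* 110xxxxx */
            out[w++] = (uint8_t)(0x80 | (in[i] & 0x3F));  /* 10xxxxxx */
        }
    }
    return w;
}
```

Feeding it the two bytes C3 A9 produces the four bytes C3 83 C2 A9, which a UTF-8 terminal displays as Ã©. Each extra round of the same mistake doubles the damage again.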

3 bytes: 中 (U+4E2D)

Code point: U+4E2D = 20013
Range     : U+0800–U+FFFF → 3 bytes, format 1110xxxx 10xxxxxx 10xxxxxx

Binary on 16 bits: 0100 1110 0010 1101
Split            : 0100 | 111000 | 101101  (4 bits, 6 bits, 6 bits)

Fill the format:

  Format:  1110xxxx   10xxxxxx   10xxxxxx
  Bits  :      0100     111000     101101
  Bytes :  11100100   10111000   10101101

Convert to hex:

  11100100 = 1110 0100 = 0xE4
  10111000 = 1011 1000 = 0xB8
  10101101 = 1010 1101 = 0xAD

Result: 0xE4 0xB8 0xAD

Common CJK characters (Chinese, Japanese, Korean) all live in 3 bytes.

4 bytes: 😀 (U+1F600)

Code point: U+1F600 = 128512
Range     : U+10000–U+10FFFF → 4 bytes, format 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Binary on 21 bits: 0 0001 1111 0110 0000 0000
Split            : 000 | 011111 | 011000 | 000000  (3 bits, 6 bits, 6 bits, 6 bits)

Fill the format:

  Format:  11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
  Bits  :       000     011111     011000     000000
  Bytes :  11110000   10011111   10011000   10000000

Convert to hex:

  11110000 = 1111 0000 = 0xF0
  10011111 = 1001 1111 = 0x9F
  10011000 = 1001 1000 = 0x98
  10000000 = 1000 0000 = 0x80

Result: 0xF0 0x9F 0x98 0x80

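Most emoji live outside the BMP and take 4 bytes (a few older ones, such as ☺ U+263A, sit in the BMP and take 3) - same goes for everything else in the supplementary Unicode planes (historic scripts like cuneiform, mathematical alphanumeric symbols, etc.).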
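
To compare the encodings side by side, the size rules from this article fit in a few lines. This is a sketch with hypothetical names (enc_size_t, encoded_size), and it assumes cp is a valid code point rather than checking it.

```c
#include <stdint.h>

typedef struct { int utf8; int utf16; int utf32; } enc_size_t;

/* Bytes needed to store one valid code point in each encoding. */
enc_size_t encoded_size(uint32_t cp)
{
    enc_size_t s;
    s.utf8  = cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
    s.utf16 = cp <= 0xFFFF ? 2 : 4;    /* one unit, or a surrogate pair */
    s.utf32 = 4;                       /* always one 32-bit unit */
    return s;
}
```

For the four examples above: A is 1 / 2 / 4 bytes, é is 2 / 2 / 4, 中 is 3 / 2 / 4, and 😀 is 4 / 4 / 4 in UTF-8 / UTF-16 / UTF-32.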
