Everything you thought you knew about strings is wrong.
Surely you’ve seen web pages like this, with strange
question-mark-like characters where apostrophes should
be. That usually means the page author didn’t declare
their character encoding correctly, your browser was
left guessing, and the result was a mix of expected and
unexpected characters. In English it’s merely annoying; in
other languages, the result can be completely
unreadable.
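Here is a minimal sketch of how such a page can go wrong, in Python 3. It assumes the author saved the page as windows-1252 and the browser guessed UTF-8. Both choices are illustrative assumptions, as is the sample word; the real culprits vary from page to page.

```python
# Hypothetical page text containing a curly apostrophe (U+2019).
text = "didn\u2019t"

# The author saves the page as windows-1252 but never declares it.
raw = text.encode("cp1252")

# The browser guesses UTF-8. The apostrophe's single byte (0x92) is not
# valid there, so it gets swapped for the replacement character U+FFFD,
# which most browsers draw as a question mark in a diamond.
guessed = raw.decode("utf-8", errors="replace")

print(raw)                        # b'didn\x92t'
print(guessed == "didn\ufffdt")   # True
```

The rest of the word survives because its letters are stored as the same bytes in both encodings; only the apostrophe is lost.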
There are character encodings for each major language
in the world. Since each language is different, and
memory and disk space have historically been expensive,
each character encoding is optimized for a particular
language. By that, I mean each encoding uses the same
numbers (0–255) to represent that language’s characters.
For instance, you’re probably familiar with the ASCII
encoding, which stores English characters as numbers
ranging from 0 to 127. (65 is capital “A”, 97 is
lowercase “a”, &c.) English has a very simple alphabet,
so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2,
that’s 7 out of the 8 bits in a byte.
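You can check those ASCII numbers yourself. The following is a small sketch using only Python 3 built-ins; the variable names are made up for illustration.

```python
# ord() gives the number behind a character; chr() goes the other way.
capital_a = ord("A")
lowercase_a = ord("a")
print(capital_a, lowercase_a)   # 65 97
print(chr(65), chr(97))         # A a

# Encoding to ASCII stores each character as exactly one byte.
encoded = "Aa".encode("ascii")
print(list(encoded))            # [65, 97]

# The largest ASCII value, 127, needs only 7 of a byte's 8 bits.
bits_needed = (127).bit_length()
print(bits_needed)              # 7
```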
Western European languages like French, Spanish, and German have more letters than English. Or, more
precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most
common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on
Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then
extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252),
&c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
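A quick sketch of the same idea in Python 3, using the standard `cp1252` codec. The comparison with windows-1251 (a Cyrillic single-byte encoding) is my own addition, picked only to show that the same number can mean something else in a different encoding.

```python
# Each accented letter is still a single byte in CP-1252.
n_tilde = "ñ".encode("cp1252")
u_umlaut = "ü".encode("cp1252")
print(list(n_tilde), list(u_umlaut))   # [241] [252]

# In the 0-127 range, CP-1252 and ASCII agree.
same_as_ascii = "A".encode("cp1252") == "A".encode("ascii")
print(same_as_ascii)                   # True

# The same byte, 241, read under two different single-byte encodings.
as_western = bytes([241]).decode("cp1252")
as_cyrillic = bytes([241]).decode("cp1251")
print(as_western, as_cyrillic)         # ñ с  (the second is Cyrillic)
```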
Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they
require multiple-byte character sets. That is, each “character” is represented by a two-byte number from
0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings,
namely that they each use the same numbers to mean different things. It’s just that the range of numbers is
broader, because there are many more characters to represent.
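To see the multi-byte version of the problem, here is a rough sketch in Python 3. The two encodings, Big5 and GB2312, are my own picks for illustration (both ship with Python as standard codecs), and so is the sample character.

```python
# One Chinese character (U+4E2D), stored under two multi-byte encodings.
character = "\u4e2d"
big5_bytes = character.encode("big5")
gb2312_bytes = character.encode("gb2312")

# Both encodings spend two bytes on it...
print(len(big5_bytes), len(gb2312_bytes))   # 2 2

# ...but they do not agree on which two bytes.
print(big5_bytes == gb2312_bytes)           # False

# Reading Big5 bytes as if they were GB2312 does not give the
# original character back.
misread = big5_bytes.decode("gb2312", errors="replace")
print(misread == character)                 # False
```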