Western Latin character sets (computing)

Updated on Sep 22, 2024

Several binary representations of character sets for common Western European languages are compared in this article. These encodings were designed to represent Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic, which use the Latin alphabet, a few additional letters, letters with precomposed diacritics, some punctuation, and various symbols (including some Greek letters). Although they are called "Western European", many of these languages are spoken all over the world. These character sets also happen to support many other languages, such as Malay, Swahili, and Classical Latin.

Summary

The ISO-8859 series of 8-bit character sets covers all the Latin-script languages used in Europe, but the same code values carry different meanings in different parts of the series, which caused some difficulty. The arrival of Unicode, with a unique code point for every character, resolved these issues.

ISO/IEC 8859-1, or Latin-1, is the most widely used and also defines the first 256 code points of Unicode.
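The relationship between Latin-1 and the first 256 Unicode code points can be checked with a short Python sketch. It assumes Python's built-in "latin-1" codec is a faithful implementation of ISO/IEC 8859-1; the variable names are purely illustrative.

```python
# Decode every possible byte value as ISO-8859-1 (Python codec name: "latin-1").
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value N should come out as the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))
```

Because the mapping is one-to-one, converting Latin-1 text to Unicode code points needs no lookup table at all.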

ISO/IEC 8859-15 modifies ISO-8859-1 to fully support Estonian, Finnish and French and add the euro sign.
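To see exactly which positions ISO-8859-15 changes, one option is to decode every byte value with both codecs and list the mismatches. This is a minimal sketch that relies on Python's bundled "iso-8859-1" and "iso-8859-15" codecs; the helper name `differing_bytes` is made up for this example.

```python
def differing_bytes(codec_a, codec_b):
    """Return (byte, char_a, char_b) for each byte the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        a = raw.decode(codec_a)
        b = raw.decode(codec_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs


latin9_changes = differing_bytes("iso-8859-1", "iso-8859-15")
for value, old, new in latin9_changes:
    print(f"0x{value:02X}: {old!r} -> {new!r}")
```

The list is short: the euro sign takes over the generic currency sign at 0xA4, and the remaining changes swap symbols for letters such as Š, Ž, Œ, and Ÿ.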

Windows-1252 is a superset of the printable characters of ISO-8859-1: it fills the range that ISO-8859-1 reserves for C1 control codes with additional characters, including those added by ISO-8859-15 and popular punctuation such as curved quotation marks. Web page tools for Windows commonly use Windows-1252 but label the page as ISO-8859-1. HTML 5 addresses this by mandating that pages labeled as ISO-8859-1 be interpreted as Windows-1252.
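The practical effect of the mislabeling can be illustrated in Python. The sketch below decodes the same bytes both ways and then applies an HTML 5 style label override. The function `decode_labelled` and its small label table are hypothetical simplifications written for this example, not a real browser or library API.

```python
# Bytes as a Windows tool might write them: curved quotes and a euro sign.
raw = b"\x93quoted\x94 \x80 5"

as_cp1252 = raw.decode("cp1252")    # curved quotes and euro sign
as_latin1 = raw.decode("latin-1")   # invisible C1 control characters instead

# Hypothetical, heavily simplified label table in the spirit of HTML 5:
# a page labeled ISO-8859-1 is decoded as Windows-1252.
LABEL_OVERRIDES = {"iso-8859-1": "cp1252", "latin1": "cp1252", "windows-1252": "cp1252"}


def decode_labelled(data, label):
    """Decode data using the codec that the (assumed) override table selects."""
    codec = LABEL_OVERRIDES.get(label.strip().lower(), label)
    return data.decode(codec)


print(decode_labelled(raw, "ISO-8859-1"))
```

With the override in place, the mislabeled page shows the punctuation its author intended rather than control characters.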

IBM CP437, being intended for English only, has very little in the way of accented letters but has far more graphics characters than the others and also some Greek characters that are useful as technical symbols.

IBM CP850 has all the printable characters that ISO-8859-1 has (albeit arranged differently) and still manages to have enough graphics characters to build a usable text-mode user interface.

IBM CP858 differs from CP850 in only one character: the dotless i (ı), rarely used outside Turkey, was replaced by the euro sign (€).
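The single-character difference can be confirmed by comparing Python's "cp850" and "cp858" codecs byte by byte. This is a small sketch that assumes those bundled codecs match the IBM code pages; the variable name is illustrative.

```python
# Collect every byte value that CP850 and CP858 decode to different characters.
differences = [
    (value, bytes([value]).decode("cp850"), bytes([value]).decode("cp858"))
    for value in range(256)
    if bytes([value]).decode("cp850") != bytes([value]).decode("cp858")
]

for value, old, new in differences:
    print(f"0x{value:02X}: U+{ord(old):04X} -> U+{ord(new):04X}")
```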

IBM CP859 contains all the printable characters that ISO-8859-15 has, so unlike CP850 it supports the € and French.

IBM code pages 037, 500, and 1047 are EBCDIC encodings that include all of the ISO-8859-1 characters.
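A quick way to test the EBCDIC claim is to encode the upper half of ISO-8859-1 with an EBCDIC codec and see whether anything fails. The sketch below uses Python's "cp037" and "cp500" codecs. Code page 1047 is left out because the standard library does not appear to ship a codec for it, which is an assumption about the Python distribution, not about the code page itself.

```python
# The 96 printable characters in the upper half of ISO-8859-1 (U+00A0 to U+00FF).
latin1_upper = "".join(chr(cp) for cp in range(0xA0, 0x100))

ebcdic_sizes = {}
for codec in ("cp037", "cp500"):
    # encode() raises UnicodeEncodeError if any character is missing from the code page.
    ebcdic_sizes[codec] = len(latin1_upper.encode(codec))

# The same characters sit at very different byte values than in ASCII-based sets.
a_in_ebcdic = "A".encode("cp037")
a_in_latin1 = "A".encode("latin-1")
print(ebcdic_sizes, a_in_ebcdic.hex(), a_in_latin1.hex())
```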

The Mac OS Roman character set (often referred to as MacRoman and known by the IANA as simply MACINTOSH) has most, but not all, of the same characters as ISO-8859-1 but in a very different arrangement; and it also adds many technical and mathematical characters (though it lacks the important ×) and more diacritics. Older Macintosh web browsers were known to munge the few characters that were in ISO-8859-1 but not their native Macintosh character set when editing text from Web sites. Conversely, in Web material prepared on an older Macintosh, many characters were displayed incorrectly when read by other operating systems.
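The gaps in Mac OS Roman can be listed by trying to encode each printable ISO-8859-1 character. This sketch assumes Python's "mac_roman" codec reflects the character set described here; the list name `missing` is illustrative.

```python
# Try every character from the upper half of ISO-8859-1 against Mac OS Roman.
missing = []
for cp in range(0xA0, 0x100):
    ch = chr(cp)
    try:
        ch.encode("mac_roman")
    except UnicodeEncodeError:
        missing.append(ch)

print("Not in Mac OS Roman:", " ".join(missing))

# Characters present in both sets still sit at different byte values,
# which is how text ends up garbled when the wrong codec is applied.
garbled = "caf\u00e9".encode("mac_roman").decode("latin-1")
print(repr(garbled))
```

Any character that appears in `missing` is one an older Macintosh browser had no native way to store, which matches the kind of corruption described above.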

The euro sign post-dates these (ISO-8859) specifications: conflicting ways to retrofit it led to significant difficulty until Unicode became more generally adopted.

History

The earlier seven-bit U.S. ASCII encoding has characters sufficient to properly represent only English, Latin, and Swahili. It is missing some letters and letter-diacritic combinations used in other Latin-alphabet languages. However, since there was no other choice on most U.S.-supplied computer platforms, ASCII was unavoidable in most of the non-English-speaking world (seven-bit encoding was necessitated by the limitations of early computing networks). There was the ISO 646 group of encodings which replaced some of the symbols in ASCII with local characters, but space was very limited, and some of the symbols replaced were quite common in things like programming languages.

Although seven-bit communication was the norm, most computers internally used eight-bit bytes, and they mostly put some form of characters in the 128 higher byte positions. In the early days most of these were system specific, but gradually a few standards were settled on.

In recent years, as storage and memory costs have fallen, the ambiguity caused by a given eight-bit code having multiple meanings (there are seven ISO-Latin code sets alone) has ceased to be justified. All major operating systems have moved to Unicode as their main internal representation. However, Windows does not support Unicode through its 8-bit character interfaces (which it could do by accepting UTF-8 in standard interfaces such as fopen), so many applications remain restricted to these legacy character sets.

The euro sign

The introduction of the euro created significant pressure to support its sign (€), and most 8-bit character sets had to be adapted in some way.

Apple, with MacRoman, and Sun Microsystems, with the Solaris OS, simply replaced the generic currency sign (¤) with the euro sign. This caused significant difficulty because organisations had found other uses for it, such as a company logo.

ISO introduced a further variant of ISO 8859, ISO 8859-15, which replaced the generic currency sign with the euro sign as well as making some other replacements of symbols with letters with diacritics. ISO 8859-15 never received widespread adoption.

Windows-1252 placed the euro sign in a previously unassigned position (0x80) within the range that ISO 8859-1 reserves for C1 control codes.

All of these issues have been resolved as operating systems have been upgraded to support Unicode as standard, which encodes the euro sign at U+20AC (decimal 8364).
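The conflicting retrofits are easy to see by encoding the euro sign with several codecs side by side. The following Python sketch assumes the standard library codecs named in the tuple behave like the character sets discussed in this article; `euro_bytes` is an illustrative name.

```python
euro = "\u20ac"  # U+20AC, decimal 8364

euro_bytes = {}
for codec in ("latin-1", "iso-8859-15", "cp1252", "mac_roman", "cp850", "cp858", "utf-8"):
    try:
        euro_bytes[codec] = euro.encode(codec)
    except UnicodeEncodeError:
        euro_bytes[codec] = None  # this character set has no euro sign

for codec, encoded in euro_bytes.items():
    print(f"{codec:12} {encoded.hex() if encoded else 'not representable'}")
```

Every legacy set that has the euro sign puts it at a different byte value, so the byte alone never identifies the character without knowing the encoding.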

Comparison table

Code points U+0000 to U+007F are not shown in this table, as they are mapped directly in all character sets listed here. The ASCII standard defines the original mapping of these first 128 characters (0 to 127).
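As a rough check of the shared lower range, the sketch below decodes byte values 0 to 127 with Python's codecs for the ASCII-based sets and compares each result with plain ASCII. For the EBCDIC code pages, the sketch assumes that "directly mapped" means the characters are all present, since they sit at different byte values there. The codec lists are selections made for this example.

```python
ascii_bytes = bytes(range(128))
ascii_text = ascii_bytes.decode("ascii")

# ASCII-based sets: the lower 128 byte values should decode identically.
ascii_based = ("latin-1", "iso-8859-15", "cp1252", "cp437", "cp850", "cp858", "mac_roman")
same_lower_half = {codec: ascii_bytes.decode(codec) == ascii_text for codec in ascii_based}

# EBCDIC sets: same characters, different byte values, so test encodability only.
printable_ascii = "".join(chr(cp) for cp in range(0x20, 0x7F))
ebcdic_lengths = {codec: len(printable_ascii.encode(codec)) for codec in ("cp037", "cp500")}

print(same_lower_half)
print(ebcdic_lengths)
```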

The table is arranged by Unicode code point. Character sets are referred to here by their IANA names in upper case.

In addition, Macintosh assigns the Apple logo ⟨⟩ (Mac OS Roman: F0) to U+F8FF in the Private Use Area.

References

Western Latin character sets (computing) Wikipedia

(Text) CC BY-SA
