Puneet Varma

Vietnamese language and computers


The Vietnamese language is written with a complex Latin alphabet that requires various accommodations in computing. Historically, Vietnamese was written in a much more complex logographic script, chữ Nôm, which does not yet enjoy full computer support.


Vietnamese alphabet

There are as many as 46 character encodings for representing the Vietnamese alphabet. Unicode has become the most popular, due to its superior compatibility and software support. Diacritics may be encoded either as combining characters or as precomposed characters, which are scattered among the Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Latin Extended Additional blocks. The Vietnamese đồng symbol is encoded in the Currency Symbols block. The Middle Vietnamese letter B with flourish (ꞗ) is included in the Latin Extended-D block. The apex is not included in Unicode, but U+1DC4 ◌᷄ Combining macron-acute may serve as a rough approximation.
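The difference between combining and precomposed encodings can be illustrated with Python's standard unicodedata module. The following is a minimal sketch; the helper name same_text is chosen here for illustration and is not part of any library.

```python
import unicodedata

precomposed = "\u1ebf"        # the letter e with circumflex and acute as one code point
decomposed = "e\u0302\u0301"  # e + combining circumflex + combining acute

# NFC prefers precomposed characters; NFD splits them into base + marks.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

def same_text(a, b):
    """Compare two strings regardless of which encoding style they use."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The dong symbol lives in the Currency Symbols block at U+20AB.
dong_name = unicodedata.name("\u20ab")
```

Comparing or searching Vietnamese text without normalizing first can miss matches, because the two forms look identical on screen but differ code point by code point.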

Early versions of Unicode assigned the characters U+0340 ◌̀ Combining grave tone mark and U+0341 ◌́ Combining acute tone mark for the purpose of placing these marks beside a circumflex, as is common in Vietnamese typography. These two characters have been deprecated; U+0301 ◌́ Combining acute accent and U+0300 ◌̀ Combining grave accent are now used regardless of any present circumflex.
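In practice, the deprecated tone marks carry canonical decompositions to the ordinary accents, so Unicode normalization replaces them automatically. A short sketch using Python's unicodedata module (the function name modernize is an assumption made for this example):

```python
import unicodedata

def modernize(text):
    """Normalize text; U+0340 and U+0341 become U+0300 and U+0301."""
    return unicodedata.normalize("NFC", text)

# Each deprecated mark decomposes canonically to its modern counterpart.
grave_mapping = unicodedata.decomposition("\u0340")  # hex code point of the replacement
acute_mapping = unicodedata.decomposition("\u0341")

old_style = "a\u0302\u0341"  # a + circumflex + deprecated acute tone mark
new_style = "a\u0302\u0301"  # a + circumflex + combining acute accent
```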

For systems that lack support for Unicode, dozens of 8-bit Vietnamese code pages are available. The most common are VISCII, VSCII (TCVN 5712:1993), VPS, and Windows-1258. Where ASCII is required, such as when ensuring readability in plain text e-mail, Vietnamese letters are often encoded according to Vietnamese Quoted-Readable (VIQR) or VSCII Mnemonic (VSCII-MNEM), though usage of either variable-width scheme has declined dramatically following the adoption of Unicode on the World Wide Web.
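As an illustration of how an ASCII scheme such as VIQR represents Vietnamese letters, the sketch below converts Unicode text to VIQR by decomposing each letter and replacing its marks with punctuation. It is a simplified example, not a complete implementation: the function name to_viqr is hypothetical, and VIQR's backslash escaping of literal punctuation is not handled.

```python
import unicodedata

# Combining marks mapped to their VIQR punctuation.
MODIFIERS = {"\u0306": "(", "\u0302": "^", "\u031b": "+"}  # breve, circumflex, horn
TONES = {"\u0301": "'", "\u0300": "`", "\u0309": "?", "\u0303": "~", "\u0323": "."}
D_STROKE = {"\u0111": "dd", "\u0110": "DD"}  # d with stroke has no decomposition

def to_viqr(text):
    """Convert Vietnamese text to VIQR (simplified, no escaping)."""
    out = []
    pending_tone = ""
    for ch in unicodedata.normalize("NFD", text):
        if ch in MODIFIERS:
            out.append(MODIFIERS[ch])
        elif ch in TONES:
            # Hold the tone so it is written after any vowel modifier.
            pending_tone = TONES[ch]
        else:
            out.append(pending_tone)
            pending_tone = ""
            out.append(D_STROKE.get(ch, ch))
    out.append(pending_tone)
    return "".join(out)
```

The tone mark is held back until the next base letter because canonical ordering places the dot below before a circumflex or breve, whereas VIQR conventionally writes the vowel modifier first.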

Many Vietnamese fonts intended for desktop publishing are encoded in VNI or TCVN3. Such fonts are known as "ABC fonts". Popular Web browsers lack support for specialty Vietnamese encodings, so any webpage that uses these fonts appears as unintelligible mojibake on systems without them installed.

Vietnamese frequently stacks diacritics, so typeface designers must take care to prevent stacked diacritics from colliding with adjacent letters or lines. In advertising signage and in cursive handwriting, diacritics often take forms unfamiliar to other Latin alphabets. For example, the lowercase letter I retains its tittle in ì, ỉ, ĩ, and í. These nuances are rarely accounted for in computing environments.

Chữ Nôm

Unicode includes over 10,000 nôm characters as part of Unicode's repertoire of CJK Unified Ideographs. Of these characters, 10,082 can be found in the CJK Unified Ideographs Extension B block, while the rest are distributed between the CJK Unified Ideographs, CJK Unified Ideographs Extension A, and CJK Unified Ideographs Extension C blocks. A further 1,028 characters, including over 400 characters specific to the Tày language, are encoded in the CJK Unified Ideographs Extension E block. The characters are taken from the Vietnamese standards TCVN 5773:1993 and TCVN 6909:2001, as well as from research by the Han-Nom Research Institute and other groups.
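To see which of these blocks a given character falls in, its code point can be compared against the block ranges. The sketch below hard-codes the ranges of the five blocks named above; the function name cjk_block is chosen for illustration.

```python
# (first code point, last code point, block name)
CJK_BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
]

def cjk_block(ch):
    """Return the name of the CJK block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None
```

Characters in Extension B and later lie outside the Basic Multilingual Plane, so software that assumes 16-bit code units must handle them as surrogate pairs.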

The two most comprehensive nôm fonts are the Vietnamese Nôm Preservation Foundation's Nôm Na Tống Light and the community-developed HAN NOM A/HAN NOM B, both of which place a large number of unstandardized characters in the Private Use Areas.
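Text that relies on Private Use Area code points is only legible with the font it was written for, so it can be useful to detect such characters before exchanging a document. A minimal sketch, using the Unicode general category "Co" (private use); the function name is hypothetical:

```python
import unicodedata

def private_use_chars(text):
    """Return the characters in text that belong to a Private Use Area."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]
```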

The Unicode Consortium's Unihan database includes Vietnamese readings of some characters but does not distinguish between Sino-Vietnamese and nôm readings.

Like other CJKV writing systems, chữ Nôm is traditionally written vertically, from top to bottom and right to left.

Text input

A purely physical Vietnamese keyboard would be impractical, due to the sheer number of letter-diacritic-diacritic combinations in the alphabet. Instead, Vietnamese input relies on software-based keyboard layouts, virtual keyboards, or input methods (also known as IMEs).

Keyboard layouts

Vietnamese keyboard layouts rely on dead keys to compose letters with diacritics. Most desktop operating systems include a Vietnamese keyboard layout similar to TCVN 6064:1995, a Vietnamese national standard.

Input methods

The three common Vietnamese input methods are Telex, VNI, and VIQR. Telex indicates diacritics using letters that are unlikely to appear at the end of a word, while VNI repurposes the number keys or function keys and VIQR repurposes various punctuation marks. The Telex and VIQR conventions originated in an earlier era of telex machines and typewriters, respectively.

Support for these input methods is provided by input method editors (IMEs), which are known in Vietnamese as bộ gõ, literally "pecker". IMEs may be provided by the operating system, installed as a third-party application, installed as a browser extension, or provided by an individual website in the form of a script. Common third-party applications include GoTiengViet, UniKey, VietKey, VPSKeys, WinVNKey, and xvnkb. On Unix-like operating systems, the IBus and SCIM frameworks both support Vietnamese. IME scripts such as AVIM, Mudim, and VietTyping can be found on most Vietnamese message boards, the Vietnamese Wikipedia, and other text-intensive websites. The Vietnamese Web browser Cốc Cốc comes with an input method built in.

Input methods allow words to be composed in a more flexible order than keyboard layouts allow. For example, to enter the word "viết" using the TCVN 6064:1995 keyboard layout, one must type VI38T, in that order. By contrast, most IMEs permit the user to insert diacritics at the end of the word: VIEETS in Telex, VIET61 in VNI, or VIET^' in VIQR. Some IMEs even allow diacritics to be entered before their base letters. Depending on an IME's implementation, it may also be possible to edit an existing word's diacritics without retyping the word.
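The three end-of-word sequences above can be reproduced with a toy composer. This is a rough sketch covering only the keys needed for "viết", under the simplifying assumption that each mark attaches to the last vowel typed so far; real IMEs implement full key tables and proper tone-placement rules. All names below are chosen for this example.

```python
import unicodedata

CIRCUMFLEX, ACUTE = "\u0302", "\u0301"

def last_vowel(word):
    """Index of the last vowel in word, ignoring diacritics, or -1."""
    for i in range(len(word) - 1, -1, -1):
        if unicodedata.normalize("NFD", word[i])[0] in "aeiouy":
            return i
    return -1

def add_mark(word, mark):
    """Attach a combining mark to the last vowel (simplified placement)."""
    i = last_vowel(word)
    return word[:i] + unicodedata.normalize("NFC", word[i] + mark) + word[i + 1:]

def compose(keys, marks, doubled=""):
    """Turn a keystroke sequence into a word.

    marks maps mark keys to combining characters; doubled lists letters
    that gain a circumflex when typed twice in a row (Telex style).
    """
    word = ""
    for key in keys.lower():
        if key in doubled and word.endswith(key):
            word = add_mark(word, CIRCUMFLEX)
        elif key in marks and last_vowel(word) >= 0:
            word = add_mark(word, marks[key])
        else:
            word += key
    return word

telex = compose("VIEETS", {"s": ACUTE}, doubled="aeo")
vni = compose("VIET61", {"6": CIRCUMFLEX, "1": ACUTE})
viqr = compose("VIET^'", {"^": CIRCUMFLEX, "'": ACUTE})
```

All three calls produce the same lowercase word, which shows that the schemes differ only in which keys stand for each diacritic.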

Borrowing a feature common among Chinese input methods, some Vietnamese IMEs allow one to skip diacritics altogether. Instead, after typing the base letters, the user selects the accented word from a candidate list. In order to provide this autocomplete list, the IME may need to communicate with a Web service. Some IMEs also use candidate lists to allow the user to convert text from the Vietnamese alphabet to chữ Nôm, because there is no one-to-one correspondence between alphabetic words and nôm characters.
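A candidate list of this kind can be built by indexing a word list under its diacritic-free spelling. The sketch below uses a tiny hard-coded word list that is purely illustrative; a real IME would draw on a dictionary or language model, possibly via a Web service as noted above.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    """Remove all combining marks and map d with stroke to plain d."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return bare.replace("\u0111", "d").replace("\u0110", "D")

# Hypothetical sample vocabulary, for illustration only.
WORDS = ["vi\u1ebft", "vi\u1ec7t", "vi\u1ec7n", "ma", "m\u00e1", "m\u00e0", "\u0111i"]

# Map each bare spelling to the accented words that share it.
INDEX = defaultdict(list)
for w in WORDS:
    INDEX[strip_diacritics(w)].append(unicodedata.normalize("NFC", w))

def candidates(typed):
    """Return accented words matching the unaccented input."""
    return INDEX.get(typed.lower(), [])
```

For example, candidates("viet") returns the two sample words spelled with those base letters, from which the user would pick one.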

Other considerations

Typical Vietnamese text contains a high proportion of compound words. Compound words are never hyphenated in contemporary usage, so spell checkers are limited to checking individual syllables unless a statistical language model is consulted.

Vietnamese has rigid spelling rules and few exceptions, so text-to-speech engines may avoid dictionary lookups except when encountering a foreign loan word. TTS engines must account for tones, which are essential to the meaning of any Vietnamese word.

References

Vietnamese language and computers Wikipedia

