Coding

SAMPA's coding principles involve restricting the available ASCII codes to the range 32-127. At the time SAMPA was formulated, many computers used only the 7-bit ASCII character set. With the spread of PCs and compatibles, the "extended ASCII" (8-bit) set has become familiar, allowing codes in the range 128-255. Has the decision to restrict SAMPA to the range 32-127 proved wise? Or should we now relax it?
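The restriction to the range 32-127 can be checked mechanically. The following is a minimal sketch; the function name `is_sampa_safe` is hypothetical and is not part of any SAMPA specification.

```python
def is_sampa_safe(text: str) -> bool:
    """Return True if every character's code lies in the range 32-127."""
    return all(32 <= ord(ch) <= 127 for ch in text)


# "D{t" is a transcription that uses only 7-bit codes; "ðæt" does not.
print(is_sampa_safe("D{t"))
print(is_sampa_safe("ðæt"))
```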

In the (American) English extended ASCII character set used by PCs running MS-DOS, the range 128-255 provides, for screen and printer, a number of accented letters, currency symbols, graphic symbols, and Greek and mathematical symbols. Those not available on the keyboard can be entered by typing their code number on the numeric keypad while holding down the Alt key. Unfortunately, from the point of view of non-English-speaking Europeans, this extended ASCII fails to provide all the accented Latin letters needed for such languages as Portuguese, Icelandic, Czech, Polish and Hungarian. To remedy this shortcoming, a number of different "code pages" are now available, each providing a different set of characters in the 128-255 range. In the USA and the UK most PCs use code page 437 (International English), in Western Europe 850 (Multilingual Latin I), and in much of Eastern Europe 852 (Slavic Latin II).

Applications running under the popular front-end Windows use yet another character set, known as "enhanced ANSI". This is identical with the ASCII set for 32-127; for 128-255 it offers its own choice of accented alphabetic and other characters, with codes different from those of the extended ASCII code pages.

The consequence is that in PC-compatible computing the code numbers in the range 128-255 (the "extended" characters) may currently have several different interpretations. Conversely, a given character may be coded in several different ways.

Consider the IPA symbols /æ/ and /ð/, both needed for the phonetic transcription of English. For reasons that seemed valid at the time (cf. Wells (1987: 95)), SAMPA assigned the former the code 123, which now appears on all Latin-alphabet PC screens as "{"; the latter was coded 68, "D".
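These two assignments can be written down as a lookup table. The sketch below covers only the two symbols discussed here, not the full SAMPA inventory; the names `SAMPA_TO_IPA` and `sampa_to_ipa` are hypothetical.

```python
# Only the two symbols discussed in the text: code 123 "{" for æ, code 68 "D" for ð.
SAMPA_TO_IPA = {"{": "æ", "D": "ð"}


def sampa_to_ipa(text: str) -> str:
    """Replace each character that has a table entry; leave all others unchanged."""
    return "".join(SAMPA_TO_IPA.get(ch, ch) for ch in text)


for sampa_char, ipa_char in SAMPA_TO_IPA.items():
    print(ord(sampa_char), sampa_char, ipa_char)

print(sampa_to_ipa("D{t"))
```

Because both codes lie below 128, a transcription such as `D{t` survives unchanged on any system that handles 7-bit ASCII.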

Both "æ" and "ð" are now available on-screen for PCs running Windows. While "æ" is an extended ASCII character, with the code 145 (for those using code page 437 or 850), "ð" is not. But both are in the enhanced ANSI set, with codes 230 and 240 respectively. (Hence under Windows they can be entered, if not directly from the keyboard, by keying Alt+0230 and Alt+0240; "æ" can also be entered as Alt+145.)
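The codes quoted above can be reproduced with Python's built-in codecs. This is a sketch under one assumption: the codec `cp1252` is taken here to stand in for the Windows character set that the text calls enhanced ANSI. The helper name `code_in` is hypothetical.

```python
from typing import Optional


def code_in(char: str, encoding: str) -> Optional[int]:
    """Return the single-byte code of char in the encoding, or None if it has none."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        return None
    return encoded[0]


for char in ("æ", "ð"):
    for encoding in ("cp437", "cp1252"):
        print(char, encoding, code_in(char, encoding))
```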

However, a PC using code page 852 (Slavic) will display code 145 as an upper-case L with acute accent (Ĺ), 230 as "Š", and 240 as "¯". With code page 860 (Portugal), 145 is "À", 230 "µ" and 240 "≡".
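The divergence between code pages can be inspected by decoding the same byte under each one. The sketch below uses Python's standard codecs; the names `CODE_PAGES` and `interpretations` are hypothetical, and `cp1252` is again assumed to correspond to the Windows set.

```python
CODE_PAGES = ("cp437", "cp852", "cp860", "cp1252")


def interpretations(code: int) -> dict:
    """Map each code page name to the character it assigns to the given code.

    Note: cp1252 leaves a few codes undefined, and decoding one of those raises
    UnicodeDecodeError. The three codes used below are defined in all four sets.
    """
    return {cp: bytes([code]).decode(cp) for cp in CODE_PAGES}


for code in (145, 230, 240):
    # repr() makes non-printing characters visible in the output.
    print(code, {cp: repr(ch) for cp, ch in interpretations(code).items()})
```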

Recently a number of phonetic fonts have become available for use under Windows. These comprise only phonetic symbols (perhaps with a few punctuation signs). Unfortunately they disagree extensively on key assignment and coding. On my PC I now have three TrueType phonetic fonts provided by the Summer Institute of Linguistics and four others of whose origins, I regret to say, I am uncertain. These fonts agree with SAMPA (but not ANSI) in assigning "ð" to code 68/D; but for "æ" they assign codes and keystrokes 81/Q (SIL Doulos/Manuscript/Sophia IPA), 60/< (Times IPA New), 64/@ (Tech Phonetic), and 233 (IPA Roman 1, IPA Plus).
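One practical consequence is that text keyed in one of these fonts has to be recoded before it can be read in another, or as SAMPA. The sketch below handles only the single symbol "æ", using the codes listed above. The table and function names are hypothetical, and the three SIL fonts are written out as separate keys on the assumption that the slash-separated names above stand for three fonts.

```python
# Codes for "æ" as listed in the text; the font names serve only as dictionary keys.
AE_CODE_BY_FONT = {
    "SIL Doulos IPA": 81,
    "SIL Manuscript IPA": 81,
    "SIL Sophia IPA": 81,
    "Times IPA New": 60,
    "Tech Phonetic": 64,
    "IPA Roman 1": 233,
    "IPA Plus": 233,
}
SAMPA_AE = 123  # "{"


def ae_to_sampa(data: bytes, font: str) -> bytes:
    """Recode the font's byte for "æ" to SAMPA code 123; other bytes pass through.

    This illustrates one symbol only. A real converter would need a full table
    for every font, since the other codes also differ from SAMPA.
    """
    font_code = AE_CODE_BY_FONT[font]
    return bytes(SAMPA_AE if b == font_code else b for b in data)


print(ae_to_sampa(b"DQt", "SIL Doulos IPA"))
print(ae_to_sampa(b"D@t", "Tech Phonetic"))
```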

EAGLES SWLG SoftEdition, May 1997.