WebSVN – planix.SVN – Blame – /os/trunk/sys/doc/utf.ms

Rev	Author	Line No.	Line
2	-	1	`.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界`
		2	`.TL`
		3	`Hello World`
		4	`.br`
		5	`or`
		6	`.br`
		7	`.ft R`
		8	`Καλημέρα κόσμε`
		9	`.ft`
		10	`.br`
		11	`or`
		12	`.br`
		13	`\f(Jpこんにちは 世界\fP`
		14	`.AU`
		15	`Rob Pike`
		16	`Ken Thompson`
		17	`.sp`
		18	`rob,ken@plan9.bell-labs.com`
		19	`.AB`
		20	`.FS`
		21	`Originally appeared, in a slightly different form, in`
		22	`.I`
		23	`Proc. of the Winter 1993 USENIX Conf.,`
		24	`.R`
		25	`pp. 43-50,`
		26	`San Diego.`
		27	`It has been revised to reflect the move to 21-bit Unicode.`
		28	`.FE`
		29	`Plan 9 from Bell Labs has recently been converted from ASCII`
		30	`to an ASCII-compatible variant of the Unicode Standard,`
		31	`a 16-bit (now 21-bit) character set.`
		32	`In this paper we explain the reasons for the change,`
		33	`describe the character set and representation we chose,`
		34	`and present the programming models and software changes`
		35	`that support the new text format.`
		36	`Although we stopped short of full internationalization\(emfor`
		37	`example, system error messages are in Unixese, not Japanese\(emwe`
		38	`believe Plan 9 is the first system to treat the representation`
		39	`of all major languages on a uniform, equal footing throughout all its`
		40	`software.`
		41	`.AE`
		42	`.SH`
		43	`Introduction`
		44	`.PP`
		45	`The world is multilingual but most computer systems`
		46	`are based on English and ASCII.`
		47	`The first release of Plan 9 [Pike90], a new distributed operating`
		48	`system from Bell Laboratories, seemed a good occasion`
		49	`to correct this chauvinism.`
		50	`It is easier to make such deep changes when building new systems than`
		51	`when refitting old ones.`
		52	`.PP`
		53	`The ANSI C standard [ANSIC] contains some guidance on the matter of`
		54	`wide' and `multi-byte' characters but falls far short of
		55	`solving the myriad associated problems.`
		56	`We could find no literature on how to convert a`
		57	`.I system`
		58	`to larger character sets, although some individual`
		59	`.I programs`
		60	`had been converted.`
		61	`This paper reports what we discovered as we`
		62	`explored the problem of representing multilingual`
		63	`text at all levels of an operating system,`
		64	`from the file system and kernel through`
		65	`the applications and up to the window system`
		66	`and display.`
		67	`.PP`
		68	Plan 9 has not been `internationalized':
		69	`its manuals are in English,`
		70	`its error messages are in English,`
		71	`and it can display text that goes from left to right only.`
		72	`But before we can address these other problems,`
		73	`we need to handle, uniformly and comfortably,`
		74	`the textual representation of all the major written languages.`
		75	`That subproblem is richer than we had anticipated.`
		76	`.SH`
		77	`Standards`
		78	`.PP`
		79	`Our first step was to select a standard.`
		80	`At the time (January 1992),`
		81	`there were only two viable options:`
		82	`ISO 10646 [ISO10646] and Unicode [Unicode].`
		83	`The documents describing both proposals were still in the draft stage.`
		84	`.PP`
		85	`The draft of ISO 10646 was not`
		86	`very attractive to us.`
		87	`It defined a sparse set of 32-bit characters,`
		88	`which would be`
		89	`hard to implement`
		90	`and have punitive storage requirements.`
		91	`Also, the draft attempted to`
		92	`mollify national interests by allocating`
		93	`16-bit subspaces to national committees`
		94	`to partition individually.`
		95	`The suggested mode of use was to`
		96	``flip'' between separate national
		97	`standards to implement the international standard.`
		98	`This did not strike us as a sound basis for a character set.`
		99	`As well, transmitting 32-bit values in a byte stream,`
		100	`such as in pipes, would be expensive and hard to implement.`
		101	`Since the standard does not define a byte order for such`
		102	`transmission, the byte stream would also have to carry`
		103	`state to enable the values to be recovered.`
		104	`.PP`
		105	`The Unicode Standard is a proposal by a consortium of mostly American`
		106	`computer companies formed`
		107	`to protest the technical`
		108	`failings of ISO 10646.`
		109	`It defines a uniform 16-bit code based on the`
		110	`principle of unification:`
		111	`two characters are the same if they look the`
		112	`same even though they are from different`
		113	`languages.`
		114	`This principle, called Han unification,`
		115	`allows the large Japanese, Chinese, and Korean`
		116	`character sets to be packed comfortably into a 16-bit representation.`
		117	`.PP`
		118	`We chose the Unicode Standard for its technical merits and because its`
		119	`code space was better defined.`
		120	`Moreover,`
		121	`the Unicode Consortium was derailing the`
		122	`ISO 10646 standard.`
		123	`(Now, in 1995,`
		124	`ISO 10646 is a standard`
		125	`with one 16-bit group defined,`
		126	`which is almost exactly the Unicode Standard.`
		127	`As most people expected, the two standards bodies`
		128	`reached a détente and`
		129	`ISO 10646 and Unicode represent the same character set.)`
		130	`.PP`
		131	`The Unicode Standard defines an adequate character set`
		132	`but an unreasonable representation.`
		133	`It states that all characters`
		134	`are 16 bits wide and are communicated and stored in`
		135	`16-bit units.`
		136	`It also reserves a pair of characters`
		137	`(hexadecimal FFFE and FEFF) to detect byte order`
		138	`in transmitted text, requiring state in the byte stream.`
		139	`(The Unicode Consortium was thinking of files, not pipes.)`
		140	`To adopt this encoding,`
		141	`we would have had to convert all text going`
		142	`into and out of Plan 9 between ASCII and Unicode, which cannot be done.`
		143	`Within a single program, in command of all its input and output,`
		144	`it is possible to define characters as 16-bit quantities;`
		145	`in the context of a networked system with`
		146	`hundreds of applications on diverse machines`
		147	`by different manufacturers,`
		148	`it is impossible.`
		149	`.PP`
		150	`We needed a way to adapt the Unicode Standard to the tools-and-pipes`
		151	`model of text processing embodied by the Unix system.`
		152	`To do that, we`
		153	`needed an ASCII-compatible textual`
		154	`representation of Unicode characters for transmission`
		155	`and storage.`
		156	`In the draft ISO standard there was an informative`
		157	`(non-required)`
		158	`Annex`
		159	`called UTF`
		160	`that provided a byte stream encoding`
		161	`of the 32-bit ISO code.`
		162	`The encoding uses multibyte sequences composed`
		163	`from the 190 printable characters of Latin-1`
		164	`to represent character values larger`
		165	`than 159.`
		166	`.PP`
		167	`The UTF encoding has several good properties.`
		168	`By far the most important is that`
		169	`a byte in the ASCII range 0-127 represents`
		170	`itself in UTF.`
		171	`Thus UTF is backward compatible with ASCII.`
		172	`.PP`
		173	`UTF has other advantages.`
		174	`It is a byte encoding and is`
		175	`therefore byte-order independent.`
		176	`ASCII control characters appear in the byte stream`
		177	`only as themselves, never as an element of a sequence`
		178	`encoding another character,`
		179	`so newline bytes separate lines of UTF text.`
		180	`Finally, ANSI C's`
		181	`.CW strcmp`
		182	`function applied to UTF strings preserves the ordering of Unicode characters.`
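The ordering claim can be checked in portable C. The following is a minimal sketch, not Plan 9 code: the helper `utf_order_matches` and the three constants are hypothetical names, and the strings are spelled with hex escapes so that the example does not depend on the encoding of the source file. It relies on the C rule that `strcmp` compares bytes as `unsigned char`.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 encodings, written as hex escapes so the source stays ASCII. */
const char LATIN_Z[]     = "z";             /* U+007A */
const char GREEK_ALPHA[] = "\xce\xb1";      /* U+03B1 */
const char HAN_WORLD[]   = "\xe4\xb8\x96";  /* U+4E16 */

/* Reduce a comparison result to -1, 0, or 1. */
static int sign(int x)
{
    return (x > 0) - (x < 0);
}

/*
 * Hypothetical helper: returns 1 when strcmp orders two UTF-8 strings
 * the same way their first code points (ra, rb) are ordered.
 */
int utf_order_matches(const char *a, unsigned long ra,
                      const char *b, unsigned long rb)
{
    assert(a != NULL && b != NULL);
    return sign(strcmp(a, b)) == ((ra > rb) - (ra < rb));
}
```

For example, `utf_order_matches(LATIN_Z, 0x7A, GREEK_ALPHA, 0x3B1)` holds because the lead byte 0xCE of the two-byte sequence is larger than any ASCII byte.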
		183	`.PP`
		184	`To encode and decode UTF is expensive (involving multiplication,`
		185	`division, and modulo operations) but workable.`
		186	`UTF's major disadvantage is that the encoding`
		187	`is not self-synchronizing.`
		188	`It is in general impossible to find the character`
		189	`boundaries in a UTF string without reading from`
		190	`the beginning of the string, although in practice`
		191	`control characters such as newlines,`
		192	`tabs, and blanks provide synchronization points.`
		193	`.PP`
		194	`In August 1992,`
		195	`X/Open circulated a proposal for another UTF-like`
		196	`byte encoding of Unicode characters.`
		197	`Their major concern was that an embedded character`
		198	`in a file name`
		199	`(in particular a slash)`
		200	`could be part of an escape sequence in UTF and`
		201	`therefore confuse a traditional file system.`
		202	`Their proposal would allow all 7-bit ASCII characters`
		203	`to represent themselves`
		204	`.I "and only themselves"`
		205	`in text.`
		206	`Multibyte sequences would contain only characters`
		207	`with the high bit set.`
		208	`We proposed a modification to the new UTF that`
		209	`would address our synchronization problem.`
		210	`Our proposal, which was originally known informally as UTF-2 and FSS-UTF,`
		211	`is now referred to as UTF-8 and has been approved by ISO to become`
		212	`Annex P to ISO 10646.`
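The synchronization property can be illustrated with a short sketch in portable C. It assumes the standard UTF-8 layout, in which every byte after the first in a multibyte sequence has the bit pattern 10xxxxxx; the helper name `utf8_char_start` is hypothetical and is not part of the Plan 9 library.

```c
#include <assert.h>
#include <stddef.h>

/* In UTF-8, bytes of the form 10xxxxxx only ever continue a sequence. */
static int is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given a buffer and an arbitrary byte offset into
 * it, return the offset of the first byte of the character containing
 * that byte, without scanning from the beginning of the buffer.
 * Assumes buf holds valid UTF-8 and pos < len.
 */
size_t utf8_char_start(const char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos < len);
    while (pos > 0 && is_continuation((unsigned char)buf[pos]))
        pos--;
    return pos;
}
```

Because a sequence is at most four bytes long, the loop backs up over at most three bytes on valid input, however long the string is.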
		213	`.PP`
		214	`The model for text in Plan 9 is chosen from these`
		215	`three standards*:`
		216	`.FS`
		217	* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
		218	`.FE`
		219	`the Unicode character set encoded as a byte stream by`
		220	`UTF-8, from`
		221	`(soon to be) Annex P of ISO 10646.`
		222	`Although this mixture may seem like a precarious position for us to adopt,`
		223	`it is not as bad as it sounds.`
		224	`ISO 10646 and the Unicode Standard have converged,`
		225	`other systems such as Linux have adopted the same character set and encoding,`
		226	`and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way`
		227	`to exchange text between systems.`
		228	`The prognosis for wide acceptance is good.`
		229	`.PP`
		230	`There are a couple of aspects of the Unicode Standard we have not faced.`
		231	`One is the issue of right-to-left text such as Hebrew or Arabic.`
		232	`Since that is an issue of display, not representation, we believe`
		233	`we can defer that problem for the moment without affecting our`
		234	`ability to solve it later.`
		235	Another issue is diacriticals and `combining characters',
		236	`which cause overstriking of multiple Unicode characters.`
		237	`Although necessary for some scripts, such as Thai, Arabic, and Hebrew,`
		238	`such characters confuse the issues for Latin languages because they`
		239	`generate multiple representations for accented characters.`
		240	`ISO 10646 describes three levels of implementation;`
		241	`in Plan 9 we decided not to address the issue.`
		242	`Again, this can be labeled as a display issue and its finer points are still being debated,`
		243	`so we felt comfortable deferring. Mañana.`
		244	`.PP`
		245	`Although we converted Plan 9 in the altruistic interests of`
		246	`serving foreign languages, we have found the large character`
		247	`set attractive for other reasons. The Unicode Standard includes many`
		248	`characters\(emmathematical symbols, scientific notation,`
		249	`more general punctuation, and more\(emthat we now use`
		250	`daily in our work. We no longer test our imaginations`
		251	`to find ways to include non-ASCII symbols in our text;`
		252	`why type`
		253	`.CW :-)`
		254	`when you can use the character ☺?`
		255	`Most compelling is the ability to absorb documents`
		256	`and data that contain non-ASCII characters; our browser for the`
		257	`Oxford English Dictionary`
		258	`lets us see the dictionary as it really is, with pronunciation`
		259	`in the IPA font, foreign phrases properly rendered, and so on,`
		260	`.I "in plain text.`
		261	`.PP`
		262	`As of Unicode 4.0,`
		263	`characters are now 21 bits wide and the longest UTF-8 encoding of a character`
		264	`requires 4 bytes.`
		265	`We are adapting the system to match.`
		266	`.PP`
		267	`In the rest of this paper, except when`
		268	stated otherwise, the term `UTF' refers to the UTF-8 encoding
		269	`of Unicode characters as adopted by Plan 9.`
		270	`.SH`
		271	`C Compiler`
		272	`.PP`
		273	`The first program to be converted to UTF`
		274	`was the C Compiler.`
		275	`There are two levels of conversion.`
		276	`On the syntactic level,`
		277	`input to the C compiler`
		278	`is UTF; on the semantic level,`
		279	`the C language needs to define`
		280	`how compiled programs manipulate`
		281	`the UTF set.`
		282	`.PP`
		283	`The syntactic part is simple.`
		284	`The ANSI C language standard defines the`
		285	`source character set to be ASCII.`
		286	`Since UTF is backward compatible with ASCII,`
		287	`the compiler needs little change.`
		288	`The only places where a larger character set`
		289	`is allowed are in character constants, strings, and comments.`
		290	`Since 7-bit ASCII characters can represent only`
		291	`themselves in UTF,`
		292	`the compiler does not have to be careful while looking`
		293	`for the termination of a string or comment.`
		294	`.PP`
		295	`The Plan 9 compiler extends ANSI C to treat any Unicode`
		296	`character with a value outside of the ASCII range as`
		297	`an alphabetic.`
		298	`To a Greek programmer or an English mathematician,`
		299	`α is a sensible and now valid variable name.`
		300	`.PP`
		301	`On the semantic level, ANSI C allows,`
		302	`but does not tie down,`
		303	`the notion of a`
		304	`.I "wide character`
		305	`and admits string and character constants`
		306	`of this type.`
		307	`We chose the wide character type to be`
		308	`.CW unsigned`
		309	`.CW short`
		310	`(now`
		311	`.CW unsigned`
		312	`.CW long ).`
		313	`In the libraries, the word`
		314	`.CW Rune`
		315	`is now defined by a`
		316	`.CW typedef`
		317	`to be equivalent to`
		318	`.CW unsigned`
		319	`.CW long`
		320	`and is`
		321	`used to signify a Unicode character.`
		322	`.PP`
		323	`There are surprises; for example:`
		324	`.P1`
		325	`L'x' \f1is 120\fP`
		326	`\&'x' \f1is 120\fP`
		327	`L'ÿ' \f1is 255\fP`
		328	`\&'ÿ' \f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP`
		329	`L'\f1α\fP' \f1is 945\fP`
		330	`\&'\f1α\fP' \f1is illegal\fP`
		331	`.P2`
		332	`In the string constants,`
		333	`.P1`
		334	`"\f(Jpこんにちは 世界\fP"`
		335	`L"\f(Jpこんにちは 世界\fP",`
		336	`.P2`
		337	`the former is an array of`
		338	`.CW chars`
		339	`with 22 elements`
		340	`and a null byte,`
		341	`while the latter is an array of`
		342	`.CW unsigned`
		343	`.CW long s`
		344	`.CW Runes ) (`
		345	`with 8 elements and a null`
		346	`.CW Rune .`
		347	`.PP`
		348	`The Plan 9 library provides an output conversion function,`
		349	`.CW print`
		350	`(analogous to`
		351	`.CW printf ),`
		352	`with formats`
		353	`.CW %c ,`
		354	`.CW %C ,`
		355	`.CW %s ,`
		356	`and`
		357	`.CW %S .`
		358	`Since`
		359	`.CW print`
		360	`produces text, its output is always UTF.`
		361	`The character conversion`
		362	`.CW %c`
		363	`(lower case) masks its argument`
		364	`to 8 bits before converting to UTF.`
		365	`Thus`
		366	`.CW L'ÿ'`
		367	`and`
		368	`.CW 'ÿ'`
		369	`printed under`
		370	`.CW %c`
		371	`will be identical,`
		372	`but`
		373	`.CW L'\f1α\fP'`
		374	`will print as the Unicode`
		375	`character with decimal value 177.`
		376	`The character conversion`
		377	`.CW %C`
		378	`(upper case) masks its argument`
		379	`to 16 bits before converting to UTF.`
		380	`Thus`
		381	`.CW L'ÿ'`
		382	`and`
		383	`.CW L'\f1α\fP'`
		384	`will print correctly under`
		385	`.CW %C ,`
		386	`but`
		387	`.CW 'ÿ'`
		388	`will not.`
		389	`The conversion`
		390	`.CW %s`
		391	`(lower case)`
		392	`expects a pointer to`
		393	`.CW char`
		394	`and copies UTF sequences up to a null byte.`
		395	`The conversion`
		396	`.CW %S`
		397	`(upper case) expects a pointer to`
		398	`.CW Rune`
		399	`and`
		400	`performs sequential`
		401	`.CW %C`
		402	`conversions until a null`
		403	`.CW Rune`
		404	`is encountered.`
		405	`.PP`
		406	`Another problem in format conversion`
		407	`is the definition of`
		408	`.CW %10s :`
		409	`does the number refer to bytes or characters?`
		410	`We decided that such formats were most`
		411	`often used to align output columns and`
		412	`so made the number count characters.`
		413	`Some programs, however, use the count`
		414	`to place blank-padded strings`
		415	`in fixed-sized arrays.`
		416	`These programs must be found and corrected.`
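The difference between counting bytes and counting characters can be sketched in portable C. This is an illustration only, not the implementation of `print`: the helper names `utf8_count` and `pad_left_chars` are hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count characters in a valid UTF-8 string by skipping continuation bytes. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/*
 * Hypothetical helper: right-justify s in a field of `width` characters
 * (not bytes), writing the result into out. Returns the number of bytes
 * written, excluding the terminating null, or 0 if out is too small.
 */
size_t pad_left_chars(char *out, size_t outsize, const char *s, size_t width)
{
    size_t chars = utf8_count(s);
    size_t bytes = strlen(s);
    size_t pad = chars < width ? width - chars : 0;

    if (pad + bytes + 1 > outsize)
        return 0;
    memset(out, ' ', pad);
    memcpy(out + pad, s, bytes + 1);    /* copies the null byte too */
    return pad + bytes;
}
```

A two-character Greek string padded to width 10 occupies 12 bytes, which is why a program that sizes a fixed array from the field width alone can overflow it.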
		417	`.PP`
		418	`Here is a complete example:`
		419	`.P1`
		420	`#include <u.h>`
			`#include <libc.h>`
		421
		422	`char c[] = "\f(Jpこんにちは 世界\fP";	/* UTF bytes plus a null byte */`
		423	`Rune s[] = L"\f(Jpこんにちは 世界\fP";	/* Runes plus a null Rune */`
		424
			`void`
		425	`main(void)`
		426	`{`
		427	`	print("%d, %d\en", sizeof(c), sizeof(s));`
		428	`	print("%s\en", c);`
		429	`	print("%S\en", s);`
			`	exits(nil);`
		430	`}`
		431	`.P2`
		432	`.PP`
		433	`This program prints`
		434	`.CW 23,`
		435	`.CW 36`
		436	`and then two identical lines of`
		437	`UTF text.`
		438	`In practice,`
		439	`.CW %S`
		440	`and`
		441	`.CW L"..."`
		442	`are rare in programs; one reason is`
		443	`that most formatted I/O is done in unconverted UTF.`
		444	`.SH`
		445	`Ramifications`
		446	`.PP`
		447	`All programs in Plan 9 now read and write text as UTF, not ASCII.`
		448	`This change breaks two deep-rooted symmetries implicit in most C programs:`
		449	`.IP 1.`
		450	`A character is no longer a`
		451	`.CW char .`
		452	`.IP 2.`
		453	`The internal representation (Rune) of a character now differs from its`
		454	`external representation (UTF).`
		455	`.PP`
		456	`In the sections that follow,`
		457	`we show how these issues were faced in the layers of`
		458	`system software from the operating system up to the applications.`
		459	`The effects are wide-reaching and often surprising.`
		460	`.SH`
		461	`Operating system`
		462	`.PP`
		463	`Since UTF is the only format for text in Plan 9,`
		464	`the interface to the operating system had to be converted to UTF.`
		465	`Text strings cross the interface in several places:`
		466	`command arguments,`
		467	`file names,`
		468	`user names (people can log in using their native name),`
		469	`error messages,`
		470	`and miscellaneous minor places such as commands to the I/O system.`
		471	`Little change was required: null-terminated UTF strings`
		472	`are equivalent to null-terminated ASCII strings for most purposes`
		473	`of the operating system.`
		474	`The library routines described in the next section made that`
		475	`change straightforward.`
		476	`.PP`
		477	`The window system, once called`
		478	`.CW 8.5 ,`
		479	`is now rightfully called`
		480	`.CW 8½ .`
		481	`.SH`
		482	`Libraries`
		483	`.PP`
		484	`A header file included by all programs (see [Pike92]) declares`
		485	`the`
		486	`.CW Rune`
		487	`type to hold 21-bit character values:`
		488	`.P1`
		489	`typedef unsigned long Rune;`
		490	`.P2`
		491	`Also defined are several constants relevant to UTF:`
		492	`.P1`
		493	`enum`
		494	`{`
		495	`UTFmax = 4, /* maximum bytes per rune */`
		496	`Runesync = 0x80, /* bytes below this never occur inside a multibyte sequence */`
		497	`Runeself = 0x80, /* runes below this encode as one byte of the same value */`
		498	`Runeerror = 0xFFFD, /* decoding error in UTF */`
		499	`Runemax = 0x10FFFF, /* largest 21-bit rune */`
		500	`Runemask = 0x1FFFFF, /* bits used by runes (see grep) */`
		501	`};`
		502	`.P2`
		503	`(With the original UTF,`
		504	`.CW Runesync`
		505	`was hexadecimal 21 and`
		506	`.CW Runeself`
		507	`was A0.)`
		508	`.CW UTFmax`
		509	`bytes are sufficient`
		510	`to hold the UTF encoding of any Unicode character.`
		511	`Characters of value less than`
		512	`.CW Runesync`
		513	`only appear in a UTF string as`
		514	`themselves, never as part of a sequence encoding another character.`
		515	`Characters of value less than`
		516	`.CW Runeself`
		517	`encode into single bytes`
		518	`of the same value.`
		519	`Finally, when the library detects errors in UTF input\(embyte sequences`
		520	`that are not valid UTF sequences\(emit converts the first byte of the`
		521	`error sequence to the character`
		522	`.CW Runeerror .`
		523	`There is little a rune-oriented program can do when given bad data`
		524	`except exit, which is unreasonable, or carry on.`
		525	`Originally the conversion routines, described below,`
		526	`returned errors when given invalid UTF,`
		527	`but we found ourselves repeatedly checking for errors and ignoring them.`
		528	`We therefore decided to convert a bad sequence to a valid rune`
		529	`and continue processing.`
		530	`(The ANSI C routines, on the other hand, return errors.)`
		531	`.PP`
		532	`This technique does have the unfortunate property that converting`
		533	`invalid UTF byte strings in and out of runes does not preserve the input,`
		534	`but this circumstance only occurs when non-textual input is`
		535	`given to a textual program.`
		536	`The Unicode Standard defines an error character, value FFFD, to stand for`
		537	`characters from other sets that it does not represent.`
		538	`The`
		539	`.CW Runeerror`
		540	`character, although it now shares that value, is a different concept, related to the encoding rather than the character set.`
		541	`.PP`
		542	`The Plan 9 C library contains a number of routines for`
		543	`manipulating runes.`
		544	`The first set converts between runes and UTF strings:`
		545	`.P1`
		546	`extern int runetochar(char*, Rune*);`
		547	`extern int chartorune(Rune*, char*);`
		548	`extern int runelen(long);`
		549	`extern int fullrune(char*, int);`
		550	`.P2`
		551	`.CW Runetochar`
		552	`translates a single`
		553	`.CW Rune`
		554	`to a UTF sequence and returns the number of bytes produced.`
		555	`.CW Chartorune`
		556	`goes the other way, reporting how many bytes were consumed.`
		557	`.CW Runelen`
		558	`returns the number of bytes in the UTF encoding of a rune.`
		559	`.CW Fullrune`
		560	`examines a UTF string up to a specified number of bytes`
		561	`and reports whether the string begins with a complete UTF encoding.`
		562	`All these routines use the`
		563	`.CW Runeerror`
		564	`character to work around encoding problems.`
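The conversion in both directions, including the policy of turning a bad byte into an error rune and carrying on, can be sketched in portable C. This is an independent illustration of standard UTF-8, not the Plan 9 source: the names `rune_to_utf8`, `utf8_to_rune`, `RUNE_ERROR`, and `RUNE_MAX` are hypothetical, and the sketch passes the rune by value for brevity.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long Rune;     /* matches the typedef shown in the text */

enum {
    RUNE_ERROR = 0xFFFD,        /* same value as Runeerror in the text */
    RUNE_MAX   = 0x10FFFF
};

/* Encode r into s (room for 4 bytes); return the number of bytes written. */
int rune_to_utf8(char *s, Rune r)
{
    unsigned char *out = (unsigned char *)s;

    assert(s != NULL);
    if (r > RUNE_MAX)
        r = RUNE_ERROR;
    if (r < 0x80) {
        out[0] = (unsigned char)r;
        return 1;
    }
    if (r < 0x800) {
        out[0] = (unsigned char)(0xC0 | (r >> 6));
        out[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    if (r < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (r >> 12));
        out[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (r & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (r >> 18));
    out[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (r & 0x3F));
    return 4;
}

/*
 * Decode one character from the null-terminated string s into *r and
 * return the number of bytes consumed. A malformed sequence consumes
 * one byte and yields RUNE_ERROR, so the caller always makes progress.
 */
int utf8_to_rune(Rune *r, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    Rune v, min;
    int i, n;

    assert(r != NULL && s != NULL);
    if (p[0] < 0x80) {
        *r = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0)      { n = 2; v = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; v = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; v = p[0] & 0x07; min = 0x10000; }
    else
        goto bad;                   /* stray continuation or invalid lead byte */

    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)  /* also stops at the null terminator */
            goto bad;
        v = (v << 6) | (p[i] & 0x3F);
    }
    if (v < min || v > RUNE_MAX)    /* overlong form or out of range */
        goto bad;
    *r = v;
    return n;
bad:
    *r = RUNE_ERROR;
    return 1;
}
```

A caller can loop with `s += utf8_to_rune(&r, s)` until it reaches the null byte; since every call consumes at least one byte, the loop terminates even on non-textual input.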
		565	`.PP`
		566	`There is also a set of routines for examining null-terminated UTF strings,`
		567	`based on the model of the ANSI standard`
		568	`.CW str`
		569	`routines, but with`
		570	`.CW utf`
		571	`substituted for`
		572	`.CW str`
		573	`and`
		574	`.CW rune`
		575	`for`
		576	`.CW chr :`
		577	`.P1`
		578	`extern int utflen(char*);`
		579	`extern char* utfrune(char*, long);`
		580	`extern char* utfrrune(char*, long);`
		581	`extern char* utfutf(char*, char*);`
		582	`.P2`
		583	`.CW Utflen`
		584	`returns the number of runes in a UTF string;`
		585	`.CW utfrune`
		586	`returns a pointer to the first occurrence of a rune in a UTF string;`
		587	`and`
		588	`.CW utfrrune`
		589	`a pointer to the last.`
		590	`.CW Utfutf`
		591	`searches for the first occurrence of a UTF string in another UTF string.`
		592	`Given the synchronizing property of UTF-8,`
		593	`.CW utfutf`
		594	`is the same as`
		595	`.CW strstr`
		596	`if the arguments point to valid UTF strings.`
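The equivalence with `strstr` suggests how such searches can be built on the byte-string routines. The following portable C sketch is an illustration under that assumption, not the Plan 9 implementation; `utf8_find` and `utf8_find_rune` are hypothetical stand-ins for `utfutf` and `utfrune`, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode r (at most 0x10FFFF) as a null-terminated UTF-8 sequence. */
static void encode(unsigned long r, unsigned char out[5])
{
    int n = 0;

    if (r < 0x80) {
        out[n++] = (unsigned char)r;
    } else if (r < 0x800) {
        out[n++] = (unsigned char)(0xC0 | (r >> 6));
        out[n++] = (unsigned char)(0x80 | (r & 0x3F));
    } else if (r < 0x10000) {
        out[n++] = (unsigned char)(0xE0 | (r >> 12));
        out[n++] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        out[n++] = (unsigned char)(0x80 | (r & 0x3F));
    } else {
        out[n++] = (unsigned char)(0xF0 | (r >> 18));
        out[n++] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
        out[n++] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        out[n++] = (unsigned char)(0x80 | (r & 0x3F));
    }
    out[n] = '\0';
}

/*
 * Find needle in haystack. No valid UTF-8 sequence can begin in the
 * middle of another, so a byte-wise search cannot produce a false
 * match when both arguments are valid UTF-8.
 */
const char *utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle);
}

/* Find the first occurrence of rune r in s, or NULL if there is none. */
const char *utf8_find_rune(const char *s, unsigned long r)
{
    unsigned char enc[5];

    assert(s != NULL && r <= 0x10FFFF);
    if (r == 0)
        return s + strlen(s);       /* like strchr(s, 0) */
    encode(r, enc);
    return strstr(s, (const char *)enc);
}
```

Searching for a multibyte rune this way finds it only at a character boundary, which is the guarantee a plain `strchr` on one byte of the sequence cannot give.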
		597	`.PP`
		598	`It is a mistake to use`
		599	`.CW strchr`
		600	`or`
		601	`.CW strrchr`
		602	`unless searching for a 7-bit ASCII character, that is, a character`
		603	`less than`
		604	`.CW Runeself .`
		605	`.PP`
		606	`We have no routines for manipulating null-terminated arrays of`
		607	`.CW Runes .`
		608	`Although they should probably exist for completeness, we have`
		609	`found no need for them, for the same reason that`
		610	`.CW %S`
		611	`and`
		612	`.CW L"..."`
		613	`are rarely used.`
		614	`.PP`
		615	`Most Plan 9 programs use a new buffered I/O library, BIO, in place of`
		616	`Standard I/O.`
		617	`BIO contains routines to read and write UTF streams, converting to and from`
		618	`runes.`
		619	`.CW Bgetrune`
		620	`returns, as a`
		621	`.CW Rune`
		622	`within a`
		623	`.CW long ,`
		624	`the next character in the UTF input stream;`
		625	`.CW Bputrune`
		626	`takes a rune and writes its UTF representation.`
		627	`.CW Bungetrune`
		628	`puts a rune back into the input stream for rereading.`
		629	`.PP`
		630	`Plan 9 programs use a simple set of macros to process command line arguments.`
		631	`Converting these macros to UTF automatically updated the`
		632	`argument processing of most programs.`
		633	`In general,`
		634	`argument flag names can no longer be held in bytes and`
		635	`arrays of 256 bytes cannot be used to hold a set of flags.`
		636	`.PP`
		637	`We have done nothing analogous to ANSI C's locales, partly because`
		638	`we do not feel qualified to define locales and partly because we remain`
		639	`unconvinced of that model for dealing with the problems.`
		640	`That is really more an issue of internationalization than conversion`
		641	`to a larger character set; on the other hand,`
		642	`because we have chosen a single character set that encompasses`
		643	`most languages, some of the need for`
		644	`locales is eliminated.`
		645	`(We have a utility,`
		646	`.CW tcs ,`
		647	`that translates between UTF and other character sets.)`
		648	`.PP`
		649	`There are several reasons why our library does not follow the ANSI design`
		650	`for wide and multi-byte characters.`
		651	`The ANSI model was designed by a committee, untried, almost`
		652	`as an afterthought, whereas`
		653	`we wanted to design as we built.`
		654	`(We made several major changes to the interface`
		655	`as we became familiar with the problems involved.)`
		656	`We disagree with ANSI C's handling of invalid multi-byte sequences.`
		657	`Also, the ANSI C library is incomplete:`
		658	`although it contains some crucial routines for handling`
		659	`wide and multi-byte characters, there are some serious omissions.`
		660	`For example, our software can exploit`
		661	`the fact that UTF preserves ASCII characters in the byte stream.`
		662	`We could remove that assumption by replacing all`
		663	`calls to`
		664	`.CW strchr`
		665	`with`
		666	`.CW utfrune`
		667	`and so on.`
		668	`(Because of the weaker properties of the original UTF,`
		669	`we have actually done so.)`
		670	`ANSI C cannot:`
		671	`the standard says nothing about the representation, so portable code should`
		672	`.I never`
		673	`call`
		674	`.CW strchr ,`
		675	`yet there is no ANSI equivalent to`
		676	`.CW utfrune .`
		677	`ANSI C simultaneously invalidates`
		678	`.CW strchr`
		679	`and offers no replacement.`
		680	`.PP`
		681	`Finally, ANSI did nothing to integrate wide characters`
		682	`into the I/O system: it gives no method for printing`
		683	`wide characters.`
		684	`We therefore needed to invent some things and decided to invent`
		685	`everything.`
		686	`In the end, some of our entry points do correspond closely to`
		687	`ANSI routines\(emfor example`
		688	`.CW chartorune`
		689	`and`
		690	`.CW runetochar`
		691	`are similar to`
		692	`.CW mbtowc`
		693	`and`
		694	`.CW wctomb \(embut`
		695	`Plan 9's library defines more functionality, enough`
		696	`to write real applications comfortably.`
		697	`.SH`
		698	`Converting the tools`
		699	`.PP`
		700	`The source for our tools and applications had already been converted to`
		701	work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
		702	`Standard and UTF is more involved.`
		703	`Some programs needed no change at all:`
		704	`.CW cat ,`
		705	`for instance,`
		706	`interprets its argument strings, delivered in UTF,`
		707	`as file names that it passes uninterpreted to the`
		708	`.CW open`
		709	`system call,`
		710	`and then just copies bytes from its input to its output;`
		711	`it never makes decisions based on the values of the bytes.`
		712	`(Plan 9`
		713	`.CW cat`
		714	`has no options such as`
		715	`.CW -v`
		716	`to complicate matters.)`
		717	`Most programs, however, needed modest change.`
		718	`.PP`
		719	`It is difficult to`
		720	`find automatically the places that need attention,`
		721	`but`
		722	`.CW grep`
		723	`helps.`
		724	`Software that uses the libraries conscientiously can be searched`
		725	`for calls to library routines that examine bytes as characters:`
		726	`.CW strchr ,`
		727	`.CW strrchr ,`
		728	`.CW strstr ,`
		729	`etc.`
		730	`Replacing these by calls to`
		731	`.CW utfrune ,`
		732	`.CW utfrrune ,`
		733	`and`
		734	`.CW utfutf`
		735	`is enough to fix many programs.`
		736	`Few tools actually need to operate on runes internally;`
		737	`more typically they need only to look for the final slash in a file`
		738	`name and similar trivial tasks.`
		739	`Of the 170 C source programs in the top levels of`
		740	`.CW /sys/src/cmd ,`
		741	`only 23 now contain the word`
		742	`.CW Rune .`
		743	`.PP`
		744	`The programs that`
		745	`.I do`
		746	`store runes internally`
		747	`are mostly those whose`
		748	`.I raison`
		749	`.I d'être`
		750	`is character manipulation:`
		751	`.CW sam`
		752	`(the text editor),`
		753	`.CW sed ,`
		754	`.CW sort ,`
		755	`.CW tr ,`
		756	`.CW troff ,`
		757	`.CW 8½`
		758	`(the window system and terminal emulator),`
		759	`and so on.`
		760	`To decide whether to compute using runes`
		761	`or UTF-encoded byte strings requires balancing the cost of converting`
		762	`the data when read and written`
		763	`against the cost of converting relevant text on demand.`
		764	`For programs such as editors that run a long time with a relatively`
		765	`constant dataset, runes are the better choice.`
		766	`There are space considerations too, but they are more complicated:`
		767	`plain ASCII text grows when converted to runes; UTF-encoded Japanese`
		768	`shrinks when runes are 16 bits wide.`
		769	`.PP`
		770	`Again, it is hard to automate the conversion of a program from`
		771	`.CW chars`
		772	`to`
		773	`.CW Runes .`
		774	`It is not enough just to change the type of variables; the assumption`
		775	`that bytes and characters are equivalent can be insidious.`
		776	`For instance, to clear a character array by`
		777	`.P1`
		778	`memset(buf, 0, BUFSIZE)`
		779	`.P2`
		780	`becomes wrong if`
		781	`.CW buf`
		782	`is changed from an array of`
		783	`.CW chars`
		784	`to an array of`
		785	`.CW Runes .`
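The reason is that `memset` counts bytes, so a call that clears `BUFSIZE` bytes clears only part of an array of `BUFSIZE` runes. A minimal sketch of the corrected idiom in portable C follows; the helper name `clear_runes` and the value of `BUFSIZE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long Rune;     /* as in the typedef shown in the text */

enum { BUFSIZE = 64 };          /* hypothetical element count */

/* Clear an array of nrunes Runes; the byte count scales with the element size. */
void clear_runes(Rune *buf, size_t nrunes)
{
    assert(buf != NULL);
    memset(buf, 0, nrunes * sizeof buf[0]);
}
```

Writing the size as `nrunes * sizeof buf[0]` keeps the call correct if the element type changes again.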
		786	`Any program that indexes tables based on character values needs`
		787	`rethinking.`
		788	`Consider`
		789	`.CW tr ,`
		790	`which originally used multiple 256-byte arrays for the mapping.`
		791	`The naïve conversion would yield multiple 1,114,112-rune arrays.`
		792	`Instead Plan 9`
		793	`.CW tr`
		794	`saves space by building in effect`
		795	`a run-encoded version of the map.`
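One way to picture a run-encoded map is as a short table of ranges rather than one entry per rune. The sketch below, in portable C, is an assumption made for illustration and is not taken from the source of `tr`; the `Run` structure and `map_rune` function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long Rune;

/* One run: source runes lo..hi map to to, to+1, ..., in order. */
typedef struct {
    Rune lo;
    Rune hi;
    Rune to;
} Run;

/*
 * Translate r through a table of runs. Runes not covered by any run
 * map to themselves. The table layout is an assumption for illustration.
 */
Rune map_rune(const Run *runs, size_t nruns, Rune r)
{
    size_t i;

    assert(runs != NULL || nruns == 0);
    for (i = 0; i < nruns; i++)
        if (r >= runs[i].lo && r <= runs[i].hi)
            return runs[i].to + (r - runs[i].lo);
    return r;
}
```

A mapping such as lower-case to upper-case for both ASCII letters and part of the Greek alphabet then needs two table entries instead of an array with one slot for each of the 1,114,112 possible runes.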
		796	`.PP`
		797	`.CW Sort`
		798	`has related problems.`
		799	`The cooperation of UTF and`
		800	`.CW strcmp`
		801	`means that a simple sort\(emone with no options\(emcan be done`
		802	`on the original UTF strings using`
		803	`.CW strcmp .`
		804	`With sorting options enabled, however,`
		805	`.CW sort`
		806	`may need to convert its input to runes: for example,`
		807	`option`
		808	`.CW -t\f1α\fP`
		809	`requires searching for alphas in the input text to`
		810	`crack the input into fields.`
		811	`The field specifier`
		812	`.CW +3.2`
		813	`refers to 2 runes beyond the third field.`
		814	`Some of the other options are hopelessly provincial:`
		815	`consider the case-folding and dictionary order options`
		816	`(Japanese doesn't even have an official dictionary order) or`
		817	`.CW -M`
		818	`which compares by case-insensitive English month name.`
		819	`Handling these options involves the`
		820	`larger issues of internationalization and is beyond the scope`
		821	`of this paper and our expertise.`
		822	`Plan 9`
		823	`.CW sort`
		824	`works sensibly with options that make sense relative to the input.`
		825	`The simple and most important options are, however, usually meaningful.`
		826	`In particular,`
		827	`.CW sort`
		828	`sorts UTF into the same order that`
		829	`.CW look`
		830	`expects.`
		831	`.PP`
		832	`Regular expression-matching algorithms need rethinking to`
		833	`be applied to UTF text.`
		834	`Deterministic automata are usually applied to bytes;`
		835	`converting them to operate on variable-sized byte sequences is awkward.`
		836	`On the other hand, converting the input stream to runes adds measurable`
		837	`expense`
		838	`and the state tables expand`
		839	`from size 256 to 1,114,112; it can be expensive just to generate them.`
		840	`For simple string searching,`
		841	`the Boyer-Moore algorithm works with UTF provided the input is`
		842	`guaranteed to be only valid UTF strings; however, it does not work`
		843	`with the old UTF encoding.`
		844	`At a more mundane level, even character classes are harder:`
		845	`the usual bit-vector representation within a non-deterministic automaton`
		846	`is unwieldy with 1,114,112 characters in the alphabet.`
		847	`.PP`
		848	`We compromised.`
		849	`An existing library for compiling and executing regular expressions`
		850	`was adapted to work on runes, with two entry points for searching`
		851	`in arrays of runes and arrays of chars (the pattern is always UTF text).`
		852	`Character classes are represented internally as runs of runes;`
		853	`the reserved value`
		854	`.CW FFFF`
		855	`marks the end of the class.`
		856	`Then`
		857	`.I all`
		858	`utilities that use regular expressions\(emeditors,`
		859	`.CW grep ,`
		860	`.CW awk ,`
		861	`etc.\(emexcept the shell, whose notation`
		862	`was grandfathered, were converted to use the library.`
		863	`For some programs, there was a concomitant loss of performance,`
		864	`but there was also a strong advantage.`
		865	`To our knowledge, Plan 9 is the only Unix-like system`
		866	`that has a single definition and implementation of`
		867	`regular expressions; patterns are written and interpreted`
		868	`identically by all the programs in the system.`
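One plausible reading of the character-class representation is a flat array of (low, high) rune pairs terminated by the reserved value. The sketch below assumes that layout; the names `Classend` and `inclass` are invented for the example, and the real library may differ in detail.

```c
typedef unsigned int Rune;	/* stand-in for the system's rune type (assumed) */

enum { Classend = 0xFFFF };	/* reserved value marking the end of a class */

/*
 * Report whether r falls within any run of the class.
 * cls holds inclusive (low, high) pairs followed by Classend.
 */
int
inclass(const Rune *cls, Rune r)
{
	const Rune *p;

	for(p = cls; p[0] != Classend; p += 2)
		if(p[0] <= r && r <= p[1])
			return 1;
	return 0;
}
```

A class such as `[a-zα-ω]` then occupies five array elements, where a bit vector would need a bit for every character in the alphabet.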
		869	`.PP`
		870	`A handful of programs have the notion of character built into them`
		871	`so strongly as to confuse the issue of what they should do with UTF input.`
		872	`Such programs were treated as individual special cases.`
		873	`For example,`
		874	`.CW wc`
		875	`is, by default, unchanged in behavior and output; a new option,`
		876	`.CW -r ,`
		877	`counts the number of correctly encoded runes\(emvalid UTF sequences\(emin`
		878	`its input;`
		879	`.CW -b`
		880	`the number of invalid sequences.`
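The two counts can be sketched as follows. This is not the source of `wc`; it assumes the current UTF (UTF-8) encoding, treats each byte that fails to begin a well-formed sequence as one invalid sequence, and does not reject surrogate values. The names `utfseqlen` and `countrunes` are hypothetical.

```c
#include <stddef.h>

/*
 * Length of the well-formed UTF-8 sequence starting at s
 * (n bytes available), or 0 if there is none.
 */
static size_t
utfseqlen(const unsigned char *s, size_t n)
{
	/* smallest value that legitimately needs a sequence of each length */
	static const unsigned long min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	unsigned long r;
	size_t i, len;

	if(n == 0)
		return 0;
	if(s[0] < 0x80)
		return 1;
	if((s[0] & 0xE0) == 0xC0){
		len = 2;
		r = s[0] & 0x1F;
	}else if((s[0] & 0xF0) == 0xE0){
		len = 3;
		r = s[0] & 0x0F;
	}else if((s[0] & 0xF8) == 0xF0){
		len = 4;
		r = s[0] & 0x07;
	}else
		return 0;	/* stray continuation byte or illegal lead byte */
	if(n < len)
		return 0;	/* truncated sequence */
	for(i = 1; i < len; i++){
		if((s[i] & 0xC0) != 0x80)
			return 0;
		r = (r << 6) | (s[i] & 0x3F);
	}
	if(r < min[len] || r > 0x10FFFF)
		return 0;	/* overlong form or out of range */
	return len;
}

/* Count valid runes and invalid sequences in the n bytes at s. */
void
countrunes(const unsigned char *s, size_t n, size_t *nrune, size_t *nbad)
{
	size_t len;

	*nrune = *nbad = 0;
	while(n > 0){
		len = utfseqlen(s, n);
		if(len == 0){
			(*nbad)++;
			len = 1;	/* resynchronize at the next byte */
		}else
			(*nrune)++;
		s += len;
		n -= len;
	}
}
```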
		881	`.PP`
		882	`It took us several months to convert all the software in the system`
		883	`to the Unicode Standard and the old UTF.`
		884	`When we decided to convert from that to the new UTF,`
		885	`only three things needed to be done.`
		886	`First, we rewrote the library routines to encode and decode the`
		887	`new UTF. This took an evening.`
		888	`Next, we converted all the files containing UTF`
		889	`to the new encoding.`
		890	`We wrote a trivial program to look for non-ASCII bytes in`
		891	`text files and used a Plan 9 program called`
		892	`.CW tcs`
		893	`(translate character set) to change encodings.`
		894	`Finally, we recompiled all the system software;`
		895	`the library interface was unchanged, so recompilation was sufficient`
		896	`to effect the transformation.`
		897	`The last two steps were done concurrently and took an afternoon.`
		898	`We concluded that the actual encoding is relatively unimportant to the`
		899	`software; the adoption of large characters and a byte-stream encoding`
		900	`.I per`
		901	`.I se`
		902	`are much deeper issues.`
		903	`.SH`
		904	`Graphics and fonts`
		905	`.PP`
		906	`Plan 9 provides only minimal support for plain text terminals.`
		907	`It is instead designed to be used with all character input and`
		908	`output mediated by a window system such as`
		909	`.CW 8½ .`
		910	`The window system and related software are responsible for the`
		911	`display of UTF text as Unicode character images.`
		912	`For plain text, the window system must provide a user-settable`
		913	`.I font`
		914	`that provides a (possibly empty) picture for each Unicode character.`
		915	`Fancier applications that use bold and Italic characters`
		916	`need multiple fonts storing multiple pictures for each`
		917	`Unicode value.`
		918	`All the issues are apparent, though,`
		919	`in just the problem of`
		920	`displaying a single image for each character, that is, the`
		921	`Unicode equivalent of a plain text terminal.`
		922	`With 128 or even 256 characters, a font can be just`
		923	`an array of bitmaps. With 1,114,112 characters,`
		924	`a more sophisticated design is necessary. To store the ideographs`
		925	`for just Japanese as 16×16×1 bit images,`
		926	`the smallest they can reasonably be, takes over a quarter of a`
		927	`megabyte. Make the images a little larger, store more bits per`
		928	`pixel, and hold a copy in every running application, and the`
		929	`memory cost becomes unreasonable.`
		930	`.PP`
		931	`The structure of the bitmap graphics services is described at length elsewhere`
		932	`[Pike91].`
		933	`In summary, the memory holding the bitmaps is stored in the same machine that has`
		934	`the display, mouse, and keyboard: the terminal in Plan 9 terminology,`
		935	`the workstation in others'.`
		936	`Access to that memory and associated services is provided`
		937	`by device files served by system`
		938	`software on the terminal. One of those files,`
		939	`.CW /dev/bitblt ,`
		940	`interprets messages written upon it as requests for actions`
		941	`corresponding to entry points in the graphics library:`
		942	`allocate a bitmap, execute a raster operation, draw a text string, etc.`
		943	`The window system`
		944	`acts as a multiplexer that mediates access to the services`
		945	`and resources of the terminal by simulating in each client window`
		946	`a set of files mirroring those provided by the system.`
		947	`That is, each window has a distinct`
		948	`.CW /dev/mouse ,`
		949	`.CW /dev/bitblt ,`
		950	`and so on through which applications drive graphical`
		951	`input and output.`
		952	`.PP`
		953	`One of the resources managed by`
		954	`.CW 8½`
		955	`and the terminal is the set of active`
		956	`.I subfonts.`
		957	`Each subfont holds the`
		958	`bitmaps and associated data structures for a sequential set of Unicode`
		959	`characters.`
		960	`Subfonts are stored in files and loaded into the terminal by`
		961	`.CW 8½`
		962	`or an application.`
		963	`For example, one subfont`
		964	`might hold the images of the first 256 characters of the Unicode space,`
		965	`corresponding to the Latin-1 character set;`
		966	`another might hold the standard phonetic character set, Unicode characters`
		967	`with value 0250 to 02E9.`
		968	`These files are collected in directories corresponding to typefaces:`
		969	`.CW /lib/font/bit/pelm`
		970	`contains the Pellucida Monospace character set, with subfonts holding`
		971	`the Latin-1, Greek, Cyrillic and other components of the typeface.`
		972	`A suffix on subfont files encodes (in a subfont-specific`
		973	`way) the size of the images:`
		974	`.CW /lib/font/bit/pelm/latin1.9`
		975	`contains the Latin-1 Pellucida Monospace characters with lower`
		976	`case letters 9 pixels high;`
		977	`.CW /lib/font/bit/jis/jis5400.16`
		978	`contains 16-pixel high`
		979	`ideographs starting at Unicode value 5400.`
		980	`.PP`
		981	`The subfonts do not identify which portion of the Unicode space`
		982	`they cover. Instead, a`
		983	`font file, in plain text,`
		984	`describes how to assemble subfonts into a complete`
		985	`character set.`
		986	`The font file is presented as an argument to the window system`
		987	`to determine how plain text is displayed in text windows and`
		988	`applications.`
		989	`Here is the beginning of the font file`
		990	`.CW /lib/font/bit/pelm/jis.9.font ,`
		991	`which describes the layout of a font covering that portion of`
		992	`the Unicode Standard for which we have characters of typical`
		993	`display size, using Japanese characters`
		994	`to cover the Han space:`
		995	`.P1`
		996	`18 14`
		997	`0x0000 0x00FF latin1.9`
		998	`0x0100 0x017E latineur.9`
		999	`0x0250 0x02E9 ipa.9`
		1000	`0x0386 0x03F5 greek.9`
		1001	`0x0400 0x0475 cyrillic.9`
		1002	`0x2000 0x2044 ../misc/genpunc.9`
		1003	`0x2070 0x208E supsub.9`
		1004	`0x20A0 0x20AA currency.9`
		1005	`0x2100 0x2138 ../misc/letterlike.9`
		1006	`0x2190 0x21EA ../misc/arrows`
		1007	`0x2200 0x227F ../misc/math1`
		1008	`0x2280 0x22F1 ../misc/math2`
		1009	`0x2300 0x232C ../misc/tech`
		1010	`0x2500 0x257F ../misc/chart`
		1011	`0x2600 0x266F ../misc/ding`
		1012	`.P2`
		1013	`.P1`
		1014	`0x3000 0x303f ../jis/jis3000.16`
		1015	`0x30a1 0x30fe ../jis/katakana.16`
		1016	`0x3041 0x309e ../jis/hiragana.16`
		1017	`0x4e00 0x4fff ../jis/jis4e00.16`
		1018	`0x5000 0x51ff ../jis/jis5000.16`
		1019	`\&...`
		1020	`.P2`
		1021	`The first two numbers set the interline spacing of the font (18`
		1022	`pixels) and the distance from the baseline to the top of the`
		1023	`line (14 pixels).`
		1024	`When characters are displayed, they are placed so as best`
		1025	`to fit within those constraints; characters`
		1026	`too large to fit will be truncated.`
		1027	`The rest of the file associates subfont files`
		1028	`with portions of Unicode space.`
		1029	`The first five such files are in the Pellucida Monospace typeface`
		1030	`and directory; others reside in other directories. The file names`
		1031	`are relative to the font file's own location.`
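Reading such a font file takes little code. The sketch below is an illustration in portable C rather than the graphics library's own parser; the structure `Fontdesc`, the functions `parsefont` and `subfontfor`, and the fixed table sizes are all assumptions made for the example.

```c
#include <stdio.h>

enum { Maxsub = 64, Maxname = 64 };	/* arbitrary limits for the sketch */

typedef struct Range Range;
struct Range {
	unsigned long lo, hi;	/* inclusive bounds in Unicode space */
	char name[Maxname];	/* subfont file, relative to the font file */
};

typedef struct Fontdesc Fontdesc;
struct Fontdesc {
	int height;	/* interline spacing, in pixels */
	int ascent;	/* baseline to top of line, in pixels */
	int nrange;
	Range range[Maxsub];
};

/*
 * Parse the text of a font file.  Returns 0 on success, -1 if the
 * header is missing or a range is inverted.  Parsing stops quietly
 * at the first entry that does not look like "lo hi name".
 */
int
parsefont(const char *text, Fontdesc *f)
{
	int n;
	Range *r;

	f->nrange = 0;
	if(sscanf(text, "%d %d%n", &f->height, &f->ascent, &n) != 2)
		return -1;
	text += n;
	while(f->nrange < Maxsub){
		r = &f->range[f->nrange];
		if(sscanf(text, "%lx %lx %63s%n", &r->lo, &r->hi, r->name, &n) != 3)
			break;
		if(r->lo > r->hi)
			return -1;
		f->nrange++;
		text += n;
	}
	return 0;
}

/* Name of the subfont covering rune, or NULL if the font has no picture for it. */
const char*
subfontfor(const Fontdesc *f, unsigned long rune)
{
	int i;

	for(i = 0; i < f->nrange; i++)
		if(f->range[i].lo <= rune && rune <= f->range[i].hi)
			return f->range[i].name;
	return NULL;
}
```

Given the file shown above, a lookup of Unicode value 03B1 would return `greek.9`, and a value in one of the gaps, such as 0180, would return no subfont at all.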
		1032	`.PP`
		1033	`There are several advantages to this two-level structure.`
		1034	`First, it simultaneously breaks the huge Unicode space into manageable`
		1035	`components and provides a unifying architecture for`
		1036	`assembling fonts from disjoint pieces.`
		1037	`Second, the structure promotes sharing.`
		1038	`For example, we have only one set of Japanese`
		1039	`characters but dozens of typefaces for the Latin-1 characters,`
		1040	`and this structure permits us to store only one copy of the`
		1041	`Japanese set but use it with any Roman typeface.`
		1042	`Also, customization is easy.`
		1043	`English-speaking users who don't need Japanese characters`
		1044	`but may want to read an on-line Oxford English Dictionary can`
		1045	`assemble a custom font with the`
		1046	`Latin-1 (or even just ASCII) characters and the International`
		1047	`Phonetic Alphabet (IPA).`
		1048	`Moreover, to do so requires just editing a plain text file,`
		1049	`not using a special font editing tool.`
		1050	`Finally, the structure guides the design of`
		1051	`caching protocols to improve performance and memory usage.`
		1052	`.PP`
		1053	`To load a complete Unicode character set into each application`
		1054	`would consume too`
		1055	`much memory and, particularly on slow terminal lines, would take`
		1056	`unreasonably long.`
		1057	`Instead, Plan 9 assembles a multi-level cache structure for`
		1058	`each font.`
		1059	`An application opens a font file, reads and parses it,`
		1060	`and allocates a data structure.`
		1061	`A message written to`
		1062	`.CW /dev/bitblt`
		1063	`allocates an associated structure held in the terminal, in particular,`
		1064	`a bitmap to act as a cache`
		1065	`for recently used character images.`
		1066	`Other messages copy these images to bitmaps such as the screen`
		1067	`by loading characters from subfonts into the cache on demand and`
		1068	`from there to the destination bitmap.`
		1069	`The protocol to draw characters is in terms of cache indices,`
		1070	`not Unicode character number or UTF sequences.`
		1071	`These details are hidden from the application, which instead`
		1072	`sees only a subroutine to draw a string in a bitmap from a`
		1073	`given font, functions to discover character size information,`
		1074	`and routines to allocate and to free fonts.`
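The idea of drawing by cache index can be illustrated with a toy least-recently-used cache. This is a sketch under stated assumptions, not the library's implementation: the cache is fixed at four slots, the loading of an image from its subfont is reduced to a comment, and the names `Cache` and `cachechar` are invented.

```c
typedef unsigned int Rune;	/* stand-in for the system's rune type (assumed) */

enum { Ncache = 4 };	/* tiny on purpose, to make eviction easy to see */

typedef struct Centry Centry;
struct Centry {
	Rune rune;	/* character whose image occupies this slot */
	unsigned long age;	/* time of last use; 0 means the slot is empty */
};

typedef struct Cache Cache;
struct Cache {
	Centry slot[Ncache];
	unsigned long clock;	/* advances on every lookup */
	unsigned long nmiss;	/* number of lookups that had to load an image */
};

/*
 * Return the cache index holding r.  On a miss, the least recently
 * used slot (or an empty one) is reused for r.
 * The cache must start out zeroed.
 */
int
cachechar(Cache *c, Rune r)
{
	int i, victim;

	c->clock++;
	victim = 0;
	for(i = 0; i < Ncache; i++){
		if(c->slot[i].age != 0 && c->slot[i].rune == r){
			c->slot[i].age = c->clock;	/* hit: refresh */
			return i;
		}
		if(c->slot[i].age < c->slot[victim].age)
			victim = i;
	}
	c->nmiss++;	/* a real cache would copy the image from its subfont here */
	c->slot[victim].rune = r;
	c->slot[victim].age = c->clock;
	return victim;
}
```

A string-drawing routine built on this would convert each rune to an index with `cachechar` and send only the indices to the terminal. The `nmiss` counter is the kind of statistic an adaptive scheme could watch when deciding whether to resize the cache.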
		1075	`.PP`
		1076	`As needed, whole`
		1077	`subfonts are opened by the graphics library, read, and then downloaded`
		1078	`to the terminal.`
		1079	`They are held open by the library in an LRU-replacement list.`
		1080	`Even when the program closes a subfont, it is retained`
		1081	`in the terminal for later use.`
		1082	`When the application opens a subfont, it first asks the terminal`
		1083	`whether it already has a copy, to avoid reading the subfont from the file`
		1084	`server if possible.`
		1085	`This level of cache has the property that the bitmaps for, say,`
		1086	`all the Japanese characters are stored only once, in the terminal;`
		1087	`the applications read only size and width information from the terminal`
		1088	`and share the images.`
		1089	`.PP`
		1090	`The sizes of the character and subfont caches held by the`
		1091	`application are adaptive.`
		1092	`A simple algorithm monitors the cache miss rate to enlarge and`
		1093	`shrink the caches as required.`
		1094	`The size of the character cache is limited to 2048 images,`
		1095	`which in practice seems enough even for Japanese text.`
		1096	`For plain ASCII-like text it naturally stays around 128 images.`
		1097	`.PP`
		1098	`This mechanism sounds complicated but is implemented by only about`
		1099	`500 lines in the library and considerably less in each of the`
		1100	`terminal's graphics driver and`
		1101	`.CW 8½ .`
		1102	`It has the advantage that only characters that are`
		1103	`being used are loaded into memory.`
		1104	`It is also efficient: if the characters being drawn`
		1105	`are in the cache the extra overhead is negligible.`
		1106	`It works particularly well for alphabetic character sets,`
		1107	`but also adapts on demand for ideographic sets.`
		1108	`When a user first looks at Japanese text, it takes a few`
		1109	`seconds to read all the font data, but thereafter the`
		1110	`text is drawn almost as fast as regular text (the images`
		1111	`are larger, so they draw a little more slowly).`
		1112	`Also, because the bitmaps are remembered by the terminal,`
		1113	`if a second application then looks at Japanese text`
		1114	`it starts faster than the first.`
		1115	`.PP`
		1116	`We considered`
		1117	building a `font server'
		1118	`to cache character images and associated data`
		1119	`for the applications, the window system, and the terminal.`
		1120	`We rejected this design because, although it isolated`
		1121	`many of the problems of font management in a separate program,`
		1122	`it didn't simplify the applications.`
		1123	`Moreover, in a distributed system such as Plan 9 it is easy`
		1124	`to have too many special purpose servers.`
		1125	`Making the management of the fonts the concern of only`
		1126	`the essential components simplifies the system and makes`
		1127	`bootstrapping less intricate.`
		1128	`.SH`
		1129	`Input`
		1130	`.PP`
		1131	`A completely different problem is how to type Unicode characters`
		1132	`as input to the system.`
		1133	`We selected an unused key on our ASCII keyboards`
		1134	`to serve as a prefix for multi-keystroke`
		1135	`sequences that generate Unicode characters.`
		1136	`For example, the character`
		1137	`.CW ü`
		1138	`is generated by the prefix key`
		1139	`(typically`
		1140	`.CW ALT`
		1141	`or`
		1142	`.CW Compose )`
		1143	`followed by a double quote and a lower-case`
		1144	`.CW u .`
		1145	`When that character is read by the application, from the file`
		1146	`.CW /dev/cons ,`
		1147	`it is of course presented as its UTF encoding.`
		1148	`Such sequences generate characters from an arbitrary set that`
		1149	`includes all of Latin-1 plus a selection of mathematical`
		1150	`and technical characters.`
		1151	`An arbitrary Unicode character may be generated by typing the prefix,`
		1152	`an upper case X, and four hexadecimal digits that identify`
		1153	`the Unicode value.`
		1154	`.PP`
		1155	`These simple mechanisms are adequate for most of our day-to-day needs:`
		1156	it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
		1157	`for accented Latin letters.`
		1158	`For the occasional unusual character, the cut and paste features of`
		1159	`.CW 8½`
		1160	`serve well. A program called (perhaps misleadingly)`
		1161	`.CW unicode`
		1162	`takes as argument a hexadecimal value, and prints the UTF representation of that character,`
		1163	`which may then be picked up with the mouse and used as input.`
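The core of such a program is small. The following sketch, written in portable C and assuming the current UTF (UTF-8) encoding, converts a hexadecimal argument to its byte sequence; it is not the source of `unicode`, and the function names `runetoutf` and `hextoutf` are hypothetical.

```c
#include <stdlib.h>

/*
 * Encode r as UTF-8 into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if r is out of range.
 */
int
runetoutf(unsigned long r, unsigned char *buf)
{
	if(r < 0x80){
		buf[0] = (unsigned char)r;
		return 1;
	}
	if(r < 0x800){
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	if(r < 0x10000){
		buf[0] = (unsigned char)(0xE0 | (r >> 12));
		buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (r & 0x3F));
		return 3;
	}
	if(r <= 0x10FFFF){
		buf[0] = (unsigned char)(0xF0 | (r >> 18));
		buf[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (r & 0x3F));
		return 4;
	}
	return 0;
}

/*
 * Convert a hexadecimal string such as "3b1" to UTF-8.
 * Returns the byte count, or 0 if hex is not a valid value.
 */
int
hextoutf(const char *hex, unsigned char *buf)
{
	char *end;
	unsigned long r;

	r = strtoul(hex, &end, 16);
	if(end == hex || *end != '\0')
		return 0;
	return runetoutf(r, buf);
}
```

A caller would write the returned bytes to standard output, where the window system displays them as a single character that can be picked up with the mouse.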
		1164	`.PP`
		1165	`These methods`
		1166	`are clearly unsatisfactory when working in a non-English language.`
		1167	`In the native country of such a language`
		1168	`the appropriate keyboard is likely to be at hand.`
		1169	`But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto`
		1170	`work in a language foreign to the keyboard.`
		1171	`.PP`
		1172	`For alphabetic languages such as Greek or Russian, it is`
		1173	`straightforward to construct a program that does phonetic substitution,`
		1174	so that, for example, typing a Latin `a' yields the Greek `α'.
		1175	`Within Plan 9, such a program can be inserted transparently`
		1176	`between the real keyboard and a program such as the window system,`
		1177	`providing a manageable input device for such languages.`
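The substitution itself amounts to a table lookup. The sketch below shows the idea for a handful of Greek letters; the table contents, the choice of Latin keys, and the name `phonetic` are assumptions for the example, and a usable program would cover the whole alphabet and sit between the keyboard and the window system as described above.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the system's rune type (assumed) */

typedef struct Subst Subst;
struct Subst {
	Rune key;	/* Latin letter typed */
	Rune rune;	/* Greek letter produced */
};

/* A deliberately partial table; the key assignments are illustrative. */
static const Subst greek[] = {
	{ 'a', 0x03B1 },	/* alpha */
	{ 'b', 0x03B2 },	/* beta */
	{ 'g', 0x03B3 },	/* gamma */
	{ 'd', 0x03B4 },	/* delta */
	{ 'e', 0x03B5 },	/* epsilon */
};

/* Map a typed rune to its substitute; anything not in the table passes through. */
Rune
phonetic(Rune r)
{
	size_t i;

	for(i = 0; i < sizeof greek / sizeof greek[0]; i++)
		if(greek[i].key == r)
			return greek[i].rune;
	return r;
}
```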
		1178	`.PP`
		1179	`For ideographic languages such as Chinese or Japanese the problem is harder.`
		1180	`Native users of such languages have adopted methods for dealing with`
		1181	`Latin keyboards that involve a hybrid technique based on phonetics`
		1182	`to generate a list of possible symbols followed by menu selection to`
		1183	`choose the desired one.`
		1184	`Such methods can be`
		1185	`effective, but their design must be rooted in information about`
		1186	`the language unknown to non-native speakers.`
		1187	`.CW Cxterm , (`
		1188	`a Chinese terminal emulator built by and for`
		1189	`Chinese programmers,`
		1190	`employs such a technique`
		1191	`[Pong and Zhang].)`
		1192	`Although the technical problem of implementing such a device`
		1193	`is easy in Plan 9\(emit is just an elaboration of the technique for`
		1194	`alphabetic languages\(emour lack of familiarity with such languages`
		1195	`has restrained our enthusiasm for building one.`
		1196	`.PP`
		1197	`The input problem is technically the least interesting but perhaps`
		1198	`emotionally the most important of the problems of converting a system`
		1199	`to an international character set.`
		1200	`Beyond that remain the deeper problems of internationalization`
		1201	`such as multi-lingual error messages and command names,`
		1202	`problems we are not qualified to solve.`
		1203	`With the ability to treat text of most languages on an equal`
		1204	`footing, though, we can begin down that path.`
		1205	`Perhaps people in non-English speaking countries will`
		1206	`consider adopting Plan 9, solving the input problem locally\(emperhaps`
		1207	`just by plugging in their local terminals\(emand begin to use`
		1208	`a system with at least the capacity to be international.`
		1209	`.SH`
		1210	`Acknowledgements`
		1211	`.PP`
		1212	`Dennis Ritchie provided consultation and encouragement.`
		1213	`Bob Flandrena converted most of the standard tools to UTF.`
		1214	`Brian Kernighan suffered cheerfully with several`
		1215	`inadequate implementations and converted`
		1216	`.CW troff`
		1217	`to UTF.`
		1218	`Rich Drechsler converted his Postscript driver to UTF.`
		1219	`John Hobby built the Postscript ☺.`
		1220	`We thank them all.`
		1221	`.SH`
		1222	`References`
		1223	`.LP`
		1224	`[ANSIC] \f2American National Standard for Information Systems \-`
		1225	`Programming Language C\f1, American National Standards Institute, Inc.,`
		1226	`New York, 1990.`
		1227	`.LP`
		1228	`[ISO10646]`
		1229	`ISO/IEC DIS 10646-1:1993`
		1230	`\f2Information technology \-`
		1231	`Universal Multiple-Octet Coded Character Set (UCS) \(em`
		1232	`Part 1: Architecture and Basic Multilingual Plane\fP.`
		1233	`.LP`
		1234	`[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,`
		1235	``Plan 9 from Bell Labs'',
		1236	`UKUUG Proc. of the Summer 1990 Conf.,`
		1237	`London, England,`
		1238	`1990.`
		1239	`.LP`
		1240	[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
		1241	`Conf. Proc., Nashville, 1991, reprinted in this volume.`
		1242	`.LP`
		1243	[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
		1244	`.LP`
		1245	[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
		1246	`A Chinese Terminal Emulator for the X Window System'',`
		1247	`.I`
		1248	`Software\(emPractice and Experience,`
		1249	`.R`
		1250	`Vol. 22(10), 809-826, October 1992.`
		1251	`.LP`
		1252	`[Unicode]`
		1253	`\f2The Unicode Standard,`
		1254	`Worldwide Character Encoding,`
		1255	`Version 1.0, Volume 1\f1,`
		1256	`The Unicode Consortium,`
		1257	`Addison Wesley,`
		1258	`New York,`
		1259	`1991.`
