.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界
.TL
Hello World
.br
or
.br
.ft R
Καλημέρα κόσμε
.br
or
.br
\f(Jpこんにちは 世界\fP
.AU
Rob Pike
Ken Thompson
.AB
.FS
Originally appeared, in a slightly different form, in
.I
Proc. of the Winter 1993 USENIX Conf.,
.R
pp. 43-50,
San Diego.
It has been revised to reflect the move to 21-bit Unicode.
.FE
Plan 9 from Bell Labs has recently been converted from ASCII
to an ASCII-compatible variant of the Unicode Standard,
a 16-bit (now 21-bit) character set.
In this paper we explain the reasons for the change,
describe the character set and representation we chose,
and present the programming models and software changes
that support the new text format.
Although we stopped short of full internationalization\(emfor
example, system error messages are in Unixese, not Japanese\(emwe
believe Plan 9 is the first system to treat the representation
of all major languages on a uniform, equal footing throughout all its
software.
.AE
.SH
Introduction
.PP
The world is multilingual but most computer systems
are based on English and ASCII.
The first release of Plan 9 [Pike90], a new distributed operating
system from Bell Laboratories, seemed a good occasion
to correct this chauvinism.
It is easier to make such deep changes when building new systems than
by refitting old ones.
.PP
The ANSI C standard [ANSIC] contains some guidance on the matter of
`wide' and `multi-byte' characters but falls far short of
solving the myriad associated problems.
We could find no literature on how to convert a
.I system
to larger character sets, although some individual
.I programs
had been converted.
This paper reports what we discovered as we
explored the problem of representing multilingual
text at all levels of an operating system,
from the file system and kernel through
the applications and up to the window system
and display.
.PP
Plan 9 has not been `internationalized':
its manuals are in English,
its error messages are in English,
and it can display text that goes from left to right only.
But before we can address these other problems,
we need to handle, uniformly and comfortably,
the textual representation of all the major written languages.
That subproblem is richer than we had anticipated.
.SH
Standards
.PP
Our first step was to select a standard.
At the time (January 1992),
there were only two viable options:
ISO 10646 [ISO10646] and Unicode [Unicode].
The documents describing both proposals were still in the draft stage.
.PP
The draft of ISO 10646 was not
very attractive to us.
It defined a sparse set of 32-bit characters,
which would be
hard to implement
and have punitive storage requirements.
Also, the draft attempted to
mollify national interests by allocating
16-bit subspaces to national committees
to partition individually.
The suggested mode of use was to
``flip'' between separate national
standards to implement the international standard.
This did not strike us as a sound basis for a character set.
As well, transmitting 32-bit values in a byte stream,
such as in pipes, would be expensive and hard to implement.
Since the standard does not define a byte order for such
transmission, the byte stream would also have to carry
state to enable the values to be recovered.
.PP
The Unicode Standard is a proposal by a consortium of mostly American
computer companies formed
to protest the technical
failings of ISO 10646.
It defines a uniform 16-bit code based on the
principle of unification:
two characters are the same if they look the
same even though they are from different
languages.
This principle, called Han unification,
allows the large Japanese, Chinese, and Korean
character sets to be packed comfortably into a 16-bit representation.
.PP
We chose the Unicode Standard for its technical merits and because its
code space was better defined.
Moreover,
the Unicode Consortium was derailing the
ISO 10646 standard.
(Now, in 1995,
ISO 10646 is a standard
with one 16-bit group defined,
which is almost exactly the Unicode Standard.
As most people expected, the two standards bodies
reached a détente and
ISO 10646 and Unicode represent the same character set.)
.PP
The Unicode Standard defines an adequate character set
but an unreasonable representation.
It states that all characters
are 16 bits wide and are communicated and stored in
16-bit units.
It also reserves a pair of characters
(hexadecimal FFFE and FEFF) to detect byte order
in transmitted text, requiring state in the byte stream.
(The Unicode Consortium was thinking of files, not pipes.)
To adopt this encoding,
we would have had to convert all text going
into and out of Plan 9 between ASCII and Unicode, which cannot be done.
Within a single program, in command of all its input and output,
it is possible to define characters as 16-bit quantities;
in the context of a networked system with
hundreds of applications on diverse machines
by different manufacturers,
it is impossible.
.PP
We needed a way to adapt the Unicode Standard to the tools-and-pipes
model of text processing embodied by the Unix system.
To do that, we
needed an ASCII-compatible textual
representation of Unicode characters for transmission
and storage.
In the draft ISO standard there was an informative
annex
called UTF
that provided a byte stream encoding
of the 32-bit ISO code.
The encoding uses multibyte sequences composed
from the 190 printable characters of Latin-1
to represent character values larger
than 159.
.PP
The UTF encoding has several good properties.
By far the most important is that
a byte in the ASCII range 0-127 represents
itself in UTF.
Thus UTF is backward compatible with ASCII.
.PP
UTF has other advantages.
It is a byte encoding and is
therefore byte-order independent.
ASCII control characters appear in the byte stream
only as themselves, never as an element of a sequence
encoding another character,
so newline bytes separate lines of UTF text.
Finally, ANSI C's
.CW strcmp
function applied to UTF strings preserves the ordering of Unicode characters.
.PP
To encode and decode UTF is expensive (involving multiplication,
division, and modulo operations) but workable.
UTF's major disadvantage is that the encoding
is not self-synchronizing.
It is in general impossible to find the character
boundaries in a UTF string without reading from
the beginning of the string, although in practice
control characters such as newlines,
tabs, and blanks provide synchronization points.
.PP
In August 1992,
X-Open circulated a proposal for another UTF-like
byte encoding of Unicode characters.
Their major concern was that an embedded character
in a file name
(in particular a slash)
could be part of an escape sequence in UTF and
therefore confuse a traditional file system.
Their proposal would allow all 7-bit ASCII characters
to represent themselves
.I "and only themselves"
in text.
Multibyte sequences would contain only characters
with the high bit set.
We proposed a modification to the new UTF that
would address our synchronization problem.
Our proposal, which was originally known informally as UTF-2 and FSS-UTF,
is now referred to as UTF-8 and has been approved by ISO to become
Annex P to ISO 10646.
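.PP
The synchronization property can be illustrated with a short sketch.
It assumes the now-standard UTF-8 bit layout, in which every byte after
the first of a multibyte sequence has the form 10xxxxxx.
The name utf8_sync is hypothetical; it is not a Plan 9 library routine,
and the code is portable C rather than Plan 9 C.

```c
#include <assert.h>
#include <stddef.h>

/* In UTF-8 every continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given an arbitrary byte offset i into the
 * n-byte buffer s, return the offset at which the character that
 * contains s[i] begins.  A character has at most three continuation
 * bytes, so the scan backs up at most three positions.
 */
size_t utf8_sync(const unsigned char *s, size_t n, size_t i)
{
	int steps = 0;

	assert(s != NULL && i < n);
	while (i > 0 && steps < 3 && is_continuation(s[i])) {
		i--;
		steps++;
	}
	return i;
}
```

A reader dropped into the middle of a stream therefore needs to look at
no more than a few neighboring bytes to find a character boundary,
which the original UTF could not promise.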
.PP
The model for text in Plan 9 is chosen from these
three standards*:
.FS
* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
.FE
the Unicode character set encoded as a byte stream by
UTF-8, from
(soon to be) Annex P of ISO 10646.
Although this mixture may seem like a precarious position for us to adopt,
it is not as bad as it sounds.
ISO 10646 and the Unicode Standard have converged,
other systems such as Linux have adopted the same character set and encoding,
and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
to exchange text between systems.
The prognosis for wide acceptance is good.
.PP
There are a couple of aspects of the Unicode Standard we have not faced.
One is the issue of right-to-left text such as Hebrew or Arabic.
Since that is an issue of display, not representation, we believe
we can defer that problem for the moment without affecting our
ability to solve it later.
Another issue is diacriticals and `combining characters',
which cause overstriking of multiple Unicode characters.
Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
such characters confuse the issues for Latin languages because they
generate multiple representations for accented characters.
ISO 10646 describes three levels of implementation;
in Plan 9 we decided not to address the issue.
Again, this can be labeled as a display issue and its finer points are still being debated,
so we felt comfortable deferring. Mañana.
.PP
Although we converted Plan 9 in the altruistic interests of
serving foreign languages, we have found the large character
set attractive for other reasons. The Unicode Standard includes many
characters\(emmathematical symbols, scientific notation,
more general punctuation, and more\(emthat we now use
daily in our work. We no longer test our imaginations
to find ways to include non-ASCII symbols in our text;
why type
.CW :-)
when you can use the character ☺?
Most compelling is the ability to absorb documents
and data that contain non-ASCII characters; our browser for the
Oxford English Dictionary
lets us see the dictionary as it really is, with pronunciation
in the IPA font, foreign phrases properly rendered, and so on,
.I "in plain text.
.PP
As of Unicode 4.0,
characters are now 21 bits wide and the longest UTF-8 encoding of a character
requires 4 bytes.
We are adapting the system to match.
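.PP
To make the four-byte limit concrete, here is a minimal sketch of a
UTF-8 encoder written in portable C.
It is an illustration under stated assumptions, not the Plan 9 library
source: the name utf8_encode is hypothetical, and surrogate values are
not treated specially.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical encoder: store the UTF-8 encoding of code point c in
 * buf, which must hold at least 4 bytes, and return the number of
 * bytes written.  Values above 0x10FFFF are replaced by the error
 * character 0xFFFD.  Surrogates are not checked in this sketch.
 */
int utf8_encode(unsigned char *buf, unsigned long c)
{
	assert(buf != NULL);
	if (c > 0x10FFFFUL)
		c = 0xFFFD;
	if (c < 0x80) {			/* 7 bits: ASCII, one byte */
		buf[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {		/* 11 bits: two bytes */
		buf[0] = (unsigned char)(0xC0 | (c >> 6));
		buf[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000UL) {		/* 16 bits: three bytes */
		buf[0] = (unsigned char)(0xE0 | (c >> 12));
		buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (c & 0x3F));
		return 3;
	}
	/* 21 bits: four bytes */
	buf[0] = (unsigned char)(0xF0 | (c >> 18));
	buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
	buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	buf[3] = (unsigned char)(0x80 | (c & 0x3F));
	return 4;
}
```

Only shifts and masks are needed, in contrast to the multiplication,
division, and modulo operations of the original UTF.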
.PP
In the rest of this paper, except when
stated otherwise, the term `UTF' refers to the UTF-8 encoding
of Unicode characters as adopted by Plan 9.
.SH
C Compiler
.PP
The first program to be converted to UTF
was the C Compiler.
There are two levels of conversion.
On the syntactic level,
input to the C compiler
is UTF; on the semantic level,
the C language needs to define
how compiled programs manipulate
the UTF set.
.PP
The syntactic part is simple.
The ANSI C language standard defines the
source character set to be ASCII.
Since UTF is backward compatible with ASCII,
the compiler needs little change.
The only places where a larger character set
is allowed are in character constants, strings, and comments.
Since 7-bit ASCII characters can represent only
themselves in UTF,
the compiler does not have to be careful while looking
for the termination of a string or comment.
.PP
The Plan 9 compiler extends ANSI C to treat any Unicode
character with a value outside of the ASCII range as
an alphabetic.
To a Greek programmer or an English mathematician,
α is a sensible and now valid variable name.
.PP
On the semantic level, ANSI C allows,
but does not tie down,
the notion of a
.I "wide character
and admits string and character constants
of this type.
We chose the wide character type to be
.CW unsigned
.CW short
(now
.CW unsigned
.CW long) .
In the libraries, the word
.CW Rune
is now defined by a
.CW typedef
to be equivalent to
.CW unsigned
.CW long
and is
used to signify a Unicode character.
.PP
There are surprises; for example:
.P1
L'x'	\f1is 120\fP
\&'x'	\f1is 120\fP
L'ÿ'	\f1is 255\fP
\&'ÿ'	\f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
L'\f1α\fP'	\f1is 945\fP
\&'\f1α\fP'	\f1is illegal\fP
.P2
In the string constants,
.P1
"\f(Jpこんにちは 世界\fP"
L"\f(Jpこんにちは 世界\fP",
.P2
the former is an array of
.CW chars
with 22 elements
and a null byte,
while the latter is an array of
.CW unsigned
.CW long s
.CW Runes ) (
with 8 elements and a null
.CW Rune .
.PP
The Plan 9 library provides an output conversion function,
.CW print
(analogous to
.CW printf ),
with formats
.CW %c ,
.CW %C ,
.CW %s ,
and
.CW %S .
Since
.CW print
produces text, its output is always UTF.
The character conversion
.CW %c
(lower case) masks its argument
to 8 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW 'ÿ'
printed under
.CW %c
will be identical,
but
.CW L'\f1α\fP'
will print as the Unicode
character with decimal value 177.
The character conversion
.CW %C
(upper case) masks its argument
to 16 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW L'\f1α\fP'
will print correctly under
.CW %C ,
but
.CW 'ÿ'
will not.
The conversion
.CW %s
(lower case)
expects a pointer to
.CW char
and copies UTF sequences up to a null byte.
The conversion
.CW %S
(upper case) expects a pointer to
.CW Rune
and
performs sequential
.CW %C
conversions until a null
.CW Rune
is encountered.
.PP
Another problem in format conversion
is the definition of
.CW %10s :
does the number refer to bytes or characters?
We decided that such formats were most
often used to align output columns and
so made the number count characters.
Some programs, however, use the count
to place blank-padded strings
in fixed-sized arrays.
These programs must be found and corrected.
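.PP
The difference between the two readings of the count can be sketched
as follows.
The helpers utf8_count and pad_for_width are hypothetical names, not
part of the print implementation, and the code assumes its input is
valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the characters in a null-terminated string, assuming valid
 * UTF-8: every byte that is not a continuation byte (10xxxxxx)
 * begins a new character.
 */
size_t utf8_count(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/* Blanks needed to pad s out to width characters (0 if already wider). */
size_t pad_for_width(const char *s, size_t width)
{
	size_t n = utf8_count(s);

	return n < width ? width - n : 0;
}
```

For the three Greek letters αβγ, which occupy six bytes, a field of
width 10 needs seven blanks when the count means characters; a count
in bytes would supply only four and the columns would not line up.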
.PP
Here is a complete example:
.P1
#include <u.h>
#include <libc.h>

char c[] = "\f(Jpこんにちは 世界\fP";
Rune s[] = L"\f(Jpこんにちは 世界\fP";

void
main(void)
{
	print("%d, %d\en", sizeof(c), sizeof(s));
	print("%s\en", c);
	print("%S\en", s);
	exits(nil);
}
.P2
This program prints
.CW 23,
.CW 36
(it printed
.CW 18
when a
.CW Rune
was 16 bits wide)
and then two identical lines of
UTF text.
In practice,
.CW %S
and
.CW L"..."
are rare in programs; one reason is
that most formatted I/O is done in unconverted UTF.
.SH
Ramifications
.PP
All programs in Plan 9 now read and write text as UTF, not ASCII.
This change breaks two deep-rooted symmetries implicit in most C programs:
.IP 1.
A character is no longer a
.CW char .
.IP 2.
The internal representation (Rune) of a character now differs from its
external representation (UTF).
456 |
In the sections that follow,
457 |
we show how these issues were faced in the layers of
458 |
system software from the operating system up to the applications.
459 |
The effects are wide-reaching and often surprising.
.SH
Operating system
.PP
Since UTF is the only format for text in Plan 9,
the interface to the operating system had to be converted to UTF.
Text strings cross the interface in several places:
command arguments,
file names,
user names (people can log in using their native name),
error messages,
and miscellaneous minor places such as commands to the I/O system.
Little change was required: null-terminated UTF strings
are equivalent to null-terminated ASCII strings for most purposes
of the operating system.
The library routines described in the next section made that
change straightforward.
.PP
The window system, once called
.CW 8.5 ,
is now rightfully called
.CW 8½ .
.SH
Libraries
.PP
A header file included by all programs (see [Pike92]) declares
the
.CW Rune
type to hold 21-bit character values:
.P1
typedef unsigned long Rune;
.P2
Also defined are several constants relevant to UTF:
.P1
enum
{
	UTFmax    = 4,        /* maximum bytes per rune */
	Runesync  = 0x80,     /* cannot be in a UTF sequence (<) */
	Runeself  = 0x80,     /* rune==UTF sequence (<) */
	Runeerror = 0xFFFD,   /* decoding error in UTF */
	Runemax   = 0x10FFFF, /* largest 21-bit rune */
	Runemask  = 0x1FFFFF, /* bits used by runes (see grep) */
};
.P2
(With the original UTF,
.CW Runesync
was hexadecimal 21 and
.CW Runeself
was A0.)
.CW UTFmax
bytes are sufficient
to hold the UTF encoding of any Unicode character.
Characters of value less than
.CW Runesync
only appear in a UTF string as
themselves, never as part of a sequence encoding another character.
Characters of value less than
.CW Runeself
encode into single bytes
of the same value.
Finally, when the library detects errors in UTF input\(embyte sequences
that are not valid UTF sequences\(emit converts the first byte of the
error sequence to the character
.CW Runeerror .
There is little a rune-oriented program can do when given bad data
except exit, which is unreasonable, or carry on.
Originally the conversion routines, described below,
returned errors when given invalid UTF,
but we found ourselves repeatedly checking for errors and ignoring them.
We therefore decided to convert a bad sequence to a valid rune
and continue processing.
(The ANSI C routines, on the other hand, return errors.)
.PP
This technique does have the unfortunate property that converting
invalid UTF byte strings in and out of runes does not preserve the input,
but this circumstance only occurs when non-textual input is
given to a textual program.
The Unicode Standard defines an error character, value FFFD, to stand for
characters from other sets that it does not represent.
The
.CW Runeerror
character is a different concept, related to the encoding rather than the character set.
.PP
The Plan 9 C library contains a number of routines for
manipulating runes.
The first set converts between runes and UTF strings:
.P1
extern int runetochar(char*, Rune*);
extern int chartorune(Rune*, char*);
extern int runelen(long);
extern int fullrune(char*, int);
.P2
.CW Runetochar
translates a single
.CW Rune
to a UTF sequence and returns the number of bytes produced.
.CW Chartorune
goes the other way, reporting how many bytes were consumed.
.CW Runelen
returns the number of bytes in the UTF encoding of a rune.
.CW Fullrune
examines a UTF string up to a specified number of bytes
and reports whether the string begins with a complete UTF encoding.
All these routines use the
.CW Runeerror
character to work around encoding problems.
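.PP
The error policy described above, in which a bad sequence becomes one
error rune and costs one byte, can be sketched in portable C.
This is an illustration of the policy, not the Plan 9 source: the name
utf8_decode, its explicit length argument, and the constant RUNE_ERROR
are assumptions made for the example, and surrogate values are not
rejected.

```c
#include <assert.h>
#include <stddef.h>

enum { RUNE_ERROR = 0xFFFD };	/* same value as Runeerror in the text */

/*
 * Hypothetical decoder: decode one character from the n bytes at s,
 * store it in *r, and return the number of bytes consumed.
 * A malformed, overlong, or truncated sequence yields RUNE_ERROR and
 * consumes exactly one byte, so the caller always makes progress.
 */
int utf8_decode(unsigned long *r, const unsigned char *s, size_t n)
{
	/* smallest value that legitimately needs a sequence of each length */
	static const unsigned long min[] = { 0, 0, 0x80, 0x800, 0x10000UL };
	unsigned long c;
	int len, i;

	assert(r != NULL && s != NULL && n > 0);
	*r = RUNE_ERROR;		/* assume failure until proven valid */
	c = s[0];
	if (c < 0x80) {			/* ASCII stands for itself */
		*r = c;
		return 1;
	}
	if ((c & 0xE0) == 0xC0) {
		len = 2;
		c &= 0x1F;
	} else if ((c & 0xF0) == 0xE0) {
		len = 3;
		c &= 0x0F;
	} else if ((c & 0xF8) == 0xF0) {
		len = 4;
		c &= 0x07;
	} else
		return 1;		/* stray continuation or invalid lead byte */
	if ((size_t)len > n)
		return 1;		/* sequence is cut short */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 1;	/* expected a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min[len] || c > 0x10FFFFUL)
		return 1;		/* overlong form or out of range */
	*r = c;
	return len;
}
```

Because the function never fails outright, a loop that calls it
repeatedly needs no error branch, which matches the experience
reported above of checking for errors only to ignore them.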
.PP
There is also a set of routines for examining null-terminated UTF strings,
based on the model of the ANSI standard
.CW str
routines, but with
.CW utf
substituted for
.CW str
and
.CW rune
for
.CW chr :
.P1
extern int utflen(char*);
extern char* utfrune(char*, long);
extern char* utfrrune(char*, long);
extern char* utfutf(char*, char*);
.P2
.CW Utflen
returns the number of runes in a UTF string;
.CW utfrune
returns a pointer to the first occurrence of a rune in a UTF string;
and
.CW utfrrune
a pointer to the last.
.CW Utfutf
searches for the first occurrence of a UTF string in another UTF string.
Given the synchronizing property of UTF-8,
.CW utfutf
is the same as
.CW strstr
if the arguments point to valid UTF strings.
.PP
It is a mistake to use
.CW strchr
or
.CW strrchr
unless searching for a 7-bit ASCII character, that is, a character
less than
.CW Runeself .
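.PP
A minimal sketch of why a rune-aware search is needed, written in
portable C.
The function find_rune is a hypothetical stand-in for the behavior
described for utfrune, not its actual source; it assumes s is valid
UTF-8 and relies on the observation above that a substring search is
safe on such strings.

```c
#include <assert.h>
#include <string.h>

/* Encode c (0x80 <= c <= 0x10FFFF) as a null-terminated UTF-8 string. */
static void encode(char buf[5], unsigned long c)
{
	int n = 0;

	if (c < 0x800) {
		buf[n++] = (char)(0xC0 | (c >> 6));
	} else if (c < 0x10000UL) {
		buf[n++] = (char)(0xE0 | (c >> 12));
		buf[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
	} else {
		buf[n++] = (char)(0xF0 | (c >> 18));
		buf[n++] = (char)(0x80 | ((c >> 12) & 0x3F));
		buf[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
	}
	buf[n++] = (char)(0x80 | (c & 0x3F));	/* final continuation byte */
	buf[n] = '\0';
}

/*
 * Hypothetical search: return a pointer to the first occurrence of
 * code point c in the valid UTF-8 string s, or NULL if absent.
 */
char *find_rune(const char *s, unsigned long c)
{
	char needle[5];

	assert(s != NULL && c <= 0x10FFFFUL);
	if (c < 0x80)			/* ASCII bytes stand only for themselves */
		return strchr(s, (int)c);
	encode(needle, c);
	return strstr(s, needle);	/* safe because UTF-8 self-synchronizes */
}
```

By contrast, handing a single byte of a multibyte character to strchr,
for instance the second byte of α, would return a pointer into the
middle of that character's encoding.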
.PP
We have no routines for manipulating null-terminated arrays of
.CW Runes .
Although they should probably exist for completeness, we have
found no need for them, for the same reason that
.CW %S
and
.CW L"..."
are rarely used.
.PP
Most Plan 9 programs use a new buffered I/O library, BIO, in place of
Standard I/O.
BIO contains routines to read and write UTF streams, converting to and from
runes.
.CW Bgetrune
returns, as a
.CW Rune
within a
.CW long ,
the next character in the UTF input stream;
.CW Bputrune
takes a rune and writes its UTF representation.
.CW Bungetrune
puts a rune back into the input stream for rereading.
.PP
Plan 9 programs use a simple set of macros to process command line arguments.
Converting these macros to UTF automatically updated the
argument processing of most programs.
In general,
argument flag names can no longer be held in bytes and
arrays of 256 bytes cannot be used to hold a set of flags.
.PP
We have done nothing analogous to ANSI C's locales, partly because
we do not feel qualified to define locales and partly because we remain
unconvinced of that model for dealing with the problems.
That is really more an issue of internationalization than conversion
to a larger character set; on the other hand,
because we have chosen a single character set that encompasses
most languages, some of the need for
locales is eliminated.
(We have a utility,
.CW tcs ,
that translates between UTF and other character sets.)
.PP
There are several reasons why our library does not follow the ANSI design
for wide and multi-byte characters.
The ANSI model was designed by a committee, untried, almost
as an afterthought, whereas
we wanted to design as we built.
(We made several major changes to the interface
as we became familiar with the problems involved.)
We disagree with ANSI C's handling of invalid multi-byte sequences.
Also, the ANSI C library is incomplete:
although it contains some crucial routines for handling
wide and multi-byte characters, there are some serious omissions.
For example, our software can exploit
the fact that UTF preserves ASCII characters in the byte stream.
We could remove that assumption by replacing all
calls to
.CW strchr
with
.CW utfrune
and so on.
(Because of the weaker properties of the original UTF,
we have actually done so.)
ANSI C cannot:
the standard says nothing about the representation, so portable code should
.I never
call
.CW strchr ,
yet there is no ANSI equivalent to
.CW utfrune .
ANSI C simultaneously invalidates
.CW strchr
and offers no replacement.
.PP
Finally, ANSI did nothing to integrate wide characters
into the I/O system: it gives no method for printing
wide characters.
We therefore needed to invent some things and decided to invent
everything.
In the end, some of our entry points do correspond closely to
ANSI routines\(emfor example
.CW chartorune
and
.CW runetochar
are similar to
.CW mbtowc
and
.CW wctomb \(embut
Plan 9's library defines more functionality, enough
to write real applications comfortably.
.SH
Converting the tools
.PP
The source for our tools and applications had already been converted to
work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
Standard and UTF is more involved.
Some programs needed no change at all:
.CW cat ,
for instance,
interprets its argument strings, delivered in UTF,
as file names that it passes uninterpreted to the
.CW open
system call,
and then just copies bytes from its input to its output;
it never makes decisions based on the values of the bytes.
(Plan 9
.CW cat
has no options such as
.CW -v
to complicate matters.)
Most programs, however, needed modest change.
.PP
It is difficult to
find automatically the places that need attention,
but
.CW grep
helps.
Software that uses the libraries conscientiously can be searched
for calls to library routines that examine bytes as characters:
.CW strchr ,
.CW strrchr ,
.CW strstr ,
etc.
Replacing these by calls to
.CW utfrune ,
.CW utfrrune ,
and
.CW utfutf
is enough to fix many programs.
Few tools actually need to operate on runes internally;
more typically they need only to look for the final slash in a file
name and similar trivial tasks.
Of the 170 C source programs in the top levels of
.CW /sys/src/cmd ,
only 23 now contain the word
.CW Rune .
.PP
The programs that
.I do
store runes internally
are mostly those whose
.I raison
.I d'être
is character manipulation:
.CW sam
(the text editor),
.CW sed ,
.CW sort ,
.CW tr ,
.CW troff ,
.CW 8½
(the window system and terminal emulator),
and so on.
To decide whether to compute using runes
or UTF-encoded byte strings requires balancing the cost of converting
the data when read and written
against the cost of converting relevant text on demand.
For programs such as editors that run a long time with a relatively
constant dataset, runes are the better choice.
There are space considerations too, but they are more complicated:
plain ASCII text grows when converted to runes; UTF-encoded Japanese
shrinks.
.PP
Again, it is hard to automate the conversion of a program from
.CW chars
to
.CW Runes .
It is not enough just to change the type of variables; the assumption
that bytes and characters are equivalent can be insidious.
For instance, to clear a character array by
.P1
memset(buf, 0, BUFSIZE)
.P2
becomes wrong if
.CW buf
is changed from an array of
.CW chars
to an array of
.CW Runes .
Any program that indexes tables based on character values needs
rethinking.
Consider
.CW tr ,
which originally used multiple 256-byte arrays for the mapping.
The naïve conversion would yield multiple 1,114,112-rune arrays.
Instead Plan 9
.CW tr
saves space by building in effect
a run-encoded version of the map.
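.PP
One way such a run-encoded map might look is sketched below in
portable C.
The structure and the function translate are hypothetical and are not
taken from the Plan 9 tr source; the sketch assumes the runs are sorted
by their low end and do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* One run of the map: code points lo..hi translate to to..to+(hi-lo). */
struct run {
	unsigned long lo, hi, to;
};

/*
 * Hypothetical lookup: translate c using a table of nruns
 * non-overlapping runs sorted by lo.  Characters that no run covers
 * are returned unchanged.
 */
unsigned long translate(const struct run *runs, size_t nruns, unsigned long c)
{
	size_t lo = 0, hi = nruns;

	assert(runs != NULL || nruns == 0);
	while (lo < hi) {		/* binary search for the covering run */
		size_t mid = lo + (hi - lo) / 2;

		if (c < runs[mid].lo)
			hi = mid;
		else if (c > runs[mid].hi)
			lo = mid + 1;
		else
			return runs[mid].to + (c - runs[mid].lo);
	}
	return c;
}
```

A positional mapping such as a-zα-ω onto A-ZΑ-Ω then needs two runs of
three values each instead of a table with an entry for every possible
character.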
.PP
.CW Sort
has related problems.
The cooperation of UTF and
.CW strcmp
means that a simple sort\(emone with no options\(emcan be done
on the original UTF strings using
.CW strcmp .
With sorting options enabled, however,
.CW sort
may need to convert its input to runes: for example,
.CW -t\f1α\fP
requires searching for alphas in the input text to
crack the input into fields.
The field specifier
.CW +3.2
refers to 2 runes beyond the third field.
Some of the other options are hopelessly provincial:
consider the case-folding and dictionary order options
(Japanese doesn't even have an official dictionary order) or
.CW -M
which compares by case-insensitive English month name.
Handling these options involves the
larger issues of internationalization and is beyond the scope
of this paper and our expertise.
Plan 9
.CW sort
works sensibly with options that make sense relative to the input.
The simple and most important options are, however, usually meaningful.
In particular,
.CW sort
sorts UTF into the same order that
.CW look
expects.
.PP
Regular expression-matching algorithms need rethinking to
be applied to UTF text.
Deterministic automata are usually applied to bytes;
converting them to operate on variable-sized byte sequences is awkward.
On the other hand, converting the input stream to runes adds measurable
expense
and the state tables expand
from size 256 to 1,114,112; it can be expensive just to generate them.
For simple string searching,
the Boyer-Moore algorithm works with UTF provided the input is
guaranteed to be only valid UTF strings; however, it does not work
with the old UTF encoding.
At a more mundane level, even character classes are harder:
the usual bit-vector representation within a non-deterministic automaton
is unwieldy with 1,114,112 characters in the alphabet.
.PP
We compromised.
An existing library for compiling and executing regular expressions
was adapted to work on runes, with two entry points for searching
in arrays of runes and arrays of chars (the pattern is always UTF text).
Character classes are represented internally as runs of runes;
the reserved value
marks the end of the class.
Then
.I all
utilities that use regular expressions\(emeditors,
.CW grep ,
.CW awk ,
etc.\(emexcept the shell, whose notation
was grandfathered, were converted to use the library.
For some programs, there was a concomitant loss of performance,
but there was also a strong advantage.
To our knowledge, Plan 9 is the only Unix-like system
that has a single definition and implementation of
regular expressions; patterns are written and interpreted
identically by all the programs in the system.
.PP
A handful of programs have the notion of character built into them
so strongly as to confuse the issue of what they should do with UTF input.
Such programs were treated as individual special cases.
For example,
.CW wc
is, by default, unchanged in behavior and output; a new option,
.CW -r ,
counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
its input;
.CW -b
the number of invalid sequences.

It took us several months to convert all the software in the system
to the Unicode Standard and the old UTF.
When we decided to convert from that to the new UTF,
only three things needed to be done.
First, we rewrote the library routines to encode and decode the
new UTF. This took an evening.
Next, we converted all the files containing UTF
to the new encoding.
We wrote a trivial program to look for non-ASCII bytes in
text files and used a Plan 9 program called
.CW tcs
(translate character set) to change encodings.
Finally, we recompiled all the system software;
the library interface was unchanged, so recompilation was sufficient
to effect the transformation.
The last two steps were done concurrently and took an afternoon.
We concluded that the actual encoding is relatively unimportant to the
software; the adoption of large characters and a byte-stream encoding
.I per
.I se
are much deeper issues.
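The search for files needing conversion might look like the sketch below. It is written in Python for brevity and is not the program described above; `first_non_ascii` and `needs_conversion` are hypothetical names. A file made only of ASCII bytes reads the same under either encoding and can be left alone.

```python
def first_non_ascii(data):
    """Return the offset of the first byte with the high bit set, or -1."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1

def needs_conversion(path):
    """True if the file holds any non-ASCII byte and so must be re-encoded."""
    with open(path, "rb") as f:
        return first_non_ascii(f.read()) >= 0
```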

Graphics and fonts

Plan 9 provides only minimal support for plain text terminals.
It is instead designed to be used with all character input and
output mediated by a window system such as
.CW 8½ .
The window system and related software are responsible for the
display of UTF text as Unicode character images.
For plain text, the window system must provide a user-settable
.I font
that provides a (possibly empty) picture for each Unicode character.
Fancier applications that use bold and Italic characters
need multiple fonts storing multiple pictures for each
Unicode value.
All the issues are apparent, though,
in just the problem of
displaying a single image for each character, that is, the
Unicode equivalent of a plain text terminal.
With 128 or even 256 characters, a font can be just
an array of bitmaps. With 1,114,112 characters,
a more sophisticated design is necessary. To store the ideographs
for just Japanese as 16×16×1 bit images,
the smallest they can reasonably be, takes over a quarter of a
megabyte. Make the images a little larger, store more bits per
pixel, and hold a copy in every running application, and the
memory cost becomes unreasonable.
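The arithmetic behind these figures is easy to check. The Python sketch below uses hypothetical helper names and assumes uncompressed images with no per-character metadata. The text gives no count of Japanese ideographs, so the sketch instead shows how many 16×16×1 images fit in a quarter of a megabyte.

```python
def glyph_bytes(width, height, bits_per_pixel=1):
    """Bytes needed for one uncompressed character image."""
    return width * height * bits_per_pixel // 8

def font_bytes(count, width, height, bits_per_pixel=1):
    """Bytes needed for count images of the same size."""
    return count * glyph_bytes(width, height, bits_per_pixel)

QUARTER_MEGABYTE = 256 * 1024

per_glyph = glyph_bytes(16, 16)                    # 32 bytes
fit = QUARTER_MEGABYTE // per_glyph                # 8192 images
whole_space = font_bytes(1_114_112, 16, 16)        # 35,651,584 bytes (34 MB)
```

So the quarter-megabyte figure corresponds to somewhat more than 8,000 images, and a naive array covering the whole code space would take 34 megabytes at the smallest size, before allowing for larger images or deeper pixels.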

The structure of the bitmap graphics services is described at length elsewhere.
In summary, the memory holding the bitmaps is stored in the same machine that has
the display, mouse, and keyboard: the terminal in Plan 9 terminology,
the workstation in others'.
Access to that memory and associated services is provided
by device files served by system
software on the terminal. One of those files,
.CW /dev/bitblt ,
interprets messages written upon it as requests for actions
corresponding to entry points in the graphics library:
allocate a bitmap, execute a raster operation, draw a text string, etc.
The window system
acts as a multiplexer that mediates access to the services
and resources of the terminal by simulating in each client window
a set of files mirroring those provided by the system.
That is, each window has a distinct
.CW /dev/mouse ,
.CW /dev/bitblt ,
and so on through which applications drive graphical
input and output.

One of the resources managed by
.CW 8½
and the terminal is the set of active
.I subfonts .
Each subfont holds the
bitmaps and associated data structures for a sequential set of Unicode
characters.
Subfonts are stored in files and loaded into the terminal by
.CW 8½
or an application.
For example, one subfont
might hold the images of the first 256 characters of the Unicode space,
corresponding to the Latin-1 character set;
another might hold the standard phonetic character set, Unicode characters
with values 0250 to 02E9.
These files are collected in directories corresponding to typefaces:
.CW /lib/font/bit/pelm
contains the Pellucida Monospace character set, with subfonts holding
the Latin-1, Greek, Cyrillic and other components of the typeface.
A suffix on subfont files encodes (in a subfont-specific
way) the size of the images:
.CW /lib/font/bit/pelm/latin1.9
contains the Latin-1 Pellucida Monospace characters with lower
case letters 9 pixels high;
.CW /lib/font/bit/jis/jis5400.16
contains 16-pixel high
ideographs starting at Unicode value 5400.

The subfonts do not identify which portion of the Unicode space
they cover. Instead, a
font file, in plain text,
describes how to assemble subfonts into a complete
character set.
The font file is presented as an argument to the window system
to determine how plain text is displayed in text windows and
Here is the beginning of the font file
.CW /lib/font/bit/pelm/jis.9.font ,
which describes the layout of a font covering that portion of
the Unicode Standard for which we have characters of typical
display size, using Japanese characters
to cover the Han space:

18 14
0x0000 0x00FF latin1.9
0x0100 0x017E latineur.9
0x0250 0x02E9 ipa.9
0x0386 0x03F5 greek.9
0x0400 0x0475 cyrillic.9
0x2000 0x2044 ../misc/genpunc.9
0x2070 0x208E supsub.9
0x20A0 0x20AA currency.9
0x2100 0x2138 ../misc/letterlike.9
0x2190 0x21EA ../misc/arrows
0x2200 0x227F ../misc/math1
0x2280 0x22F1 ../misc/math2
0x2300 0x232C ../misc/tech
0x2500 0x257F ../misc/chart
0x2600 0x266F ../misc/ding

0x3000 0x303f ../jis/jis3000.16
0x30a1 0x30fe ../jis/katakana.16
0x3041 0x309e ../jis/hiragana.16
0x4e00 0x4fff ../jis/jis4e00.16
0x5000 0x51ff ../jis/jis5000.16

The first two numbers set the interline spacing of the font (18
pixels) and the distance from the baseline to the top of the
line (14 pixels).
When characters are displayed, they are placed so as best
to fit within those constraints; characters
too large to fit will be truncated.
The rest of the file associates subfont files
with portions of Unicode space.
The first five such files are in the Pellucida Monospace typeface
and directory; others reside in other directories. The file names
are relative to the font file's own location.
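Reading such a file takes only a few lines. The following Python sketch parses the format as described above and finds the subfont covering a given rune; `parse_font` and `subfont_for` are hypothetical names, and returning the rune's offset from the start of its range is an assumption about how a character is located within a subfont.

```python
import posixpath

def parse_font(text, font_path):
    """Parse font-file text into (height, ascent, ranges).

    Each range is (low, high, path); subfont names are resolved
    relative to the directory holding the font file.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    height, ascent = int(lines[0][0]), int(lines[0][1])
    base = posixpath.dirname(font_path)
    ranges = []
    for lo, hi, name in lines[1:]:
        path = posixpath.normpath(posixpath.join(base, name))
        ranges.append((int(lo, 16), int(hi, 16), path))
    return height, ascent, ranges

def subfont_for(ranges, rune):
    """Return (subfont path, offset within its range), or None if uncovered."""
    for lo, hi, path in ranges:
        if lo <= rune <= hi:
            return path, rune - lo
    return None

SAMPLE = """18 14
0x0000 0x00FF latin1.9
0x0386 0x03F5 greek.9
0x3041 0x309e ../jis/hiragana.16
"""
height, ascent, ranges = parse_font(SAMPLE, "/lib/font/bit/pelm/jis.9.font")
```

With this sample, a Greek α resolves to `/lib/font/bit/pelm/greek.9`, a hiragana character resolves into the `jis` directory, and a rune in no listed range has no image at all.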

There are several advantages to this two-level structure.
First, it simultaneously breaks the huge Unicode space into manageable
components and provides a unifying architecture for
assembling fonts from disjoint pieces.
Second, the structure promotes sharing.
For example, we have only one set of Japanese
characters but dozens of typefaces for the Latin-1 characters,
and this structure permits us to store only one copy of the
Japanese set but use it with any Roman typeface.
Also, customization is easy.
English-speaking users who don't need Japanese characters
but may want to read an on-line Oxford English Dictionary can
assemble a custom font with the
Latin-1 (or even just ASCII) characters and the International
Phonetic Alphabet (IPA).
Moreover, to do so requires just editing a plain text file,
not using a special font editing tool.
Finally, the structure guides the design of
caching protocols to improve performance and memory usage.

To load a complete Unicode character set into each application
would consume too
much memory and, particularly on slow terminal lines, would take
unreasonably long.
Instead, Plan 9 assembles a multi-level cache structure for
each font.
An application opens a font file, reads and parses it,
and allocates a data structure.
A message written to
.CW /dev/bitblt
allocates an associated structure held in the terminal, in particular,
a bitmap to act as a cache
for recently used character images.
Other messages copy these images to bitmaps such as the screen
by loading characters from subfonts into the cache on demand and
from there to the destination bitmap.
The protocol to draw characters is in terms of cache indices,
not Unicode character numbers or UTF sequences.
These details are hidden from the application, which instead
sees only a subroutine to draw a string in a bitmap from a
given font, functions to discover character size information,
and routines to allocate and to free fonts.
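To make the idea of drawing by cache index concrete, here is a small Python sketch of a character cache that turns a string into slot numbers, loading an image on each miss. The class and method names are hypothetical, the least-recently-used eviction policy is an assumption, and the `loads` counter stands in for copying an image from a subfont.

```python
from collections import OrderedDict

class CharCache:
    """Map runes to slots of a fixed-size image cache (LRU is assumed)."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.slot_of = OrderedDict()   # rune -> slot, least recent first
        self.loads = 0                 # images loaded from subfonts so far

    def index(self, rune):
        """Return the cache slot for rune, loading its image on a miss."""
        if rune in self.slot_of:
            self.slot_of.move_to_end(rune)
            return self.slot_of[rune]
        if len(self.slot_of) < self.nslots:
            slot = len(self.slot_of)
        else:
            _, slot = self.slot_of.popitem(last=False)   # evict the oldest
        self.slot_of[rune] = slot
        self.loads += 1
        return slot

    def indices(self, text):
        """Translate a string into the cache indices sent to the terminal."""
        return [self.index(ord(c)) for c in text]
```

A real implementation must also ensure that a slot needed earlier in the same string is not reused before the string is drawn; the sketch ignores that detail.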

As needed, whole
subfonts are opened by the graphics library, read, and then downloaded
to the terminal.
They are held open by the library in an LRU-replacement list.
Even when the program closes a subfont, it is retained
in the terminal for later use.
When the application opens the subfont, it asks the terminal
if it already has a copy to avoid reading it from the file
server if possible.
This level of cache has the property that the bitmaps for, say,
all the Japanese characters are stored only once, in the terminal;
the applications read only size and width information from the terminal
and share the images.

The sizes of the character and subfont caches held by the
application are adaptive.
A simple algorithm monitors the cache miss rate to enlarge and
shrink the caches as required.
The size of the character cache is limited to a maximum of 2048 images,
which in practice seems enough even for Japanese text.
For plain ASCII-like text it naturally stays around 128 images.
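The adaptive algorithm itself is not spelled out here, so the following Python sketch shows only one plausible policy: double the cache when the recent miss rate is high and halve it when misses are rare. Only the 2048-image ceiling comes from the text; the thresholds, the floor and the step sizes are assumptions.

```python
MAX_IMAGES = 2048   # upper limit on the character cache stated above

def resize(size, hits, misses, low=0.02, high=0.10, floor=32):
    """Return a new cache size given hit and miss counts for a recent window.

    low, high and floor are illustrative values, not measured ones.
    """
    total = hits + misses
    if total == 0:
        return size
    rate = misses / total
    if rate > high:
        return min(size * 2, MAX_IMAGES)   # too many misses: grow
    if rate < low:
        return max(size // 2, floor)       # cache is oversized: shrink
    return size
```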

This mechanism sounds complicated but is implemented by only about
500 lines in the library and considerably fewer in each of the
terminal's graphics driver and
.CW 8½ .
It has the advantage that only characters that are
being used are loaded into memory.
It is also efficient: if the characters being drawn
are in the cache the extra overhead is negligible.
It works particularly well for alphabetic character sets,
but also adapts on demand for ideographic sets.
When a user first looks at Japanese text, it takes a few
seconds to read all the font data, but thereafter the
text is drawn almost as fast as regular text (the images
are larger, so they draw a little more slowly).
Also, because the bitmaps are remembered by the terminal,
if a second application then looks at Japanese text
it starts faster than the first.

We considered
building a `font server'
to cache character images and associated data
for the applications, the window system, and the terminal.
We rejected this design because, although it would isolate
many of the problems of font management in a separate program,
it would not simplify the applications.
Moreover, in a distributed system such as Plan 9 it is easy
to have too many special purpose servers.
Making the management of the fonts the concern of only
the essential components simplifies the system and makes
bootstrapping less intricate.

A completely different problem is how to type Unicode characters
as input to the system.
We selected an unused key on our ASCII keyboards
to serve as a prefix for multi-keystroke
sequences that generate Unicode characters.
For example, the character
.CW ü
is generated by the prefix key
(typically
.CW ALT
or
.CW Compose )
followed by a double quote and a lower-case
.CW u .
When that character is read by the application, from the file
.CW /dev/cons ,
it is of course presented as its UTF encoding.
Such sequences generate characters from an arbitrary set that
includes all of Latin-1 plus a selection of mathematical
and technical characters.
An arbitrary Unicode character may be generated by typing the prefix,
an upper-case X, and four hexadecimal digits that identify
the Unicode value.
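The translation from keystrokes to UTF can be sketched in a few lines of Python. The table below holds only two illustrative sequences, not the system's full table, and `COMPOSE` and `compose` are hypothetical names; the argument is the string of keys typed after the prefix.

```python
# Two illustrative sequences; the real table is much larger.
COMPOSE = {
    '"u': "\u00fc",   # double quote, u  -> u with diaeresis
    "12": "\u00bd",   # 1, 2             -> one half
}

def compose(keys):
    """Translate the keys typed after the prefix into a UTF byte sequence."""
    if len(keys) == 5 and keys[0] == "X":
        # X followed by four hexadecimal digits names the value directly.
        return chr(int(keys[1:], 16)).encode("utf-8")
    return COMPOSE[keys].encode("utf-8")   # KeyError for unknown sequences
```

For instance, `compose("X03b1")` yields the two-byte UTF encoding of the Greek letter α.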

These simple mechanisms are adequate for most of our day-to-day needs:
it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
for accented Latin letters.
For the occasional unusual character, the cut and paste features of
.CW 8½
serve well. A program called (perhaps misleadingly)
.CW unicode
takes as argument a hexadecimal value, and prints the UTF representation of that character,
which may then be picked up with the mouse and used as input.

These methods
are clearly unsatisfactory when working in a non-English language.
In the native country of such a language
the appropriate keyboard is likely to be at hand.
But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
work in a language foreign to the keyboard.

For alphabetic languages such as Greek or Russian, it is
straightforward to construct a program that does phonetic substitution,
so that, for example, typing a Latin `a' yields the Greek `α'.
Within Plan 9, such a program can be inserted transparently
between the real keyboard and a program such as the window system,
providing a manageable input device for such languages.
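A phonetic substitution filter is little more than a table lookup, as this Python sketch suggests. Only the mapping of `a` to α comes from the text; the other entries and the names `GREEK` and `to_greek` are assumptions made for illustration.

```python
# Latin key -> Greek letter; entries other than "a" are illustrative.
GREEK = {
    "a": "\u03b1",   # alpha
    "b": "\u03b2",   # beta
    "g": "\u03b3",   # gamma
    "d": "\u03b4",   # delta
    "l": "\u03bb",   # lambda
    "m": "\u03bc",   # mu
}

def to_greek(text):
    """Substitute Greek letters for mapped Latin keys; pass others through."""
    return "".join(GREEK.get(c, c) for c in text)
```

Characters with no entry, such as digits and punctuation, pass through unchanged, so the filter can sit in the input path without disturbing ordinary typing.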

For ideographic languages such as Chinese or Japanese the problem is harder.
Native users of such languages have adopted methods for dealing with
Latin keyboards that involve a hybrid technique based on phonetics
to generate a list of possible symbols followed by menu selection to
choose the desired one.
Such methods can be
effective, but their design must be rooted in information about
the language unknown to non-native speakers.
.CW Cxterm , (
a Chinese terminal emulator built by and for
Chinese programmers,
employs such a technique
[Pong and Zhang].)
Although the technical problem of implementing such a device
is easy in Plan 9\(emit is just an elaboration of the technique for
alphabetic languages\(emour lack of familiarity with such languages
has restrained our enthusiasm for building one.

The input problem is technically the least interesting but perhaps
emotionally the most important of the problems of converting a system
to an international character set.
Beyond that remain the deeper problems of internationalization
such as multi-lingual error messages and command names,
problems we are not qualified to solve.
With the ability to treat text of most languages on an equal
footing, though, we can begin down that path.
Perhaps people in non-English speaking countries will
consider adopting Plan 9, solving the input problem locally\(emperhaps
just by plugging in their local terminals\(emand begin to use
a system with at least the capacity to be international.

Dennis Ritchie provided consultation and encouragement.
Bob Flandrena converted most of the standard tools to UTF.
Brian Kernighan suffered cheerfully with several
inadequate implementations and converted
.CW troff
to UTF.
Rich Drechsler converted his Postscript driver to UTF.
John Hobby built the Postscript ☺.
We thank them all.

[ANSIC] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.

ISO/IEC DIS 10646-1:1993
\f2Information technology \-
Universal Multiple-Octet Coded Character Set (UCS) \(em
Part 1: Architecture and Basic Multilingual Plane\fP.

[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,

[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
Conf. Proc., Nashville, 1991, reprinted in this volume.

[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.

[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
A Chinese Terminal Emulator for the X Window System'',
Software\(emPractice and Experience,
Vol 22(10), 809-826, October 1992.

\f2The Unicode Standard,
Worldwide Character Encoding,
Version 1.0, Volume 1\f1,
The Unicode Consortium,
Addison Wesley,
New York,