.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界
.TL
Hello World
.br
or
.br
.ft R
Καλημέρα κόσμε
.ft
.br
or
.br
\f(Jpこんにちは 世界\fP
.AU
Rob Pike
Ken Thompson
.sp
rob,ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a slightly different form, in
.I
Proc. of the Winter 1993 USENIX Conf.,
.R
pp. 43-50,
San Diego.
It has been revised to reflect the move to 21-bit Unicode.
.FE
Plan 9 from Bell Labs has recently been converted from ASCII
to an ASCII-compatible variant of the Unicode Standard,
a 16-bit (now 21-bit) character set.
In this paper we explain the reasons for the change,
describe the character set and representation we chose,
and present the programming models and software changes
that support the new text format.
Although we stopped short of full internationalization\(emfor
example, system error messages are in Unixese, not Japanese\(emwe
believe Plan 9 is the first system to treat the representation
of all major languages on a uniform, equal footing throughout all its
software.
.AE
.SH
Introduction
.PP
The world is multilingual but most computer systems
are based on English and ASCII.
The first release of Plan 9 [Pike90], a new distributed operating
system from Bell Laboratories, seemed a good occasion
to correct this chauvinism.
It is easier to make such deep changes when building new systems than
by refitting old ones.
.PP
The ANSI C standard [ANSIC] contains some guidance on the matter of
`wide' and `multi-byte' characters but falls far short of
solving the myriad associated problems.
We could find no literature on how to convert a
.I system
to larger character sets, although some individual
.I programs
had been converted.
This paper reports what we discovered as we
explored the problem of representing multilingual
text at all levels of an operating system,
from the file system and kernel through
the applications and up to the window system
and display.
.PP
Plan 9 has not been `internationalized':
its manuals are in English,
its error messages are in English,
and it can display text that goes from left to right only.
But before we can address these other problems,
we need to handle, uniformly and comfortably,
the textual representation of all the major written languages.
That subproblem is richer than we had anticipated.
.SH
Standards
.PP
Our first step was to select a standard.
At the time (January 1992),
there were only two viable options:
ISO 10646 [ISO10646] and Unicode [Unicode].
The documents describing both proposals were still in the draft stage.
.PP
The draft of ISO 10646 was not
very attractive to us.
It defined a sparse set of 32-bit characters,
which would be
hard to implement
and have punitive storage requirements.
Also, the draft attempted to
mollify national interests by allocating
16-bit subspaces to national committees
to partition individually.
The suggested mode of use was to
``flip'' between separate national
standards to implement the international standard.
This did not strike us as a sound basis for a character set.
As well, transmitting 32-bit values in a byte stream,
such as in pipes, would be expensive and hard to implement.
Since the standard does not define a byte order for such
transmission, the byte stream would also have to carry
state to enable the values to be recovered.
.PP
The Unicode Standard is a proposal by a consortium of mostly American
computer companies formed
to protest the technical
failings of ISO 10646.
It defines a uniform 16-bit code based on the
principle of unification:
two characters are the same if they look the
same even though they are from different
languages.
This principle, called Han unification,
allows the large Japanese, Chinese, and Korean
character sets to be packed comfortably into a 16-bit representation.
.PP
We chose the Unicode Standard for its technical merits and because its
code space was better defined.
Moreover,
the Unicode Consortium was derailing the
ISO 10646 standard.
(Now, in 1995,
ISO 10646 is a standard
with one 16-bit group defined,
which is almost exactly the Unicode Standard.
As most people expected, the two standards bodies
reached a détente and
ISO 10646 and Unicode represent the same character set.)
.PP
The Unicode Standard defines an adequate character set
but an unreasonable representation.
It states that all characters
are 16 bits wide and are communicated and stored in
16-bit units.
It also reserves a pair of characters
(hexadecimal FFFE and FEFF) to detect byte order
in transmitted text, requiring state in the byte stream.
(The Unicode Consortium was thinking of files, not pipes.)
To adopt this encoding,
we would have had to convert all text going
into and out of Plan 9 between ASCII and Unicode, which cannot be done.
Within a single program, in command of all its input and output,
it is possible to define characters as 16-bit quantities;
in the context of a networked system with
hundreds of applications on diverse machines
by different manufacturers,
it is impossible.
.PP
We needed a way to adapt the Unicode Standard to the tools-and-pipes
model of text processing embodied by the Unix system.
To do that, we
needed an ASCII-compatible textual
representation of Unicode characters for transmission
and storage.
In the draft ISO standard there was an informative
(non-required)
Annex
called UTF
that provided a byte stream encoding
of the 32-bit ISO code.
The encoding uses multibyte sequences composed
from the 190 printable characters of Latin-1
to represent character values larger
than 159.
.PP
The UTF encoding has several good properties.
By far the most important is that
a byte in the ASCII range 0-127 represents
itself in UTF.
Thus UTF is backward compatible with ASCII.
.PP
UTF has other advantages.
It is a byte encoding and is
therefore byte-order independent.
ASCII control characters appear in the byte stream
only as themselves, never as an element of a sequence
encoding another character,
so newline bytes separate lines of UTF text.
Finally, ANSI C's
.CW strcmp
function applied to UTF strings preserves the ordering of Unicode characters.
.PP
To encode and decode UTF is expensive (involving multiplication,
division, and modulo operations) but workable.
UTF's major disadvantage is that the encoding
is not self-synchronizing.
It is in general impossible to find the character
boundaries in a UTF string without reading from
the beginning of the string, although in practice
control characters such as newlines,
tabs, and blanks provide synchronization points.
.PP
In August 1992,
X-Open circulated a proposal for another UTF-like
byte encoding of Unicode characters.
Their major concern was that an embedded character
in a file name
(in particular a slash)
could be part of an escape sequence in UTF and
therefore confuse a traditional file system.
Their proposal would allow all 7-bit ASCII characters
to represent themselves
.I "and only themselves"
in text.
Multibyte sequences would contain only characters
with the high bit set.
We proposed a modification to the new UTF that
would address our synchronization problem.
Our proposal, which was originally known informally as UTF-2 and FSS-UTF,
is now referred to as UTF-8 and has been approved by ISO to become
Annex P to ISO 10646.
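.PP
The synchronization property can be illustrated with a short sketch in portable C
(not Plan 9 C, and not taken from the Plan 9 library).
It assumes valid UTF-8, in which every byte after the first byte of a
multibyte character has the bit pattern 10xxxxxx,
so a program dropped at an arbitrary byte offset can find the start of
the enclosing character by backing up over such bytes.
The name
.CW utf8_sync_back
is hypothetical.

```c
#include <stddef.h>

/*
 * Sketch only: utf8_sync_back is a hypothetical helper, not a Plan 9
 * library routine.  Given a buffer of valid UTF-8 and a byte offset
 * into it, return the offset of the first byte of the character that
 * contains that byte.
 */
size_t
utf8_sync_back(const char *buf, size_t off)
{
	const unsigned char *p = (const unsigned char *)buf;

	/*
	 * Continuation bytes look like 10xxxxxx; in valid UTF-8 at
	 * most three of them follow a lead byte, so the loop is short.
	 */
	while (off > 0 && (p[off] & 0xC0) == 0x80)
		off--;
	return off;
}
```

.LP
No state from earlier in the stream is needed, which is what the original UTF could not offer.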
.PP
The model for text in Plan 9 is chosen from these
three standards*:
.FS
* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
.FE
the Unicode character set encoded as a byte stream by
UTF-8, from
(soon to be) Annex P of ISO 10646.
Although this mixture may seem like a precarious position for us to adopt,
it is not as bad as it sounds.
ISO 10646 and the Unicode Standard have converged,
other systems such as Linux have adopted the same character set and encoding,
and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
to exchange text between systems.
The prognosis for wide acceptance is good.
.PP
There are a couple of aspects of the Unicode Standard we have not faced.
One is the issue of right-to-left text such as Hebrew or Arabic.
Since that is an issue of display, not representation, we believe
we can defer that problem for the moment without affecting our
ability to solve it later.
Another issue is diacriticals and `combining characters',
which cause overstriking of multiple Unicode characters.
Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
such characters confuse the issues for Latin languages because they
generate multiple representations for accented characters.
ISO 10646 describes three levels of implementation;
in Plan 9 we decided not to address the issue.
Again, this can be labeled as a display issue and its finer points are still being debated,
so we felt comfortable deferring.  Mañana.
.PP
Although we converted Plan 9 in the altruistic interests of
serving foreign languages, we have found the large character
set attractive for other reasons.  The Unicode Standard includes many
characters\(emmathematical symbols, scientific notation,
more general punctuation, and more\(emthat we now use
daily in our work.  We no longer test our imaginations
to find ways to include non-ASCII symbols in our text;
why type
.CW :-)
when you can use the character ☺?
Most compelling is the ability to absorb documents
and data that contain non-ASCII characters; our browser for the
Oxford English Dictionary
lets us see the dictionary as it really is, with pronunciation
in the IPA font, foreign phrases properly rendered, and so on,
.I "in plain text.
.PP
As of Unicode 4.0,
characters are now 21 bits wide and the longest UTF-8 encoding of a character
requires 4 bytes.
We are adapting the system to match.
.PP
In the rest of this paper, except when
stated otherwise, the term `UTF' refers to the UTF-8 encoding
of Unicode characters as adopted by Plan 9.
.SH
C Compiler
.PP
The first program to be converted to UTF
was the C Compiler.
There are two levels of conversion.
On the syntactic level,
input to the C compiler
is UTF; on the semantic level,
the C language needs to define
how compiled programs manipulate
the UTF set.
.PP
The syntactic part is simple.
The ANSI C language standard defines the
source character set to be ASCII.
Since UTF is backward compatible with ASCII,
the compiler needs little change.
The only places where a larger character set
is allowed are in character constants, strings, and comments.
Since 7-bit ASCII characters can represent only
themselves in UTF,
the compiler does not have to be careful while looking
for the termination of a string or comment.
.PP
The Plan 9 compiler extends ANSI C to treat any Unicode
character with a value outside of the ASCII range as
an alphabetic.
To a Greek programmer or an English mathematician,
α is a sensible and now valid variable name.
.PP
On the semantic level, ANSI C allows,
but does not tie down,
the notion of a
.I "wide character
and admits string and character constants
of this type.
We chose the wide character type to be
.CW unsigned
.CW short
(now
.CW unsigned
.CW long ).
In the libraries, the word
.CW Rune
is now defined by a
.CW typedef
to be equivalent to
.CW unsigned
.CW long
and is
used to signify a Unicode character.
.PP
There are surprises; for example:
.P1
L'x'	\f1is 120\fP
\&'x'	\f1is 120\fP
L'ÿ'	\f1is 255\fP
\&'ÿ'	\f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
L'\f1α\fP'	\f1is 945\fP
\&'\f1α\fP'	\f1is illegal\fP
.P2
In the string constants,
.P1
"\f(Jpこんにちは 世界\fP"
L"\f(Jpこんにちは 世界\fP",
.P2
the former is an array of
.CW chars
with 22 elements
and a null byte,
while the latter is an array of
.CW unsigned
.CW long s
.CW Runes ) (
with 8 elements and a null
.CW Rune .
.PP
The Plan 9 library provides an output conversion function,
.CW print
(analogous to
.CW printf ),
with formats
.CW %c ,
.CW %C ,
.CW %s ,
and
.CW %S .
Since
.CW print
produces text, its output is always UTF.
The character conversion
.CW %c
(lower case) masks its argument
to 8 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW 'ÿ'
printed under
.CW %c
will be identical,
but
.CW L'\f1α\fP'
will print as the Unicode
character with decimal value 177.
The character conversion
.CW %C
(upper case) masks its argument
to 16 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW L'\f1α\fP'
will print correctly under
.CW %C ,
but
.CW 'ÿ'
will not.
The conversion
.CW %s
(lower case)
expects a pointer to
.CW char
and copies UTF sequences up to a null byte.
The conversion
.CW %S
(upper case) expects a pointer to
.CW Rune
and
performs sequential
.CW %C
conversions until a null
.CW Rune
is encountered.
.PP
Another problem in format conversion
is the definition of
.CW %10s :
does the number refer to bytes or characters?
We decided that such formats were most
often used to align output columns and
so made the number count characters.
Some programs, however, use the count
to place blank-padded strings
in fixed-sized arrays.
These programs must be found and corrected.
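.PP
The difference between the two readings of a field width can be sketched in portable C.
This is an illustration under stated assumptions, not the code inside
.CW print :
the input is taken to be valid UTF-8, and the names
.CW utf8_runes
and
.CW fmt_width
are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Count characters in valid UTF-8 by counting the bytes that start one. */
size_t
utf8_runes(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Right-justify s in a field of `width` characters (not bytes), the
 * reading of %10s chosen in the text.  Returns the number of bytes
 * written, not counting the terminating null, or -1 if dst is too small.
 */
int
fmt_width(char *dst, size_t dstsize, const char *s, size_t width)
{
	size_t nrune = utf8_runes(s);
	size_t nbyte = strlen(s);
	size_t pad = width > nrune ? width - nrune : 0;

	if (pad + nbyte + 1 > dstsize)
		return -1;
	memset(dst, ' ', pad);			/* blanks are single bytes */
	memcpy(dst + pad, s, nbyte + 1);	/* copies the null too */
	return (int)(pad + nbyte);
}
```

.LP
The byte count returned can exceed the requested width, which is why programs that
use the width to size fixed arrays need attention.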
.PP
Here is a complete example:
.P1
#include <u.h>
#include <libc.h>	/* declares print and exits */

char c[] = "\f(Jpこんにちは 世界\fP";
Rune s[] = L"\f(Jpこんにちは 世界\fP";

void
main(void)
{
	print("%d, %d\en", sizeof(c), sizeof(s));
	print("%s\en", c);
	print("%S\en", s);
	exits(nil);
}
.P2
.PP
This program prints
.CW 23,
.CW 18
(now
.CW 36 ,
since a
.CW Rune
is four bytes)
and then two identical lines of
UTF text.
In practice,
.CW %S
and
.CW L"..."
are rare in programs; one reason is
that most formatted I/O is done in unconverted UTF.
.SH
Ramifications
.PP
All programs in Plan 9 now read and write text as UTF, not ASCII.
This change breaks two deep-rooted symmetries implicit in most C programs:
.IP 1.
A character is no longer a
.CW char .
.IP 2.
The internal representation (Rune) of a character now differs from its
external representation (UTF).
.PP
In the sections that follow,
we show how these issues were faced in the layers of
system software from the operating system up to the applications.
The effects are wide-reaching and often surprising.
.SH
Operating system
.PP
Since UTF is the only format for text in Plan 9,
the interface to the operating system had to be converted to UTF.
Text strings cross the interface in several places:
command arguments,
file names,
user names (people can log in using their native name),
error messages,
and miscellaneous minor places such as commands to the I/O system.
Little change was required: null-terminated UTF strings
are equivalent to null-terminated ASCII strings for most purposes
of the operating system.
The library routines described in the next section made that
change straightforward.
.PP
The window system, once called
.CW 8.5 ,
is now rightfully called
.CW 8½ .
.SH
Libraries
.PP
A header file included by all programs (see [Pike92]) declares
the
.CW Rune
type to hold 21-bit character values:
.P1
typedef unsigned long Rune;
.P2
Also defined are several constants relevant to UTF:
.P1
enum
{
    UTFmax	= 4,	/* maximum bytes per rune */
    Runesync	= 0x80,	/* cannot be in a UTF sequence (<) */
    Runeself	= 0x80,	/* rune==UTF sequence (<) */
    Runeerror	= 0xFFFD,	/* decoding error in UTF */
    Runemax	= 0x10FFFF,	/* largest 21-bit rune */
    Runemask	= 0x1FFFFF,	/* bits used by runes (see grep) */
};
.P2
(With the original UTF,
.CW Runesync
was hexadecimal 21 and
.CW Runeself
was A0.)
.CW UTFmax
bytes are sufficient
to hold the UTF encoding of any Unicode character.
Characters of value less than
.CW Runesync
only appear in a UTF string as
themselves, never as part of a sequence encoding another character.
Characters of value less than
.CW Runeself
encode into single bytes
of the same value.
Finally, when the library detects errors in UTF input\(embyte sequences
that are not valid UTF sequences\(emit converts the first byte of the
error sequence to the character
.CW Runeerror .
There is little a rune-oriented program can do when given bad data
except exit, which is unreasonable, or carry on.
Originally the conversion routines, described below,
returned errors when given invalid UTF,
but we found ourselves repeatedly checking for errors and ignoring them.
We therefore decided to convert a bad sequence to a valid rune
and continue processing.
(The ANSI C routines, on the other hand, return errors.)
.PP
This technique does have the unfortunate property that converting
invalid UTF byte strings in and out of runes does not preserve the input,
but this circumstance only occurs when non-textual input is
given to a textual program.
The Unicode Standard defines an error character, value FFFD, to stand for
characters from other sets that it does not represent.
The
.CW Runeerror
character is a different concept, related to the encoding rather than the character set.
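.PP
The error policy described above can be sketched as a decoder in portable C.
This is not the Plan 9 source.
The name
.CW decode_rune
and the constants
.CW RUNE_ERROR
and
.CW RUNE_MAX
are hypothetical, and the decision to treat overlong forms and surrogate
values as invalid is an assumption of this sketch.

```c
#include <stdint.h>

enum {
	RUNE_ERROR = 0xFFFD,	/* same value as Runeerror in the text */
	RUNE_MAX = 0x10FFFF
};

/*
 * Store the next character of the null-terminated string s in *r and
 * return the number of bytes consumed.  A byte sequence that is not
 * valid UTF-8 yields RUNE_ERROR and consumes exactly one byte, so a
 * caller that advances by the return value always makes progress.
 */
int
decode_rune(uint32_t *r, const char *s)
{
	static const uint32_t minval[] = { 0, 0, 0x80, 0x800, 0x10000 };
	const unsigned char *p = (const unsigned char *)s;
	uint32_t c = p[0];
	int i, n;

	if (c < 0x80) {			/* ASCII: one byte, itself */
		*r = c;
		return 1;
	}
	if ((c & 0xE0) == 0xC0) {
		n = 2; c &= 0x1F;
	} else if ((c & 0xF0) == 0xE0) {
		n = 3; c &= 0x0F;
	} else if ((c & 0xF8) == 0xF0) {
		n = 4; c &= 0x07;
	} else
		goto bad;		/* stray continuation or invalid lead byte */

	for (i = 1; i < n; i++) {
		if ((p[i] & 0xC0) != 0x80)
			goto bad;	/* this test also stops at the null */
		c = (c << 6) | (p[i] & 0x3F);
	}
	/* reject overlong forms, surrogates, and values past RUNE_MAX */
	if (c < minval[n] || c > RUNE_MAX || (c >= 0xD800 && c <= 0xDFFF))
		goto bad;
	*r = c;
	return n;
bad:
	*r = RUNE_ERROR;
	return 1;
}
```

.LP
Because an error consumes only the first byte of the bad sequence, decoding resumes at the very next byte.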
.PP
The Plan 9 C library contains a number of routines for
manipulating runes.
The first set converts between runes and UTF strings:
.P1
extern	int	runetochar(char*, Rune*);
extern	int	chartorune(Rune*, char*);
extern	int	runelen(long);
extern	int	fullrune(char*, int);
.P2
.CW Runetochar
translates a single
.CW Rune
to a UTF sequence and returns the number of bytes produced.
.CW Chartorune
goes the other way, reporting how many bytes were consumed.
.CW Runelen
returns the number of bytes in the UTF encoding of a rune.
.CW Fullrune
examines a UTF string up to a specified number of bytes
and reports whether the string begins with a complete UTF encoding.
All these routines use the
.CW Runeerror
character to work around encoding problems.
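.PP
The encoding direction can be sketched in portable C as well.
The functions
.CW encode_rune
and
.CW rune_len
are hypothetical analogues of
.CW runetochar
and
.CW runelen ,
written from the public definition of UTF-8 rather than copied from the library;
substituting the error character for values that are not valid characters is an
assumption modeled on the policy described in the text.

```c
#include <stdint.h>

enum { RUNE_ERROR = 0xFFFD, RUNE_MAX = 0x10FFFF };

/*
 * Encode r into buf, which must have room for 4 bytes, and return the
 * number of bytes produced.  Values that are not valid characters
 * (surrogates, or anything above RUNE_MAX) are encoded as RUNE_ERROR.
 */
int
encode_rune(char *buf, uint32_t r)
{
	unsigned char *p = (unsigned char *)buf;

	if (r > RUNE_MAX || (r >= 0xD800 && r <= 0xDFFF))
		r = RUNE_ERROR;
	if (r < 0x80) {			/* 0xxxxxxx */
		p[0] = (unsigned char)r;
		return 1;
	}
	if (r < 0x800) {		/* 110xxxxx 10xxxxxx */
		p[0] = (unsigned char)(0xC0 | (r >> 6));
		p[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	if (r < 0x10000) {		/* 1110xxxx 10xxxxxx 10xxxxxx */
		p[0] = (unsigned char)(0xE0 | (r >> 12));
		p[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
		p[2] = (unsigned char)(0x80 | (r & 0x3F));
		return 3;
	}
	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	p[0] = (unsigned char)(0xF0 | (r >> 18));
	p[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
	p[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	p[3] = (unsigned char)(0x80 | (r & 0x3F));
	return 4;
}

/* Number of bytes encode_rune would produce for r. */
int
rune_len(uint32_t r)
{
	char tmp[4];

	return encode_rune(tmp, r);
}
```

.LP
Only shifts and masks are needed, in contrast to the multiplication, division, and modulo operations
the text attributes to the original UTF.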
.PP
There is also a set of routines for examining null-terminated UTF strings,
based on the model of the ANSI standard
.CW str
routines, but with
.CW utf
substituted for
.CW str
and
.CW rune
for
.CW chr :
.P1
extern	int	utflen(char*);
extern	char*	utfrune(char*, long);
extern	char*	utfrrune(char*, long);
extern	char*	utfutf(char*, char*);
.P2
.CW Utflen
returns the number of runes in a UTF string;
.CW utfrune
returns a pointer to the first occurrence of a rune in a UTF string;
and
.CW utfrrune
a pointer to the last.
.CW Utfutf
searches for the first occurrence of a UTF string in another UTF string.
Given the synchronizing property of UTF-8,
.CW utfutf
is the same as
.CW strstr
if the arguments point to valid UTF strings.
.PP
It is a mistake to use
.CW strchr
or
.CW strrchr
unless searching for a 7-bit ASCII character, that is, a character
less than
.CW Runeself .
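.PP
A search for a single character can be sketched in portable C by combining the two observations above:
.CW strchr
is safe for 7-bit ASCII, and for anything larger the character can be encoded and handed to
.CW strstr ,
which cannot match in the middle of a character when the input is valid UTF-8.
The name
.CW find_rune
is hypothetical, and the sketch assumes the rune is a valid character no larger than 0x10FFFF;
it is not the library's
.CW utfrune .

```c
#include <stdint.h>
#include <string.h>

/*
 * Return a pointer to the first occurrence of r in the valid UTF-8
 * string s, or NULL if it does not occur.
 */
char *
find_rune(const char *s, uint32_t r)
{
	char needle[5];
	int n = 0;

	if (r < 0x80)			/* 7-bit ASCII: strchr is safe */
		return strchr(s, (int)r);

	/* build the UTF-8 encoding of r as a short search string */
	if (r < 0x800) {
		needle[n++] = (char)(0xC0 | (r >> 6));
	} else if (r < 0x10000) {
		needle[n++] = (char)(0xE0 | (r >> 12));
		needle[n++] = (char)(0x80 | ((r >> 6) & 0x3F));
	} else {
		needle[n++] = (char)(0xF0 | (r >> 18));
		needle[n++] = (char)(0x80 | ((r >> 12) & 0x3F));
		needle[n++] = (char)(0x80 | ((r >> 6) & 0x3F));
	}
	needle[n++] = (char)(0x80 | (r & 0x3F));
	needle[n] = '\0';
	return strstr(s, needle);
}
```

.LP
Calling
.CW strchr
with the value 0xB1 on a string containing α (bytes CE B1) would report a match inside the α;
the sketch reports none, because U+00B1 encodes as C2 B1.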
.PP
We have no routines for manipulating null-terminated arrays of
.CW Runes .
Although they should probably exist for completeness, we have
found no need for them, for the same reason that
.CW %S
and
.CW L"..."
are rarely used.
.PP
Most Plan 9 programs use a new buffered I/O library, BIO, in place of
Standard I/O.
BIO contains routines to read and write UTF streams, converting to and from
runes.
.CW Bgetrune
returns, as a
.CW Rune
within a
.CW long ,
the next character in the UTF input stream;
.CW Bputrune
takes a rune and writes its UTF representation.
.CW Bungetrune
puts a rune back into the input stream for rereading.
.PP
Plan 9 programs use a simple set of macros to process command line arguments.
Converting these macros to UTF automatically updated the
argument processing of most programs.
In general,
argument flag names can no longer be held in bytes and
arrays of 256 bytes cannot be used to hold a set of flags.
.PP
We have done nothing analogous to ANSI C's locales, partly because
we do not feel qualified to define locales and partly because we remain
unconvinced of that model for dealing with the problems.
That is really more an issue of internationalization than conversion
to a larger character set; on the other hand,
because we have chosen a single character set that encompasses
most languages, some of the need for
locales is eliminated.
(We have a utility,
.CW tcs ,
that translates between UTF and other character sets.)
.PP
There are several reasons why our library does not follow the ANSI design
for wide and multi-byte characters.
The ANSI model was designed by a committee, untried, almost
as an afterthought, whereas
we wanted to design as we built.
(We made several major changes to the interface
as we became familiar with the problems involved.)
We disagree with ANSI C's handling of invalid multi-byte sequences.
Also, the ANSI C library is incomplete:
although it contains some crucial routines for handling
wide and multi-byte characters, there are some serious omissions.
For example, our software can exploit
the fact that UTF preserves ASCII characters in the byte stream.
We could remove that assumption by replacing all
calls to
.CW strchr
with
.CW utfrune
and so on.
(Because of the weaker properties of the original UTF,
we have actually done so.)
ANSI C cannot:
the standard says nothing about the representation, so portable code should
.I never
call
.CW strchr ,
yet there is no ANSI equivalent to
.CW utfrune .
ANSI C simultaneously invalidates
.CW strchr
and offers no replacement.
.PP
Finally, ANSI did nothing to integrate wide characters
into the I/O system: it gives no method for printing
wide characters.
We therefore needed to invent some things and decided to invent
everything.
In the end, some of our entry points do correspond closely to
ANSI routines\(emfor example
.CW chartorune
and
.CW runetochar
are similar to
.CW mbtowc
and
.CW wctomb \(embut
Plan 9's library defines more functionality, enough
to write real applications comfortably.
.SH
Converting the tools
.PP
The source for our tools and applications had already been converted to
work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
Standard and UTF is more involved.
Some programs needed no change at all:
.CW cat ,
for instance,
interprets its argument strings, delivered in UTF,
as file names that it passes uninterpreted to the
.CW open
system call,
and then just copies bytes from its input to its output;
it never makes decisions based on the values of the bytes.
(Plan 9
.CW cat
has no options such as
.CW -v
to complicate matters.)
Most programs, however, needed modest change.
.PP
It is difficult to
find automatically the places that need attention,
but
.CW grep
helps.
Software that uses the libraries conscientiously can be searched
for calls to library routines that examine bytes as characters:
.CW strchr ,
.CW strrchr ,
.CW strstr ,
etc.
Replacing these by calls to
.CW utfrune ,
.CW utfrrune ,
and
.CW utfutf
is enough to fix many programs.
Few tools actually need to operate on runes internally;
more typically they need only to look for the final slash in a file
name and similar trivial tasks.
Of the 170 C source programs in the top levels of
.CW /sys/src/cmd ,
only 23 now contain the word
.CW Rune .
.PP
The programs that
.I do
store runes internally
are mostly those whose
.I raison
.I d'être
is character manipulation:
.CW sam
(the text editor),
.CW sed ,
.CW sort ,
.CW tr ,
.CW troff ,
.CW 8½
(the window system and terminal emulator),
and so on.
To decide whether to compute using runes
or UTF-encoded byte strings requires balancing the cost of converting
the data when read and written
against the cost of converting relevant text on demand.
For programs such as editors that run a long time with a relatively
constant dataset, runes are the better choice.
There are space considerations too, but they are more complicated:
plain ASCII text grows when converted to runes; UTF-encoded Japanese
shrinks.
.PP
Again, it is hard to automate the conversion of a program from
.CW chars
to
.CW Runes .
It is not enough just to change the type of variables; the assumption
that bytes and characters are equivalent can be insidious.
For instance, to clear a character array by
.P1
memset(buf, 0, BUFSIZE)
.P2
becomes wrong if
.CW buf
is changed from an array of
.CW chars
to an array of
.CW Runes ,
because
.CW BUFSIZE
then counts runes rather than bytes and only part of the array is cleared.
Any program that indexes tables based on character values needs
rethinking.
Consider
.CW tr ,
which originally used multiple 256-byte arrays for the mapping.
The naïve conversion would yield multiple 1,114,112-rune arrays.
Instead Plan 9
.CW tr
saves space by building in effect
a run-encoded version of the map.
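.PP
One way a run-encoded map could look is sketched below in portable C.
The text does not describe the data structure that Plan 9
.CW tr
actually uses, so the
.CW Run
layout and the function
.CW tr_map
are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One run of the map: characters lo..hi map to to..to+(hi-lo).
 * Hypothetical layout, not the structure used by Plan 9 tr.
 */
typedef struct Run Run;
struct Run {
	uint32_t lo, hi;
	uint32_t to;
};

/* Translate r through nrun runs; a character in no run maps to itself. */
uint32_t
tr_map(const Run *run, size_t nrun, uint32_t r)
{
	size_t i;

	for (i = 0; i < nrun; i++)
		if (r >= run[i].lo && r <= run[i].hi)
			return run[i].to + (r - run[i].lo);
	return r;
}
```

.LP
The storage is proportional to the number of runs in the command's arguments rather than to the
size of the character set; a map such as a-z to A-Z is a single run.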
.PP
.CW Sort
has related problems.
The cooperation of UTF and
.CW strcmp
means that a simple sort\(emone with no options\(emcan be done
on the original UTF strings using
.CW strcmp .
With sorting options enabled, however,
.CW sort
may need to convert its input to runes: for example,
option
.CW -t\f1α\fP
requires searching for alphas in the input text to
crack the input into fields.
The field specifier
.CW +3.2
refers to 2 runes beyond the third field.
Some of the other options are hopelessly provincial:
consider the case-folding and dictionary order options
(Japanese doesn't even have an official dictionary order) or
.CW -M
which compares by case-insensitive English month name.
Handling these options involves the
larger issues of internationalization and is beyond the scope
of this paper and our expertise.
Plan 9
.CW sort
works sensibly with options that make sense relative to the input.
The simple and most important options are, however, usually meaningful.
In particular,
.CW sort
sorts UTF into the same order that
.CW look
expects.
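.PP
The option-free case can be sketched in portable C with the standard
.CW qsort
and
.CW strcmp ;
the strings are never decoded.
The names
.CW cmp_utf
and
.CW sort_utf
are hypothetical, and the sketch assumes every string is valid UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparison for an array of char* holding UTF-8 strings */
static int
cmp_utf(const void *a, const void *b)
{
	/* strcmp compares bytes as unsigned char, which matches rune order */
	return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Sort n UTF-8 strings into Unicode character order without decoding them. */
void
sort_utf(char **v, size_t n)
{
	qsort(v, n, sizeof v[0], cmp_utf);
}
```

.LP
Given the strings 世, z, α, and A, the result is A, z, α, 世, the order of their character values
(hexadecimal 41, 7A, 3B1, 4E16).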
.PP
Regular expression-matching algorithms need rethinking to
be applied to UTF text.
Deterministic automata are usually applied to bytes;
converting them to operate on variable-sized byte sequences is awkward.
On the other hand, converting the input stream to runes adds measurable
expense
and the state tables expand
from size 256 to 1,114,112; it can be expensive just to generate them.
For simple string searching,
the Boyer-Moore algorithm works with UTF provided the input is
guaranteed to be only valid UTF strings; however, it does not work
with the old UTF encoding.
At a more mundane level, even character classes are harder:
the usual bit-vector representation within a non-deterministic automaton
is unwieldy with 1,114,112 characters in the alphabet.
.PP
We compromised.
An existing library for compiling and executing regular expressions
was adapted to work on runes, with two entry points for searching
in arrays of runes and arrays of chars (the pattern is always UTF text).
Character classes are represented internally as runs of runes;
the reserved value
.CW FFFF
marks the end of the class.
Then
.I all
utilities that use regular expressions\(emeditors,
.CW grep ,
.CW awk ,
etc.\(emexcept the shell, whose notation
was grandfathered, were converted to use the library.
For some programs, there was a concomitant loss of performance,
but there was also a strong advantage.
To our knowledge, Plan 9 is the only Unix-like system
that has a single definition and implementation of
regular expressions; patterns are written and interpreted
identically by all the programs in the system.
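.PP
The representation of character classes as runs of runes might be sketched as follows in portable C.
The flat array of (low, high) pairs is our reading of the description above, not the
library's actual code, and the names
.CW in_class
and
.CW CLASS_END
are hypothetical.

```c
#include <stdint.h>

enum { CLASS_END = 0xFFFF };	/* end marker, the reserved value named in the text */

/*
 * A class is a flat array of (lo, hi) pairs followed by CLASS_END.
 * Return 1 if r falls in any pair, 0 otherwise.
 */
int
in_class(const uint32_t *cls, uint32_t r)
{
	for (; cls[0] != CLASS_END; cls += 2)
		if (r >= cls[0] && r <= cls[1])
			return 1;
	return 0;
}
```

.LP
A class such as [0-9α-ω] needs five array elements, where a bit vector over the
whole alphabet would need 1,114,112 bits.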
.PP
A handful of programs have the notion of character built into them
so strongly as to confuse the issue of what they should do with UTF input.
Such programs were treated as individual special cases.
For example,
.CW wc
is, by default, unchanged in behavior and output; a new option,
.CW -r ,
counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
its input;
.CW -b
the number of invalid sequences.
.PP
It took us several months to convert all the software in the system
to the Unicode Standard and the old UTF.
When we decided to convert from that to the new UTF,
only three things needed to be done.
First, we rewrote the library routines to encode and decode the
new UTF.  This took an evening.
Next, we converted all the files containing UTF
to the new encoding.
We wrote a trivial program to look for non-ASCII bytes in
text files and used a Plan 9 program called
.CW tcs
(translate character set) to change encodings.
Finally, we recompiled all the system software;
the library interface was unchanged, so recompilation was sufficient
to effect the transformation.
The last two steps were done concurrently and took an afternoon.
We concluded that the actual encoding is relatively unimportant to the
software; the adoption of large characters and a byte-stream encoding
.I per
.I se
are much deeper issues.
.SH
Graphics and fonts
.PP
Plan 9 provides only minimal support for plain text terminals.
It is instead designed to be used with all character input and
output mediated by a window system such as
.CW 8½ .
The window system and related software are responsible for the
display of UTF text as Unicode character images.
For plain text, the window system must provide a user-settable
.I font
that provides a (possibly empty) picture for each Unicode character.
Fancier applications that use bold and Italic characters
need multiple fonts storing multiple pictures for each
Unicode value.
All the issues are apparent, though,
in just the problem of
displaying a single image for each character, that is, the
Unicode equivalent of a plain text terminal.
With 128 or even 256 characters, a font can be just
an array of bitmaps.  With 1,114,112 characters,
a more sophisticated design is necessary.  To store the ideographs
for just Japanese as 16×16×1 bit images,
the smallest they can reasonably be, takes over a quarter of a
megabyte.  Make the images a little larger, store more bits per
pixel, and hold a copy in every running application, and the
memory cost becomes unreasonable.
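.PP
The arithmetic behind these figures is simple.
A 16 by 16 image at one bit per pixel occupies 32 bytes, so a quarter
of a megabyte (262,144 bytes) is reached at 8192 images, and images of
that size for all 1,114,112 code points would occupy 35,651,584 bytes,
about 34 megabytes.
The helper below, whose name
.CW fontbytes
is hypothetical, makes the calculation explicit; it assumes each image
is stored separately and padded to a whole number of bytes.

```c
/*
 * Bytes needed to hold n character images, each w by h pixels
 * at depth bits per pixel, with each image padded to whole bytes.
 */
unsigned long
fontbytes(unsigned long n, unsigned w, unsigned h, unsigned depth)
{
	unsigned long bits = (unsigned long)w * h * depth;

	return n * ((bits + 7) / 8);
}
```

Doubling either dimension, or moving to deeper pixels, scales the
result by the same factor.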
.PP
The structure of the bitmap graphics services is described at length elsewhere
[Pike91].
In summary, the memory holding the bitmaps is stored in the same machine that has
the display, mouse, and keyboard: the terminal in Plan 9 terminology,
the workstation in others'.
Access to that memory and associated services is provided
by device files served by system
software on the terminal.  One of those files,
.CW /dev/bitblt ,
interprets messages written upon it as requests for actions
corresponding to entry points in the graphics library:
allocate a bitmap, execute a raster operation, draw a text string, etc.
The window system
acts as a multiplexer that mediates access to the services
and resources of the terminal by simulating in each client window
a set of files mirroring those provided by the system.
That is, each window has a distinct
.CW /dev/mouse ,
.CW /dev/bitblt ,
and so on through which applications drive graphical
input and output.
.PP
One of the resources managed by
.CW 8½
and the terminal is the set of active
.I subfonts .
Each subfont holds the
bitmaps and associated data structures for a sequential set of Unicode
characters.
Subfonts are stored in files and loaded into the terminal by
.CW 8½
or an application.
For example, one subfont
might hold the images of the first 256 characters of the Unicode space,
corresponding to the Latin-1 character set;
another might hold the standard phonetic character set, Unicode characters
with values 0250 to 02E9.
These files are collected in directories corresponding to typefaces:
.CW /lib/font/bit/pelm
contains the Pellucida Monospace character set, with subfonts holding
the Latin-1, Greek, Cyrillic and other components of the typeface.
A suffix on subfont files encodes (in a subfont-specific
way) the size of the images:
.CW /lib/font/bit/pelm/latin1.9
contains the Latin-1 Pellucida Monospace characters with lower
case letters 9 pixels high;
.CW /lib/font/bit/jis/jis5400.16
contains 16-pixel high
ideographs starting at Unicode value 5400.
.PP
The subfonts do not identify which portion of the Unicode space
they cover.  Instead, a
font file, in plain text,
describes how to assemble subfonts into a complete
character set.
The font file is presented as an argument to the window system
to determine how plain text is displayed in text windows and
applications.
Here is the beginning of the font file
.CW /lib/font/bit/pelm/jis.9.font ,
which describes the layout of a font covering that portion of
the Unicode Standard for which we have characters of typical
display size, using Japanese characters
to cover the Han space:
.P1
18	14
0x0000	0x00FF	latin1.9
0x0100	0x017E	latineur.9
0x0250	0x02E9	ipa.9
0x0386	0x03F5	greek.9
0x0400	0x0475	cyrillic.9
0x2000	0x2044	../misc/genpunc.9
0x2070	0x208E	supsub.9
0x20A0	0x20AA	currency.9
0x2100	0x2138	../misc/letterlike.9
0x2190	0x21EA	../misc/arrows
0x2200	0x227F	../misc/math1
0x2280	0x22F1	../misc/math2
0x2300	0x232C	../misc/tech
0x2500	0x257F	../misc/chart
0x2600	0x266F	../misc/ding
.P2
.P1
0x3000	0x303f	../jis/jis3000.16
0x30a1	0x30fe	../jis/katakana.16
0x3041	0x309e	../jis/hiragana.16
0x4e00	0x4fff	../jis/jis4e00.16
0x5000	0x51ff	../jis/jis5000.16
\&...
.P2
1021
The first two numbers set the interline spacing of the font (18
1022
pixels) and the distance from the baseline to the top of the
1023
line (14 pixels).
1024
When characters are displayed, they are placed so as best
1025
to fit within those constraints; characters
1026
too large to fit will be truncated.
1027
The rest of the file associates subfont files
1028
with portions of Unicode space.
1029
The first four such files are in the Pellucida Monospace typeface
1030
and directory; others reside in other directories.  The file names
1031
are relative to the font file's own location.
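.PP
Because the font file is plain text, reading it takes little code.
The sketch below parses the format as it is described above: a header
line holding the two spacing numbers, then lines of the form
minimum, maximum, subfont name.
It is an illustration in portable C, not the graphics library's parser;
the types
.CW Range
and
.CW Fontdesc ,
the functions
.CW parsefont
and
.CW findrange ,
and the fixed table sizes are all assumptions.

```c
#include <stdio.h>

enum { Maxsub = 64, Maxname = 128 };	/* arbitrary limits for the sketch */

typedef struct {
	long min, max;		/* inclusive range of Unicode values */
	char name[Maxname];	/* subfont file, relative to the font file */
} Range;

typedef struct {
	int height;		/* interline spacing, in pixels */
	int ascent;		/* baseline to top of line, in pixels */
	int nrange;
	Range range[Maxsub];
} Fontdesc;

/*
 * Parse the text of a font file.  Returns 0 on success, -1 if the
 * header line is malformed.  Parsing stops at the first line that
 * is not of the form "min max name".
 */
int
parsefont(const char *text, Fontdesc *f)
{
	int used;

	f->nrange = 0;
	if(sscanf(text, "%d %d%n", &f->height, &f->ascent, &used) != 2)
		return -1;
	text += used;
	while(f->nrange < Maxsub){
		Range *r = &f->range[f->nrange];

		/* %li accepts the 0x prefix used in the font file */
		if(sscanf(text, "%li %li %127s%n", &r->min, &r->max, r->name, &used) != 3)
			break;
		text += used;
		f->nrange++;
	}
	return 0;
}

/* Return the range covering rune r, or NULL if no subfont supplies it. */
const Range*
findrange(const Fontdesc *f, long r)
{
	int i;

	for(i = 0; i < f->nrange; i++)
		if(f->range[i].min <= r && r <= f->range[i].max)
			return &f->range[i];
	return NULL;
}
```

A linear search is adequate for a table of a few dozen ranges; a
character with no covering range simply has no picture.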
1032
.PP
1033
There are several advantages to this two-level structure.
1034
First, it simultaneously breaks the huge Unicode space into manageable
1035
components and provides a unifying architecture for
1036
assembling fonts from disjoint pieces.
1037
Second, the structure promotes sharing.
1038
For example, we have only one set of Japanese
1039
characters but dozens of typefaces for the Latin-1 characters,
1040
and this structure permits us to store only one copy of the
1041
Japanese set but use it with any Roman typeface.
1042
Also, customization is easy.
1043
English-speaking users who don't need Japanese characters
1044
but may want to read an on-line Oxford English Dictionary can
1045
assemble a custom font with the
1046
Latin-1 (or even just ASCII) characters and the International
1047
Phonetic Alphabet (IPA).
1048
Moreover, to do so requires just editing a plain text file,
1049
not using a special font editing tool.
1050
Finally, the structure guides the design of
1051
caching protocols to improve performance and memory usage.
1052
.PP
1053
To load a complete Unicode character set into each application
1054
would consume too
1055
much memory and, particularly on slow terminal lines, would take
1056
unreasonably long.
1057
Instead, Plan 9 assembles a multi-level cache structure for
1058
each font.
1059
An application opens a font file, reads and parses it,
1060
and allocates a data structure.
1061
A message written to
1062
.CW /dev/bitblt
1063
allocates an associated structure held in the terminal, in particular,
1064
a bitmap to act as a cache
1065
for recently used character images.
1066
Other messages copy these images to bitmaps such as the screen
1067
by loading characters from subfonts into the cache on demand and
1068
from there to the destination bitmap.
1069
The protocol to draw characters is in terms of cache indices,
1070
not Unicode character number or UTF sequences.
1071
These details are hidden from the application, which instead
1072
sees only a subroutine to draw a string in a bitmap from a
1073
given font, functions to discover character size information,
1074
and routines to allocate and to free fonts.
1075
.PP
1076
As needed, whole
1077
subfonts are opened by the graphics library, read, and then downloaded
1078
to the terminal.
1079
They are held open by the library in an LRU-replacement list.
1080
Even when the program closes a subfont, it is retained
1081
in the terminal for later use.
1082
When the application opens the subfont, it asks the terminal
1083
if it already has a copy to avoid reading it from the file
1084
server if possible.
1085
This level of cache has the property that the bitmaps for, say,
1086
all the Japanese characters are stored only once, in the terminal;
1087
the applications read only size and width information from the terminal
1088
and share the images.
1089
.PP
1090
The sizes of the character and subfont caches held by the
1091
application are adaptive.
1092
A simple algorithm monitors the cache miss rate to enlarge and
1093
shrink the caches as required.
1094
The size of the character cache is limited to 2048 images maximum,
1095
which in practice seems enough even for Japanese text.
1096
For plain ASCII-like text it naturally stays around 128 images.
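.PP
The idea of drawing by cache index can be sketched as follows.
This is a hypothetical, fixed-size cache with least-recently-used
replacement, written in portable C; it is not the library's code, it
does not resize itself, and the names
.CW Slot ,
.CW Cache ,
.CW cacheinit ,
and
.CW cachelookup
are assumptions.
The hit and miss counters are the kind of statistic an adaptive policy
could monitor.

```c
enum { Ncache = 4 };	/* tiny, for illustration only */

typedef struct {
	long rune;		/* character held in this slot, -1 if empty */
	unsigned long age;	/* clock value at last use */
} Slot;

typedef struct {
	Slot slot[Ncache];
	unsigned long clock;		/* advances on every lookup */
	unsigned long hits, misses;
} Cache;

void
cacheinit(Cache *c)
{
	int i;

	c->clock = c->hits = c->misses = 0;
	for(i = 0; i < Ncache; i++){
		c->slot[i].rune = -1;
		c->slot[i].age = 0;
	}
}

/*
 * Return the cache index holding rune r.  On a miss the rune
 * replaces the least recently used slot; a real implementation
 * would copy the character image from its subfont at that point.
 */
int
cachelookup(Cache *c, long r)
{
	int i, old = 0;

	c->clock++;
	for(i = 0; i < Ncache; i++){
		if(c->slot[i].rune == r){
			c->slot[i].age = c->clock;
			c->hits++;
			return i;
		}
		if(c->slot[i].age < c->slot[old].age)
			old = i;
	}
	c->misses++;
	/* load the image for r into slot old here */
	c->slot[old].rune = r;
	c->slot[old].age = c->clock;
	return old;
}
```

The caller translates each character of a string to an index with
.CW cachelookup
and sends the indices, not the characters, to the drawing routine.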
1097
.PP
1098
This mechanism sounds complicated but is implemented by only about
1099
500 lines in the library and considerably less in each of the
1100
terminal's graphics driver and
1101
.CW 8½ .
1102
It has the advantage that only characters that are
1103
being used are loaded into memory.
1104
It is also efficient: if the characters being drawn
1105
are in the cache the extra overhead is negligible.
1106
It works particularly well for alphabetic character sets,
1107
but also adapts on demand for ideographic sets.
1108
When a user first looks at Japanese text, it takes a few
1109
seconds to read all the font data, but thereafter the
1110
text is drawn almost as fast as regular text (the images
1111
are larger, so draw a little slower).
1112
Also, because the bitmaps are remembered by the terminal,
1113
if a second application then looks at Japanese text
1114
it starts faster than the first.
1115
.PP
1116
We considered
1117
building a `font server'
1118
to cache character images and associated data
1119
for the applications, the window system, and the terminal.
1120
We rejected this design because, although isolating
1121
many of the problems of font management into a separate program,
1122
it didn't simplify the applications.
1123
Moreover, in a distributed system such as Plan 9 it is easy
1124
to have too many special purpose servers.
1125
Making the management of the fonts the concern of only
1126
the essential components simplifies the system and makes
1127
bootstrapping less intricate.
1128
.SH
1129
Input
1130
.PP
1131
A completely different problem is how to type Unicode characters
1132
as input to the system.
1133
We selected an unused key on our ASCII keyboards
1134
to serve as a prefix for multi-keystroke
1135
sequences that generate Unicode characters.
1136
For example, the character
1137
.CW ü
1138
is generated by the prefix key
1139
(typically
1140
.CW ALT
1141
or
1142
.CW Compose )
1143
followed by a double quote and a lower-case
1144
.CW u .
1145
When that character is read by the application, from the file
1146
.CW /dev/cons ,
1147
it is of course presented as its UTF encoding.
1148
Such sequences generate characters from an arbitrary set that
1149
includes all of Latin-1 plus a selection of mathematical
1150
and technical characters.
1151
An arbitrary Unicode character may be generated by typing the prefix,
1152
an upper case X, and four hexadecimal digits that identify
1153
the Unicode value.
1154
.PP
1155
These simple mechanisms are adequate for most of our day-to-day needs:
1156
it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
1157
for accented Latin letters.
1158
For the occasional unusual character, the cut and paste features of
1159
.CW 8½
1160
serve well.  A program called (perhaps misleadingly)
1161
.CW unicode
1162
takes as argument a hexadecimal value, and prints the UTF representation of that character,
1163
which may then be picked up with the mouse and used as input.
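.PP
The conversion from a hexadecimal value to its UTF representation can
be sketched in a few lines of portable C.
This is not the source of the
.CW unicode
program; the functions
.CW runetoutf
and
.CW hextoutf
are hypothetical names, and the encoding shown is the current four-byte
form of UTF-8 covering values up to 0x10FFFF.

```c
#include <stdlib.h>

/*
 * Encode rune r as UTF-8 into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if r is out of range.
 */
int
runetoutf(unsigned long r, unsigned char *buf)
{
	if(r < 0x80){
		buf[0] = r;
		return 1;
	}
	if(r < 0x800){
		buf[0] = 0xC0 | (r >> 6);
		buf[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	if(r < 0x10000){
		buf[0] = 0xE0 | (r >> 12);
		buf[1] = 0x80 | ((r >> 6) & 0x3F);
		buf[2] = 0x80 | (r & 0x3F);
		return 3;
	}
	if(r <= 0x10FFFF){
		buf[0] = 0xF0 | (r >> 18);
		buf[1] = 0x80 | ((r >> 12) & 0x3F);
		buf[2] = 0x80 | ((r >> 6) & 0x3F);
		buf[3] = 0x80 | (r & 0x3F);
		return 4;
	}
	return 0;
}

/*
 * Convert a hexadecimal argument such as "00FC" and encode it.
 * Returns 0 if the argument is not entirely hexadecimal digits
 * or names a value beyond 0x10FFFF.
 */
int
hextoutf(const char *hex, unsigned char *buf)
{
	char *end;
	unsigned long r = strtoul(hex, &end, 16);

	if(end == hex || *end != '\0')
		return 0;
	return runetoutf(r, buf);
}
```

For instance, the argument 00FC yields the two bytes C3 BC, the
encoding of ü.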
1164
.PP
1165
These methods
1166
are clearly unsatisfactory when working in a non-English language.
1167
In the native country of such a language
1168
the appropriate keyboard is likely to be at hand.
1169
But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
1170
work in a language foreign to the keyboard.
1171
.PP
1172
For alphabetic languages such as Greek or Russian, it is
1173
straightforward to construct a program that does phonetic substitution,
1174
so that, for example, typing a Latin `a' yields the Greek `α'.
1175
Within Plan 9, such a program can be inserted transparently
1176
between the real keyboard and a program such as the window system,
1177
providing a manageable input device for such languages.
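.PP
The heart of such a program is a lookup table.
In the sketch below only the pairing of `a' with alpha comes from the
text; the remaining entries, the table
.CW greektab ,
and the function
.CW phonetic
are illustrative assumptions, and a real filter would also have to
read the keyboard and write UTF to its consumer.

```c
#include <stddef.h>

/* A hypothetical Latin-to-Greek table; only the first row is from the text. */
static const struct {
	char latin;
	unsigned long greek;	/* Unicode value */
} greektab[] = {
	{ 'a', 0x03B1 },	/* alpha */
	{ 'b', 0x03B2 },	/* beta */
	{ 'g', 0x03B3 },	/* gamma */
	{ 'd', 0x03B4 },	/* delta */
	{ 'e', 0x03B5 },	/* epsilon */
};

/* Map one typed character; characters not in the table pass through. */
unsigned long
phonetic(unsigned long c)
{
	size_t i;

	for(i = 0; i < sizeof greektab / sizeof greektab[0]; i++)
		if((unsigned long)greektab[i].latin == c)
			return greektab[i].greek;
	return c;
}
```

Passing unmapped characters through unchanged keeps digits,
punctuation, and control characters usable while the filter is active.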
1178
.PP
1179
For ideographic languages such as Chinese or Japanese the problem is harder.
1180
Native users of such languages have adopted methods for dealing with
1181
Latin keyboards that involve a hybrid technique based on phonetics
1182
to generate a list of possible symbols followed by menu selection to
1183
choose the desired one.
1184
Such methods can be
1185
effective, but their design must be rooted in information about
1186
the language unknown to non-native speakers.
1187
.CW Cxterm , (
1188
a Chinese terminal emulator built by and for
1189
Chinese programmers,
1190
employs such a technique
1191
[Pong and Zhang].)
1192
Although the technical problem of implementing such a device
1193
is easy in Plan 9\(emit is just an elaboration of the technique for
1194
alphabetic languages\(emour lack of familiarity with such languages
1195
has restrained our enthusiasm for building one.
1196
.PP
1197
The input problem is technically the least interesting but perhaps
1198
emotionally the most important of the problems of converting a system
1199
to an international character set.
1200
Beyond that remain the deeper problems of internationalization
1201
such as multi-lingual error messages and command names,
1202
problems we are not qualified to solve.
1203
With the ability to treat text of most languages on an equal
1204
footing, though, we can begin down that path.
1205
Perhaps people in non-English speaking countries will
1206
consider adopting Plan 9, solving the input problem locally\(emperhaps
1207
just by plugging in their local terminals\(emand begin to use
1208
a system with at least the capacity to be international.
1209
.SH
1210
Acknowledgements
1211
.PP
1212
Dennis Ritchie provided consultation and encouragement.
1213
Bob Flandrena converted most of the standard tools to UTF.
1214
Brian Kernighan suffered cheerfully with several
1215
inadequate implementations and converted
1216
.CW troff
1217
to UTF.
1218
Rich Drechsler converted his Postscript driver to UTF.
1219
John Hobby built the Postscript ☺.
1220
We thank them all.
1221
.SH
1222
References
1223
.LP
1224
[ANSIC] \f2American National Standard for Information Systems \-
1225
Programming Language C\f1, American National Standards Institute, Inc.,
1226
New York, 1990.
1227
.LP
1228
[ISO10646]
1229
ISO/IEC DIS 10646-1:1993
1230
\f2Information technology \-
1231
Universal Multiple-Octet Coded Character Set (UCS) \(em
1232
Part 1: Architecture and Basic Multilingual Plane\fP.
1233
.LP
1234
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
1235
``Plan 9 from Bell Labs'',
1236
UKUUG Proc. of the Summer 1990 Conf.,
1237
London, England,
1238
1990.
1239
.LP
1240
[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
1241
Conf. Proc., Nashville, 1991, reprinted in this volume.
1242
.LP
1243
[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
1244
.LP
1245
[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
1246
A Chinese Terminal Emulator for the X Window System'',
1247
.I
1248
Software\(emPractice and Experience,
1249
.R
1250
Vol 22(1), 809-926, October 1992.
1251
.LP
1252
[Unicode]
1253
\f2The Unicode Standard,
1254
Worldwide Character Encoding,
1255
Version 1.0, Volume 1\f1,
1256
The Unicode Consortium,
1257
Addison Wesley,
1258
New York,
1259
1991.