.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界
.TL
Hello World
.br
or
.br
.ft R
Καλημέρα κόσμε
.ft
.br
or
.br
\f(Jpこんにちは 世界\fP
.AU
Rob Pike
Ken Thompson
.sp
rob,ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a slightly different form, in
.I
Proc. of the Winter 1993 USENIX Conf.,
.R
pp. 43-50,
San Diego.
It has been revised to reflect the move to 21-bit Unicode.
.FE
Plan 9 from Bell Labs has recently been converted from ASCII
to an ASCII-compatible variant of the Unicode Standard,
a 16-bit (now 21-bit) character set.
In this paper we explain the reasons for the change,
describe the character set and representation we chose,
and present the programming models and software changes
that support the new text format.
Although we stopped short of full internationalization\(emfor
example, system error messages are in Unixese, not Japanese\(emwe
believe Plan 9 is the first system to treat the representation
of all major languages on a uniform, equal footing throughout all its
software.
.AE
.SH
Introduction
.PP
The world is multilingual but most computer systems
are based on English and ASCII.
The first release of Plan 9 [Pike90], a new distributed operating
system from Bell Laboratories, seemed a good occasion
to correct this chauvinism.
It is easier to make such deep changes when building new systems than
when refitting old ones.
.PP
The ANSI C standard [ANSIC] contains some guidance on the matter of
`wide' and `multi-byte' characters but falls far short of
solving the myriad associated problems.
We could find no literature on how to convert a
.I system
to larger character sets, although some individual
.I programs
had been converted.
This paper reports what we discovered as we
explored the problem of representing multilingual
text at all levels of an operating system,
from the file system and kernel through
the applications and up to the window system
and display.
.PP
Plan 9 has not been `internationalized':
its manuals are in English,
its error messages are in English,
and it can display text that goes from left to right only.
But before we can address these other problems,
we need to handle, uniformly and comfortably,
the textual representation of all the major written languages.
That subproblem is richer than we had anticipated.
.SH
Standards
.PP
Our first step was to select a standard.
At the time (January 1992),
there were only two viable options:
ISO 10646 [ISO10646] and Unicode [Unicode].
The documents describing both proposals were still in the draft stage.
.PP
The draft of ISO 10646 was not
very attractive to us.
It defined a sparse set of 32-bit characters,
which would be
hard to implement
and have punitive storage requirements.
Also, the draft attempted to
mollify national interests by allocating
16-bit subspaces to national committees
to partition individually.
The suggested mode of use was to
``flip'' between separate national
standards to implement the international standard.
This did not strike us as a sound basis for a character set.
As well, transmitting 32-bit values in a byte stream,
such as in pipes, would be expensive and hard to implement.
Since the standard does not define a byte order for such
transmission, the byte stream would also have to carry
state to enable the values to be recovered.
.PP
The Unicode Standard is a proposal by a consortium of mostly American
computer companies formed
to protest the technical
failings of ISO 10646.
It defines a uniform 16-bit code based on the
principle of unification:
two characters are the same if they look the
same even though they are from different
languages.
This principle, called Han unification,
allows the large Japanese, Chinese, and Korean
character sets to be packed comfortably into a 16-bit representation.
.PP
We chose the Unicode Standard for its technical merits and because its
code space was better defined.
Moreover,
the Unicode Consortium was derailing the
ISO 10646 standard.
(Now, in 1995,
ISO 10646 is a standard
with one 16-bit group defined,
which is almost exactly the Unicode Standard.
As most people expected, the two standards bodies
reached a détente and
ISO 10646 and Unicode represent the same character set.)
.PP
The Unicode Standard defines an adequate character set
but an unreasonable representation.
It states that all characters
are 16 bits wide and are communicated and stored in
16-bit units.
It also reserves a pair of characters
(hexadecimal FFFE and FEFF) to detect byte order
in transmitted text, requiring state in the byte stream.
(The Unicode Consortium was thinking of files, not pipes.)
To adopt this encoding,
we would have had to convert all text going
into and out of Plan 9 between ASCII and Unicode, which cannot be done.
Within a single program, in command of all its input and output,
it is possible to define characters as 16-bit quantities;
in the context of a networked system with
hundreds of applications on diverse machines
by different manufacturers,
it is impossible.
.PP
We needed a way to adapt the Unicode Standard to the tools-and-pipes
model of text processing embodied by the Unix system.
To do that, we
needed an ASCII-compatible textual
representation of Unicode characters for transmission
and storage.
In the draft ISO standard there was an informative
(non-required)
Annex
called UTF
that provided a byte stream encoding
of the 32-bit ISO code.
The encoding uses multibyte sequences composed
from the 190 printable characters of Latin-1
to represent character values larger
than 159.
.PP
The UTF encoding has several good properties.
By far the most important is that
a byte in the ASCII range 0-127 represents
itself in UTF.
Thus UTF is backward compatible with ASCII.
.PP
UTF has other advantages.
It is a byte encoding and is
therefore byte-order independent.
ASCII control characters appear in the byte stream
only as themselves, never as an element of a sequence
encoding another character,
so newline bytes separate lines of UTF text.
Finally, ANSI C's
.CW strcmp
function applied to UTF strings preserves the ordering of Unicode characters.
.PP
To encode and decode UTF is expensive (involving multiplication,
division, and modulo operations) but workable.
UTF's major disadvantage is that the encoding
is not self-synchronizing.
It is in general impossible to find the character
boundaries in a UTF string without reading from
the beginning of the string, although in practice
control characters such as newlines,
tabs, and blanks provide synchronization points.
.PP
In August 1992,
X/Open circulated a proposal for another UTF-like
byte encoding of Unicode characters.
Their major concern was that an embedded character
in a file name
(in particular a slash)
could be part of an escape sequence in UTF and
therefore confuse a traditional file system.
Their proposal would allow all 7-bit ASCII characters
to represent themselves
.I "and only themselves"
in text.
Multibyte sequences would contain only characters
with the high bit set.
We proposed a modification to the new UTF that
would address our synchronization problem.
Our proposal, which was originally known informally as UTF-2 and FSS-UTF,
is now referred to as UTF-8 and has been approved by ISO to become
Annex P to ISO 10646.
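.PP
The synchronization property can be illustrated with a small sketch in portable C.
It assumes the standard UTF-8 bit layout, in which every byte after the first
byte of a multibyte sequence has the form 10xxxxxx, and it assumes valid input.
The name
.CW rune_start
is hypothetical and is not a Plan 9 library routine.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: given a buffer of valid UTF-8 and an arbitrary
 * byte offset into it, return the offset of the first byte of the
 * character that contains that byte. With valid input the loop steps
 * back at most three bytes.
 */
size_t rune_start(const char *buf, size_t off)
{
	assert(buf != NULL);
	while (off > 0 && is_continuation((unsigned char)buf[off]))
		off--;
	return off;
}
```

A reader dropped into the middle of a stream therefore needs to look at only a few
neighboring bytes to find a character boundary, with no need to rescan from the beginning.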
.PP
The model for text in Plan 9 is chosen from these
three standards*:
.FS
* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
.FE
the Unicode character set encoded as a byte stream by
UTF-8, from
(soon to be) Annex P of ISO 10646.
Although this mixture may seem like a precarious position for us to adopt,
it is not as bad as it sounds.
ISO 10646 and the Unicode Standard have converged,
other systems such as Linux have adopted the same character set and encoding,
and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
to exchange text between systems.
The prognosis for wide acceptance is good.
.PP
There are a couple of aspects of the Unicode Standard we have not faced.
One is the issue of right-to-left text such as Hebrew or Arabic.
Since that is an issue of display, not representation, we believe
we can defer that problem for the moment without affecting our
ability to solve it later.
Another issue is diacriticals and `combining characters',
which cause overstriking of multiple Unicode characters.
Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
such characters confuse the issues for Latin languages because they
generate multiple representations for accented characters.
ISO 10646 describes three levels of implementation;
in Plan 9 we decided not to address the issue.
Again, this can be labeled as a display issue and its finer points are still being debated,
so we felt comfortable deferring. Mañana.
.PP
Although we converted Plan 9 in the altruistic interests of
serving foreign languages, we have found the large character
set attractive for other reasons. The Unicode Standard includes many
characters\(emmathematical symbols, scientific notation,
more general punctuation, and more\(emthat we now use
daily in our work. We no longer test our imaginations
to find ways to include non-ASCII symbols in our text;
why type
.CW :-)
when you can use the character ☺?
Most compelling is the ability to absorb documents
and data that contain non-ASCII characters; our browser for the
Oxford English Dictionary
lets us see the dictionary as it really is, with pronunciation
in the IPA font, foreign phrases properly rendered, and so on,
.I "in plain text.
.PP
As of Unicode 4.0,
characters are now 21 bits wide and the longest UTF-8 encoding of a character
requires 4 bytes.
We are adapting the system to match.
.PP
In the rest of this paper, except when
stated otherwise, the term `UTF' refers to the UTF-8 encoding
of Unicode characters as adopted by Plan 9.
.SH
C Compiler
.PP
The first program to be converted to UTF
was the C Compiler.
There are two levels of conversion.
On the syntactic level,
input to the C compiler
is UTF; on the semantic level,
the C language needs to define
how compiled programs manipulate
the UTF set.
.PP
The syntactic part is simple.
The ANSI C language standard defines the
source character set to be ASCII.
Since UTF is backward compatible with ASCII,
the compiler needs little change.
The only places where a larger character set
is allowed are in character constants, strings, and comments.
Since 7-bit ASCII characters can represent only
themselves in UTF,
the compiler does not have to be careful while looking
for the termination of a string or comment.
.PP
The Plan 9 compiler extends ANSI C to treat any Unicode
character with a value outside of the ASCII range as
an alphabetic.
To a Greek programmer or an English mathematician,
α is a sensible and now valid variable name.
.PP
On the semantic level, ANSI C allows,
but does not tie down,
the notion of a
.I "wide character
and admits string and character constants
of this type.
We chose the wide character type to be
.CW unsigned
.CW short
(now
.CW unsigned
.CW long ).
In the libraries, the word
.CW Rune
is now defined by a
.CW typedef
to be equivalent to
.CW unsigned
.CW long
and is
used to signify a Unicode character.
.PP
There are surprises; for example:
.P1
L'x' \f1is 120\fP
\&'x' \f1is 120\fP
L'ÿ' \f1is 255\fP
\&'ÿ' \f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
L'\f1α\fP' \f1is 945\fP
\&'\f1α\fP' \f1is illegal\fP
.P2
In the string constants,
.P1
"\f(Jpこんにちは 世界\fP"
L"\f(Jpこんにちは 世界\fP",
.P2
the former is an array of
.CW chars
with 22 elements
and a null byte,
while the latter is an array of
.CW unsigned
.CW long s
.CW Runes ) (
with 8 elements and a null
.CW Rune .
.PP
The Plan 9 library provides an output conversion function,
.CW print
(analogous to
.CW printf ),
with formats
.CW %c ,
.CW %C ,
.CW %s ,
and
.CW %S .
Since
.CW print
produces text, its output is always UTF.
The character conversion
.CW %c
(lower case) masks its argument
to 8 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW 'ÿ'
printed under
.CW %c
will be identical,
but
.CW L'\f1α\fP'
will print as the Unicode
character with decimal value 177.
The character conversion
.CW %C
(upper case) masks its argument
to 16 bits before converting to UTF.
Thus
.CW L'ÿ'
and
.CW L'\f1α\fP'
will print correctly under
.CW %C ,
but
.CW 'ÿ'
will not.
The conversion
.CW %s
(lower case)
expects a pointer to
.CW char
and copies UTF sequences up to a null byte.
The conversion
.CW %S
(upper case) expects a pointer to
.CW Rune
and
performs sequential
.CW %C
conversions until a null
.CW Rune
is encountered.
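.PP
The masking arithmetic behind these cases can be checked with a few lines of portable C.
This is only a sketch of the mask step, not the
.CW print
implementation; the names
.CW mask_c
and
.CW mask_C
are hypothetical, and the value \-1 stands for
.CW 'ÿ'
when
.CW char
is signed, as in the table of surprises above.

```c
#include <assert.h>

/* Hypothetical stand-in for the %c mask: keep the low 8 bits. */
unsigned long mask_c(long arg)
{
	return (unsigned long)arg & 0xFF;
}

/* Hypothetical stand-in for the %C mask: keep the low 16 bits. */
unsigned long mask_C(long arg)
{
	return (unsigned long)arg & 0xFFFF;
}

/* Worked cases from the text, written as assertions. */
void mask_examples(void)
{
	assert(mask_c(255) == 255);    /* L'ÿ' under %c */
	assert(mask_c(-1) == 255);     /* 'ÿ' as a signed char under %c: same result */
	assert(mask_c(945) == 177);    /* L'α' under %c: 945 & 0xFF */
	assert(mask_C(945) == 945);    /* L'α' under %C survives */
	assert(mask_C(-1) == 0xFFFF);  /* 'ÿ' under %C is no longer 255 */
}
```

The last case shows why
.CW 'ÿ'
fails under
.CW %C :
sign extension has already turned it into a value whose low 16 bits are all ones.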
.PP
Another problem in format conversion
is the definition of
.CW %10s :
does the number refer to bytes or characters?
We decided that such formats were most
often used to align output columns and
so made the number count characters.
Some programs, however, use the count
to place blank-padded strings
in fixed-sized arrays.
These programs must be found and corrected.
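.PP
The hazard for fixed-sized arrays can be made concrete with a sketch in portable C.
It pads by character count, as the text says
.CW %10s
does, and reports how many bytes the result occupies.
The names
.CW count_runes
and
.CW pad_utf
are hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count characters in valid UTF-8 by counting non-continuation bytes. */
static int count_runes(const char *s)
{
	int n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Hypothetical helper: right-justify s in a field of `width` characters,
 * writing into dst (capacity cap bytes). Returns the number of bytes
 * written, not counting the null byte, or -1 if dst is too small.
 */
int pad_utf(char *dst, size_t cap, const char *s, int width)
{
	int pad;
	size_t len;

	assert(dst != NULL && s != NULL);
	pad = width - count_runes(s);
	len = strlen(s);
	if (pad < 0)
		pad = 0;
	if ((size_t)pad + len + 1 > cap)
		return -1;
	memset(dst, ' ', (size_t)pad);
	memcpy(dst + pad, s, len + 1);	/* copies the terminating null byte too */
	return pad + (int)len;
}
```

A ten-character field holding two Greek letters occupies twelve bytes, so a buffer
sized for ten bytes plus a null would overflow; sizing by
.CW UTFmax
bytes per character is the safe bound.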
.PP
Here is a complete example:
.P1
#include <u.h>
#include <libc.h>

char c[] = "\f(Jpこんにちは 世界\fP";
Rune s[] = L"\f(Jpこんにちは 世界\fP";

void
main(void)
{
	print("%d, %d\en", sizeof(c), sizeof(s));
	print("%s\en", c);
	print("%S\en", s);
	exits(0);
}
.P2
.PP
This program prints
.CW 23,
.CW 18
(the second figure reflects the original 16-bit
.CW Rune ;
with the 32-bit
.CW Rune
described earlier it would be 36)
and then two identical lines of
UTF text.
In practice,
.CW %S
and
.CW L"..."
are rare in programs; one reason is
that most formatted I/O is done in unconverted UTF.
.SH
Ramifications
.PP
All programs in Plan 9 now read and write text as UTF, not ASCII.
This change breaks two deep-rooted symmetries implicit in most C programs:
.IP 1.
A character is no longer a
.CW char .
.IP 2.
The internal representation (Rune) of a character now differs from its
external representation (UTF).
.PP
In the sections that follow,
we show how these issues were faced in the layers of
system software from the operating system up to the applications.
The effects are wide-reaching and often surprising.
.SH
Operating system
.PP
Since UTF is the only format for text in Plan 9,
the interface to the operating system had to be converted to UTF.
Text strings cross the interface in several places:
command arguments,
file names,
user names (people can log in using their native name),
error messages,
and miscellaneous minor places such as commands to the I/O system.
Little change was required: null-terminated UTF strings
are equivalent to null-terminated ASCII strings for most purposes
of the operating system.
The library routines described in the next section made that
change straightforward.
.PP
The window system, once called
.CW 8.5 ,
is now rightfully called
.CW 8½ .
.SH
Libraries
.PP
A header file included by all programs (see [Pike92]) declares
the
.CW Rune
type to hold 21-bit character values:
.P1
typedef unsigned long Rune;
.P2
Also defined are several constants relevant to UTF:
.P1
enum
{
	UTFmax = 4, /* maximum bytes per rune */
	Runesync = 0x80, /* cannot be in a UTF sequence (<) */
	Runeself = 0x80, /* rune==UTF sequence (<) */
	Runeerror = 0xFFFD, /* decoding error in UTF */
	Runemax = 0x10FFFF, /* largest 21-bit rune */
	Runemask = 0x1FFFFF, /* bits used by runes (see grep) */
};
.P2
(With the original UTF,
.CW Runesync
was hexadecimal 21 and
.CW Runeself
was A0.)
.CW UTFmax
bytes are sufficient
to hold the UTF encoding of any Unicode character.
Characters of value less than
.CW Runesync
only appear in a UTF string as
themselves, never as part of a sequence encoding another character.
Characters of value less than
.CW Runeself
encode into single bytes
of the same value.
Finally, when the library detects errors in UTF input\(embyte sequences
that are not valid UTF sequences\(emit converts the first byte of the
error sequence to the character
.CW Runeerror .
There is little a rune-oriented program can do when given bad data
except exit, which is unreasonable, or carry on.
Originally the conversion routines, described below,
returned errors when given invalid UTF,
but we found ourselves repeatedly checking for errors and ignoring them.
We therefore decided to convert a bad sequence to a valid rune
and continue processing.
(The ANSI C routines, on the other hand, return errors.)
.PP
This technique does have the unfortunate property that converting
invalid UTF byte strings in and out of runes does not preserve the input,
but this circumstance only occurs when non-textual input is
given to a textual program.
The Unicode Standard defines an error character, value FFFD, to stand for
characters from other sets that it does not represent.
The
.CW Runeerror
character is a different concept, related to the encoding rather than the character set.
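.PP
The error policy can be sketched as a decoder in portable C.
This is not the Plan 9 source: the name
.CW decode_rune
is hypothetical, the bit layout is that of standard UTF-8, and the choice to reject
overlong forms and values above 0x10FFFF is an assumption of this sketch.
On any malformed sequence it stores the error value and consumes exactly one byte,
which is the behavior described above.

```c
#include <assert.h>
#include <stddef.h>

enum { Runeerror = 0xFFFD };	/* same value as in the enum shown earlier */

/*
 * Decode one character from the null-terminated UTF-8 string str.
 * Stores the code point in *r and returns the number of bytes consumed.
 * A malformed sequence yields Runeerror and consumes one byte.
 */
int decode_rune(unsigned long *r, const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	unsigned long v = 0;
	int i, n = 0;

	assert(r != NULL && str != NULL);
	if (s[0] < 0x80) {		/* ASCII represents itself */
		*r = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
	else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
	else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
	else goto bad;			/* stray continuation or invalid lead byte */

	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			goto bad;	/* the null byte also stops here */
		v = (v << 6) | (s[i] & 0x3F);
	}
	/* Assumption of this sketch: reject overlong forms and out-of-range values. */
	if ((n == 2 && v < 0x80) || (n == 3 && v < 0x800) ||
	    (n == 4 && v < 0x10000) || v > 0x10FFFF)
		goto bad;
	*r = v;
	return n;
bad:
	*r = Runeerror;
	return 1;
}
```

Because the caller always receives a valid value and a positive byte count, a loop
over the input needs no error branch, which is the convenience the text argues for.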
.PP
The Plan 9 C library contains a number of routines for
manipulating runes.
The first set converts between runes and UTF strings:
.P1
extern int runetochar(char*, Rune*);
extern int chartorune(Rune*, char*);
extern int runelen(long);
extern int fullrune(char*, int);
.P2
.CW Runetochar
translates a single
.CW Rune
to a UTF sequence and returns the number of bytes produced.
.CW Chartorune
goes the other way, reporting how many bytes were consumed.
.CW Runelen
returns the number of bytes in the UTF encoding of a rune.
.CW Fullrune
examines a UTF string up to a specified number of bytes
and reports whether the string begins with a complete UTF encoding.
All these routines use the
.CW Runeerror
character to work around encoding problems.
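.PP
For the encoding direction, a minimal sketch in portable C follows.
It is not the Plan 9 source; the names
.CW rune_len
and
.CW encode_rune
are hypothetical, and mapping out-of-range values to 0xFFFD is an assumption modeled on the
.CW Runeerror
policy.
Note that only shifts and masks are needed.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to encode code point c (out-of-range values count as 0xFFFD). */
int rune_len(unsigned long c)
{
	if (c > 0x10FFFF)
		c = 0xFFFD;
	if (c < 0x80)
		return 1;
	if (c < 0x800)
		return 2;
	if (c < 0x10000)
		return 3;
	return 4;
}

/*
 * Write the UTF-8 form of c to out, which must hold at least 4 bytes,
 * and return the number of bytes written. No null byte is appended.
 */
int encode_rune(char *out, unsigned long c)
{
	assert(out != NULL);
	if (c > 0x10FFFF)
		c = 0xFFFD;
	if (c < 0x80) {
		out[0] = (char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (char)(0xC0 | (c >> 6));
		out[1] = (char)(0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (char)(0xE0 | (c >> 12));
		out[1] = (char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (char)(0x80 | (c & 0x3F));
		return 3;
	}
	out[0] = (char)(0xF0 | (c >> 18));
	out[1] = (char)(0x80 | ((c >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((c >> 6) & 0x3F));
	out[3] = (char)(0x80 | (c & 0x3F));
	return 4;
}
```

The four-byte upper bound matches the
.CW UTFmax
constant given earlier.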
.PP
There is also a set of routines for examining null-terminated UTF strings,
based on the model of the ANSI standard
.CW str
routines, but with
.CW utf
substituted for
.CW str
and
.CW rune
for
.CW chr :
.P1
extern int utflen(char*);
extern char* utfrune(char*, long);
extern char* utfrrune(char*, long);
extern char* utfutf(char*, char*);
.P2
.CW Utflen
returns the number of runes in a UTF string;
.CW utfrune
returns a pointer to the first occurrence of a rune in a UTF string;
and
.CW utfrrune
a pointer to the last.
.CW Utfutf
searches for the first occurrence of a UTF string in another UTF string.
Given the synchronizing property of UTF-8,
.CW utfutf
is the same as
.CW strstr
if the arguments point to valid UTF strings.
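.PP
A rune count in the spirit of
.CW utflen
can be sketched in a few lines of portable C.
The name
.CW utf_len
is hypothetical, and the input is assumed to be valid UTF-8, so that every character
contributes exactly one byte that is not of the form 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the characters in a null-terminated string of valid UTF-8
 * by counting the bytes that start a character.
 */
int utf_len(const char *s)
{
	int n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

Applied to the Japanese greeting used earlier in this paper, the byte length from
.CW strlen
is 22 while the count from this routine is 8, matching the array sizes given for the two string constants.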
.PP
It is a mistake to use
.CW strchr
or
.CW strrchr
unless searching for a 7-bit ASCII character, that is, a character
less than
.CW Runeself .
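.PP
The mistake is easy to reproduce in portable C.
The sketch below is not the Plan 9
.CW utfrune ;
the name
.CW find_rune
is hypothetical.
It uses
.CW strchr
only for values below 0x80 and otherwise encodes the value as UTF-8 and searches with
.CW strstr ,
which is safe on valid UTF-8 for the reason given above for
.CW utfutf .

```c
#include <assert.h>
#include <string.h>

/*
 * Return a pointer to the first occurrence of code point c in the
 * valid UTF-8 string s, or NULL if it does not occur or c is out of range.
 */
const char *find_rune(const char *s, unsigned long c)
{
	char enc[5];

	assert(s != NULL);
	if (c < 0x80)
		return strchr(s, (int)c);	/* ASCII: a byte search is correct */
	if (c < 0x800) {
		enc[0] = (char)(0xC0 | (c >> 6));
		enc[1] = (char)(0x80 | (c & 0x3F));
		enc[2] = '\0';
	} else if (c < 0x10000) {
		enc[0] = (char)(0xE0 | (c >> 12));
		enc[1] = (char)(0x80 | ((c >> 6) & 0x3F));
		enc[2] = (char)(0x80 | (c & 0x3F));
		enc[3] = '\0';
	} else if (c <= 0x10FFFF) {
		enc[0] = (char)(0xF0 | (c >> 18));
		enc[1] = (char)(0x80 | ((c >> 12) & 0x3F));
		enc[2] = (char)(0x80 | ((c >> 6) & 0x3F));
		enc[3] = (char)(0x80 | (c & 0x3F));
		enc[4] = '\0';
	} else
		return NULL;
	return strstr(s, enc);
}
```

In the UTF-8 string for café, whose last character is the byte pair C3 A9, a call to
.CW strchr
with 0xE9 (the Latin-1 and Unicode value of é) finds nothing, while a call with 0xA9
(the copyright sign) wrongly matches the second byte of é.
The rune-aware search gets both cases right.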
.PP
We have no routines for manipulating null-terminated arrays of
.CW Runes .
Although they should probably exist for completeness, we have
found no need for them, for the same reason that
.CW %S
and
.CW L"..."
are rarely used.
.PP
Most Plan 9 programs use a new buffered I/O library, BIO, in place of
Standard I/O.
BIO contains routines to read and write UTF streams, converting to and from
runes.
.CW Bgetrune
returns, as a
.CW Rune
within a
.CW long ,
the next character in the UTF input stream;
.CW Bputrune
takes a rune and writes its UTF representation.
.CW Bungetrune
puts a rune back into the input stream for rereading.
.PP
Plan 9 programs use a simple set of macros to process command line arguments.
Converting these macros to UTF automatically updated the
argument processing of most programs.
In general,
argument flag names can no longer be held in bytes and
arrays of 256 bytes cannot be used to hold a set of flags.
.PP
We have done nothing analogous to ANSI C's locales, partly because
we do not feel qualified to define locales and partly because we remain
unconvinced of that model for dealing with the problems.
That is really more an issue of internationalization than conversion
to a larger character set; on the other hand,
because we have chosen a single character set that encompasses
most languages, some of the need for
locales is eliminated.
(We have a utility,
.CW tcs ,
that translates between UTF and other character sets.)
.PP
There are several reasons why our library does not follow the ANSI design
for wide and multi-byte characters.
The ANSI model was designed by a committee, untried, almost
as an afterthought, whereas
we wanted to design as we built.
(We made several major changes to the interface
as we became familiar with the problems involved.)
We disagree with ANSI C's handling of invalid multi-byte sequences.
Also, the ANSI C library is incomplete:
although it contains some crucial routines for handling
wide and multi-byte characters, there are some serious omissions.
For example, our software can exploit
the fact that UTF preserves ASCII characters in the byte stream.
We could remove that assumption by replacing all
calls to
.CW strchr
with
.CW utfrune
and so on.
(Because of the weaker properties of the original UTF,
we have actually done so.)
ANSI C cannot:
the standard says nothing about the representation, so portable code should
.I never
call
.CW strchr ,
yet there is no ANSI equivalent to
.CW utfrune .
ANSI C simultaneously invalidates
.CW strchr
and offers no replacement.
.PP
Finally, ANSI did nothing to integrate wide characters
into the I/O system: it gives no method for printing
wide characters.
We therefore needed to invent some things and decided to invent
everything.
In the end, some of our entry points do correspond closely to
ANSI routines\(emfor example
.CW chartorune
and
.CW runetochar
are similar to
.CW mbtowc
and
.CW wctomb \(embut
Plan 9's library defines more functionality, enough
to write real applications comfortably.
.SH
Converting the tools
.PP
The source for our tools and applications had already been converted to
work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
Standard and UTF is more involved.
Some programs needed no change at all:
.CW cat ,
for instance,
interprets its argument strings, delivered in UTF,
as file names that it passes uninterpreted to the
.CW open
system call,
and then just copies bytes from its input to its output;
it never makes decisions based on the values of the bytes.
(Plan 9
.CW cat
has no options such as
.CW -v
to complicate matters.)
Most programs, however, needed modest change.
.PP
It is difficult to
find automatically the places that need attention,
but
.CW grep
helps.
Software that uses the libraries conscientiously can be searched
for calls to library routines that examine bytes as characters:
.CW strchr ,
.CW strrchr ,
.CW strstr ,
etc.
Replacing these by calls to
.CW utfrune ,
.CW utfrrune ,
and
.CW utfutf
is enough to fix many programs.
Few tools actually need to operate on runes internally;
more typically they need only to look for the final slash in a file
name and similar trivial tasks.
Of the 170 C source programs in the top levels of
.CW /sys/src/cmd ,
only 23 now contain the word
.CW Rune .
.PP
The programs that
.I do
store runes internally
are mostly those whose
.I raison
.I d'être
is character manipulation:
.CW sam
(the text editor),
.CW sed ,
.CW sort ,
.CW tr ,
.CW troff ,
.CW 8½
(the window system and terminal emulator),
and so on.
To decide whether to compute using runes
or UTF-encoded byte strings requires balancing the cost of converting
the data when read and written
against the cost of converting relevant text on demand.
For programs such as editors that run a long time with a relatively
constant dataset, runes are the better choice.
There are space considerations too, but they are more complicated:
plain ASCII text grows when converted to runes; UTF-encoded Japanese
shrinks.
|
|
|
769 |
.PP
Again, it is hard to automate the conversion of a program from
.CW chars
to
.CW Runes .
It is not enough just to change the type of variables; the assumption
that bytes and characters are equivalent can be insidious.
For instance, clearing a character array by
.P1
memset(buf, 0, BUFSIZE)
.P2
becomes wrong if
.CW buf
is changed from an array of
.CW chars
to an array of
.CW Runes .
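.PP
The pitfall can be made concrete with a short sketch.
The Rune typedef and the two function names below are assumptions made for illustration;
the only requirement is that a rune is wider than one byte.
The first function keeps the byte-oriented call and so clears only part of the array;
the second scales the count by the element size.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned int Rune;   /* assumption: a rune is wider than a byte */
enum { BUFSIZE = 64 };

/* The byte-oriented idiom: clears only the first BUFSIZE bytes,
 * which is just part of an array of BUFSIZE runes. */
void clear_naive(Rune *buf)
{
    memset(buf, 0, BUFSIZE);
}

/* Scale the count by the element size so every rune is cleared. */
void clear_runes(Rune *buf)
{
    memset(buf, 0, BUFSIZE * sizeof buf[0]);
}
```

The compiler accepts both versions without complaint, which is what makes the mistake insidious.
.PP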
Any program that indexes tables based on character values needs
rethinking.
Consider
.CW tr ,
which originally used multiple 256-byte arrays for the mapping.
The naïve conversion would yield multiple 1,114,112-rune arrays.
Instead Plan 9
.CW tr
saves space by building in effect
a run-encoded version of the map.
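.PP
One plausible shape for such a run-encoded map is sketched below.
The paper does not show the data structure that tr actually uses, so the Run type,
the map_rune function, and the choice of binary search are all assumptions.
Each run maps a contiguous range of source runes onto a contiguous range of targets,
and any rune outside every run maps to itself.

```c
#include <stddef.h>

typedef unsigned long Rune;   /* assumption: wide enough for 21 bits */

/* One run: source runes lo..hi map to to..to+(hi-lo). */
typedef struct {
    Rune lo, hi, to;
} Run;

/* Map r through a table of n runs sorted by lo with no overlaps;
 * runes outside every run map to themselves. */
Rune map_rune(const Run *runs, size_t n, Rune r)
{
    size_t left = 0, right = n;

    while (left < right) {
        size_t mid = left + (right - left) / 2;

        if (r < runs[mid].lo)
            right = mid;
        else if (r > runs[mid].hi)
            left = mid + 1;
        else
            return runs[mid].to + (r - runs[mid].lo);
    }
    return r;
}
```

A translation of a-z to A-Z then costs one table entry rather than an array indexed by
every possible rune, and the storage grows with the number of runs in the command line,
not with the size of the character set.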
.PP
.CW Sort
has related problems.
The cooperation of UTF and
.CW strcmp
means that a simple sort\(emone with no options\(emcan be done
on the original UTF strings using
.CW strcmp .
With sorting options enabled, however,
.CW sort
may need to convert its input to runes: for example,
option
.CW -t\f1α\fP
requires searching for alphas in the input text to
crack the input into fields.
The field specifier
.CW +3.2
refers to 2 runes beyond the third field.
Some of the other options are hopelessly provincial:
consider the case-folding and dictionary order options
(Japanese doesn't even have an official dictionary order) or
.CW -M ,
which compares by case-insensitive English month name.
Handling these options involves the
larger issues of internationalization and is beyond the scope
of this paper and our expertise.
Plan 9
.CW sort
works sensibly with options that make sense relative to the input.
The simple and most important options are, however, usually meaningful.
In particular,
.CW sort
sorts UTF into the same order that
.CW look
expects.
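.PP
Both halves of this paragraph can be sketched in a few lines of portable C.
The function names are hypothetical and the code assumes valid UTF-8 input.
The comparator relies on the property described above: comparing UTF strings byte by byte
with strcmp gives the same order as comparing the runes numerically.
The second function shows the kind of work a specifier such as +3.2 implies,
namely stepping over runes rather than bytes.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of pointers to UTF-8 strings. */
static int cmp_utf(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort n UTF-8 strings into rune order without decoding them. */
void sort_utf(const char **v, size_t n)
{
    qsort(v, n, sizeof v[0], cmp_utf);
}

/* Advance past n runes of valid UTF-8, stopping at the terminator. */
const char *skip_runes(const char *s, int n)
{
    while (n > 0 && *s != 0) {
        s++;
        while (((unsigned char)*s & 0xC0) == 0x80)
            s++;   /* continuation bytes belong to the same rune */
        n--;
    }
    return s;
}
```

The simple sort never looks inside a rune; only the positional option needs to know
where one rune ends and the next begins.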
.PP
Regular expression-matching algorithms need rethinking to
be applied to UTF text.
Deterministic automata are usually applied to bytes;
converting them to operate on variable-sized byte sequences is awkward.
On the other hand, converting the input stream to runes adds measurable
expense
and the state tables expand
from size 256 to 1,114,112; it can be expensive just to generate them.
For simple string searching,
the Boyer-Moore algorithm works with UTF provided the input is
guaranteed to be only valid UTF strings; however, it does not work
with the old UTF encoding.
At a more mundane level, even character classes are harder:
the usual bit-vector representation within a non-deterministic automaton
is unwieldy with 1,114,112 characters in the alphabet.
.PP
We compromised.
An existing library for compiling and executing regular expressions
was adapted to work on runes, with two entry points for searching
in arrays of runes and arrays of chars (the pattern is always UTF text).
Character classes are represented internally as runs of runes;
the reserved value
.CW FFFF
marks the end of the class.
Then
.I all
utilities that use regular expressions\(emeditors,
.CW grep ,
.CW awk ,
etc.\(emexcept the shell, whose notation
was grandfathered, were converted to use the library.
For some programs, there was a concomitant loss of performance,
but there was also a strong advantage.
To our knowledge, Plan 9 is the only Unix-like system
that has a single definition and implementation of
regular expressions; patterns are written and interpreted
identically by all the programs in the system.
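.PP
A character class stored as runs of runes might be tested as in the sketch below.
The layout of inclusive (low, high) pairs followed by the end marker is an assumption
consistent with the description above; the library's actual structure is not shown in
this paper, and the names CLASS_END and in_class are hypothetical.

```c
typedef unsigned long Rune;        /* assumption: holds 21-bit values */
enum { CLASS_END = 0xFFFF };       /* reserved end marker from the text */

/* cls holds (lo, hi) pairs of inclusive runs, then CLASS_END.
 * Returns 1 if r falls inside any run, 0 otherwise. */
int in_class(const Rune *cls, Rune r)
{
    for (; cls[0] != CLASS_END; cls += 2)
        if (cls[0] <= r && r <= cls[1])
            return 1;
    return 0;
}
```

A class covering a-z and the Greek lower-case range occupies five array elements here,
where a bit vector over the whole alphabet would need more than a million bits.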
.PP
A handful of programs have the notion of character built into them
so strongly as to confuse the issue of what they should do with UTF input.
Such programs were treated as individual special cases.
For example,
.CW wc
is, by default, unchanged in behavior and output; a new option,
.CW -r ,
counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
its input;
.CW -b
counts the number of invalid sequences.
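.PP
The two counts can be sketched as a single pass over a byte buffer.
This is an illustration, not the source of wc: the names Counts and count_runes are
hypothetical, the decoder assumes the UTF-8 bit layout with values up to 10FFFF,
and the rule that each undecodable byte counts as one invalid sequence is an assumption,
since the paper does not say how wc delimits a bad sequence.

```c
#include <stddef.h>

typedef struct {
    long runes;   /* correctly encoded runes */
    long bad;     /* bytes that could not be decoded */
} Counts;

Counts count_runes(const unsigned char *p, size_t n)
{
    Counts c = { 0, 0 };
    size_t i = 0;

    while (i < n) {
        unsigned char b = p[i];
        unsigned long r, min;
        size_t len, k;

        if (b < 0x80) {                      /* ASCII */
            c.runes++;
            i++;
            continue;
        }
        if ((b & 0xE0) == 0xC0) {
            len = 2; r = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; r = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; r = b & 0x07; min = 0x10000;
        } else {                             /* stray or illegal lead byte */
            c.bad++;
            i++;
            continue;
        }
        /* Gather continuation bytes of the form 10xxxxxx. */
        for (k = 1; k < len && i + k < n && (p[i + k] & 0xC0) == 0x80; k++)
            r = (r << 6) | (p[i + k] & 0x3F);
        if (k == len && r >= min && r <= 0x10FFFF &&
            (r < 0xD800 || r > 0xDFFF)) {
            c.runes++;
            i += len;
        } else {                             /* resynchronize one byte on */
            c.bad++;
            i++;
        }
    }
    return c;
}
```

The check against min rejects overlong encodings, so each rune has exactly one accepted
byte sequence.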
.PP
It took us several months to convert all the software in the system
to the Unicode Standard and the old UTF.
When we decided to convert from that to the new UTF,
only three things needed to be done.
First, we rewrote the library routines to encode and decode the
new UTF. This took an evening.
Next, we converted all the files containing UTF
to the new encoding.
We wrote a trivial program to look for non-ASCII bytes in
text files and used a Plan 9 program called
.CW tcs
(translate character set) to change encodings.
Finally, we recompiled all the system software;
the library interface was unchanged, so recompilation was sufficient
to effect the transformation.
The last two steps were done concurrently and took an afternoon.
We concluded that the actual encoding is relatively unimportant to the
software; the adoption of large characters and a byte-stream encoding
.I per
.I se
are much deeper issues.
.SH
Graphics and fonts
.PP
Plan 9 provides only minimal support for plain text terminals.
It is instead designed to be used with all character input and
output mediated by a window system such as
.CW 8½ .
The window system and related software are responsible for the
display of UTF text as Unicode character images.
For plain text, the window system must provide a user-settable
.I font
that provides a (possibly empty) picture for each Unicode character.
Fancier applications that use bold and italic characters
need multiple fonts storing multiple pictures for each
Unicode value.
All the issues are apparent, though,
in just the problem of
displaying a single image for each character, that is, the
Unicode equivalent of a plain text terminal.
With 128 or even 256 characters, a font can be just
an array of bitmaps. With 1,114,112 characters,
a more sophisticated design is necessary. To store the ideographs
for just Japanese as 16×16×1 bit images,
the smallest they can reasonably be, takes over a quarter of a
megabyte. Make the images a little larger, store more bits per
pixel, and hold a copy in every running application, and the
memory cost becomes unreasonable.
.PP
The structure of the bitmap graphics services is described at length elsewhere
[Pike91].
In summary, the memory holding the bitmaps resides in the same machine that has
the display, mouse, and keyboard: the terminal in Plan 9 terminology,
the workstation in others'.
Access to that memory and associated services is provided
by device files served by system
software on the terminal. One of those files,
.CW /dev/bitblt ,
interprets messages written to it as requests for actions
corresponding to entry points in the graphics library:
allocate a bitmap, execute a raster operation, draw a text string, etc.
The window system
acts as a multiplexer that mediates access to the services
and resources of the terminal by simulating in each client window
a set of files mirroring those provided by the system.
That is, each window has a distinct
.CW /dev/mouse ,
.CW /dev/bitblt ,
and so on through which applications drive graphical
input and output.
.PP
One of the resources managed by
.CW 8½
and the terminal is the set of active
.I subfonts .
Each subfont holds the
bitmaps and associated data structures for a sequential set of Unicode
characters.
Subfonts are stored in files and loaded into the terminal by
.CW 8½
or an application.
For example, one subfont
might hold the images of the first 256 characters of the Unicode space,
corresponding to the Latin-1 character set;
another might hold the standard phonetic character set, Unicode characters
with values 0250 to 02E9.
These files are collected in directories corresponding to typefaces:
.CW /lib/font/bit/pelm
contains the Pellucida Monospace character set, with subfonts holding
the Latin-1, Greek, Cyrillic and other components of the typeface.
A suffix on subfont files encodes (in a subfont-specific
way) the size of the images:
.CW /lib/font/bit/pelm/latin1.9
contains the Latin-1 Pellucida Monospace characters with lower
case letters 9 pixels high;
.CW /lib/font/bit/jis/jis5400.16
contains 16-pixel high
ideographs starting at Unicode value 5400.
.PP
The subfonts do not identify which portion of the Unicode space
they cover. Instead, a
font file, in plain text,
describes how to assemble subfonts into a complete
character set.
The font file is presented as an argument to the window system
to determine how plain text is displayed in text windows and
applications.
Here is the beginning of the font file
.CW /lib/font/bit/pelm/jis.9.font ,
which describes the layout of a font covering that portion of
the Unicode Standard for which we have characters of typical
display size, using Japanese characters
to cover the Han space:
.P1
18 14
0x0000 0x00FF latin1.9
0x0100 0x017E latineur.9
0x0250 0x02E9 ipa.9
0x0386 0x03F5 greek.9
0x0400 0x0475 cyrillic.9
0x2000 0x2044 ../misc/genpunc.9
0x2070 0x208E supsub.9
0x20A0 0x20AA currency.9
0x2100 0x2138 ../misc/letterlike.9
0x2190 0x21EA ../misc/arrows
0x2200 0x227F ../misc/math1
0x2280 0x22F1 ../misc/math2
0x2300 0x232C ../misc/tech
0x2500 0x257F ../misc/chart
0x2600 0x266F ../misc/ding
.P2
.P1
0x3000 0x303f ../jis/jis3000.16
0x30a1 0x30fe ../jis/katakana.16
0x3041 0x309e ../jis/hiragana.16
0x4e00 0x4fff ../jis/jis4e00.16
0x5000 0x51ff ../jis/jis5000.16
\&...
.P2
The first two numbers set the interline spacing of the font (18
pixels) and the distance from the baseline to the top of the
line (14 pixels).
When characters are displayed, they are placed so as best
to fit within those constraints; characters
too large to fit will be truncated.
The rest of the file associates subfont files
with portions of Unicode space.
The first four such files are in the Pellucida Monospace typeface
and directory; others reside in other directories. The file names
are relative to the font file's own location.
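.PP
Reading such a file takes little code, as the following sketch suggests.
It is not the graphics library's parser: the types Range and Font, the functions
parse_font and subfont_for, and the fixed limits are hypothetical.
It assumes only the layout shown above, two decimal numbers followed by
lines of two hexadecimal values and a file name.

```c
#include <stddef.h>
#include <stdio.h>

enum { MAXRANGE = 64 };

typedef struct {
    unsigned long lo, hi;   /* inclusive range of Unicode values */
    char file[64];          /* subfont file, relative to the font file */
} Range;

typedef struct {
    int height, ascent;     /* interline spacing; baseline to top of line */
    int n;                  /* number of ranges in use */
    Range r[MAXRANGE];
} Font;

/* Parse the text of a font file. Returns the number of ranges read,
 * or -1 if the two leading numbers are missing. */
int parse_font(const char *text, Font *f)
{
    int used = 0;

    f->n = 0;
    if (sscanf(text, "%d %d%n", &f->height, &f->ascent, &used) != 2)
        return -1;
    text += used;
    while (f->n < MAXRANGE) {
        Range *r = &f->r[f->n];

        if (sscanf(text, "%lx %lx %63s%n",
                   &r->lo, &r->hi, r->file, &used) != 3)
            break;
        text += used;
        f->n++;
    }
    return f->n;
}

/* Return the subfont file that covers rune, or NULL if none does. */
const char *subfont_for(const Font *f, unsigned long rune)
{
    int i;

    for (i = 0; i < f->n; i++)
        if (f->r[i].lo <= rune && rune <= f->r[i].hi)
            return f->r[i].file;
    return NULL;
}
```

A linear scan suffices for a table of a few dozen ranges; a rune that falls in no range
simply has no image, which matches the possibly empty picture mentioned earlier.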
.PP
There are several advantages to this two-level structure.
First, it simultaneously breaks the huge Unicode space into manageable
components and provides a unifying architecture for
assembling fonts from disjoint pieces.
Second, the structure promotes sharing.
For example, we have only one set of Japanese
characters but dozens of typefaces for the Latin-1 characters,
and this structure permits us to store only one copy of the
Japanese set but use it with any Roman typeface.
Third, customization is easy.
English-speaking users who don't need Japanese characters
but may want to read an on-line Oxford English Dictionary can
assemble a custom font with the
Latin-1 (or even just ASCII) characters and the International
Phonetic Alphabet (IPA).
Moreover, doing so requires only editing a plain text file,
not using a special font editing tool.
Finally, the structure guides the design of
caching protocols to improve performance and memory usage.
.PP
To load a complete Unicode character set into each application
would consume too
much memory and, particularly on slow terminal lines, would take
unreasonably long.
Instead, Plan 9 assembles a multi-level cache structure for
each font.
An application opens a font file, reads and parses it,
and allocates a data structure.
A message written to
.CW /dev/bitblt
allocates an associated structure held in the terminal, in particular,
a bitmap to act as a cache
for recently used character images.
Other messages copy these images to bitmaps such as the screen
by loading characters from subfonts into the cache on demand and
from there to the destination bitmap.
The protocol to draw characters is in terms of cache indices,
not Unicode character numbers or UTF sequences.
These details are hidden from the application, which instead
sees only a subroutine to draw a string in a bitmap from a
given font, functions to discover character size information,
and routines to allocate and to free fonts.
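.PP
The character cache can be pictured as a small table from runes to cache indices.
The sketch below is only a model of that idea.
The replacement policy of the real character cache is not stated in this paper,
so least-recently-used replacement is an assumption, as are the names Slot, Cache and
cache_index and the deliberately tiny table size.
A miss is the point at which the real system would load an image from a subfont.

```c
enum { NCACHE = 4 };   /* tiny on purpose; an assumption for the demo */

typedef struct {
    unsigned long rune;   /* rune whose image occupies this slot */
    unsigned long age;    /* time of last use; 0 means never used */
} Slot;

typedef struct {
    Slot slot[NCACHE];
    unsigned long clock;  /* counts lookups */
    long hits, misses;
} Cache;

/* Return the cache index for rune r, replacing the least recently
 * used slot on a miss. The cache must start out zero-initialized. */
int cache_index(Cache *c, unsigned long r)
{
    int i, victim = 0;

    c->clock++;
    for (i = 0; i < NCACHE; i++)
        if (c->slot[i].age != 0 && c->slot[i].rune == r) {
            c->slot[i].age = c->clock;
            c->hits++;
            return i;
        }
    for (i = 1; i < NCACHE; i++)       /* oldest slot; empty ones first */
        if (c->slot[i].age < c->slot[victim].age)
            victim = i;
    c->slot[victim].rune = r;          /* the image would be loaded here */
    c->slot[victim].age = c->clock;
    c->misses++;
    return victim;
}
```

Drawing a string then amounts to translating each rune to an index and sending the
indices, which is why the drawing protocol need not mention Unicode values at all.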
.PP
As needed, whole
subfonts are opened by the graphics library, read, and then downloaded
to the terminal.
They are held open by the library in an LRU-replacement list.
Even when the program closes a subfont, it is retained
in the terminal for later use.
When the application opens the subfont, it asks the terminal
if it already has a copy to avoid reading it from the file
server if possible.
This level of cache has the property that the bitmaps for, say,
all the Japanese characters are stored only once, in the terminal;
the applications read only size and width information from the terminal
and share the images.
.PP
The sizes of the character and subfont caches held by the
application are adaptive.
A simple algorithm monitors the cache miss rate to enlarge and
shrink the caches as required.
The size of the character cache is limited to a maximum of 2048 images,
which in practice seems enough even for Japanese text.
For plain ASCII-like text it naturally stays around 128 images.
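.PP
The paper does not give the adaptive algorithm, so the following is a purely hypothetical
policy of the same general kind.
Only the ceiling of 2048 images comes from the text; the floor of 128, the doubling and
halving, and the 25% and 1% thresholds are invented for illustration.

```c
enum { MINCACHE = 128, MAXCACHE = 2048 };   /* floor is an assumption */

/* Given the current size and the hit and miss counts observed since
 * the last adjustment, return the size to use next. */
int resize_cache(int size, long hits, long misses)
{
    long total = hits + misses;

    if (total == 0)
        return size;
    if (misses * 4 > total) {            /* more than 25% misses: grow */
        size *= 2;
        if (size > MAXCACHE)
            size = MAXCACHE;
    } else if (misses * 100 < total) {   /* under 1% misses: shrink */
        size /= 2;
        if (size < MINCACHE)
            size = MINCACHE;
    }
    return size;
}
```

Under a policy like this, alphabetic text settles at the floor, while a burst of
ideographs pushes the cache toward the ceiling for as long as the misses continue.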
.PP
This mechanism sounds complicated but is implemented by only about
500 lines in the library and considerably fewer in each of the
terminal's graphics driver and
.CW 8½ .
It has the advantage that only characters that are
being used are loaded into memory.
It is also efficient: if the characters being drawn
are in the cache the extra overhead is negligible.
It works particularly well for alphabetic character sets,
but also adapts on demand for ideographic sets.
When a user first looks at Japanese text, it takes a few
seconds to read all the font data, but thereafter the
text is drawn almost as fast as regular text (the images
are larger, so they draw a little more slowly).
Also, because the bitmaps are remembered by the terminal,
if a second application then looks at Japanese text
it starts faster than the first.
.PP
We considered
building a `font server'
to cache character images and associated data
for the applications, the window system, and the terminal.
We rejected this design because, although it isolates
many of the problems of font management in a separate program,
it does not simplify the applications.
Moreover, in a distributed system such as Plan 9 it is easy
to have too many special purpose servers.
Making the management of the fonts the concern of only
the essential components simplifies the system and makes
bootstrapping less intricate.
.SH
Input
.PP
A completely different problem is how to type Unicode characters
as input to the system.
We selected an unused key on our ASCII keyboards
to serve as a prefix for multi-keystroke
sequences that generate Unicode characters.
For example, the character
.CW ü
is generated by the prefix key
(typically
.CW ALT
or
.CW Compose )
followed by a double quote and a lower-case
.CW u .
When that character is read by the application, from the file
.CW /dev/cons ,
it is of course presented as its UTF encoding.
Such sequences generate characters from an arbitrary set that
includes all of Latin-1 plus a selection of mathematical
and technical characters.
An arbitrary Unicode character may be generated by typing the prefix,
an upper case X, and four hexadecimal digits that identify
the Unicode value.
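.PP
The two kinds of sequence can be sketched as one lookup function.
This is not the keyboard driver's code: the names Compose, hexval and compose are
hypothetical, and the table holds only the two-keystroke sequences that this paper
mentions, where a real table would be much larger.
The argument is the keystrokes typed after the prefix key.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    char keys[3];          /* two keystrokes after the prefix key */
    unsigned long rune;
} Compose;

/* Only the sequences mentioned in the text. */
static const Compose table[] = {
    { { '"', 'u', 0 }, 0x00FC },   /* u with diaeresis */
    { { '1', '2', 0 }, 0x00BD },   /* one half */
};

/* Value of a hexadecimal digit, or -1 (assumes an ASCII-based set). */
static int hexval(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Returns the rune denoted by keys, or -1 if the sequence is unknown. */
long compose(const char *keys)
{
    size_t i;

    if (keys[0] == 'X') {              /* X followed by four hex digits */
        unsigned long r = 0;

        for (i = 1; i <= 4; i++) {
            int d = hexval((unsigned char)keys[i]);

            if (d < 0)
                return -1;
            r = r * 16 + (unsigned long)d;
        }
        return keys[5] == 0 ? (long)r : -1;
    }
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(keys, table[i].keys) == 0)
            return (long)table[i].rune;
    return -1;
}
```

The resulting rune would then be encoded as UTF before the application reads it.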
.PP
These simple mechanisms are adequate for most of our day-to-day needs:
it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
for accented Latin letters.
For the occasional unusual character, the cut and paste features of
.CW 8½
serve well. A program called (perhaps misleadingly)
.CW unicode
takes a hexadecimal value as its argument and prints the UTF representation of that character,
which may then be picked up with the mouse and used as input.
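.PP
The core of such a program is a conversion from a hexadecimal string to UTF bytes.
The sketch below is not the source of the unicode command; the function name hex_to_utf
is hypothetical, and it assumes the UTF-8 bit layout with values up to 10FFFF.
It does not reject surrogate values.

```c
#include <stdlib.h>

/* Parse hex as a code point and write its UTF-8 form, NUL-terminated,
 * into out. Returns the byte count, or 0 if hex is not a valid value. */
int hex_to_utf(const char *hex, char out[5])
{
    char *end;
    unsigned long r = strtoul(hex, &end, 16);
    int n = 0;

    if (end == hex || *end != 0 || r > 0x10FFFF)
        return 0;
    if (r < 0x80) {
        out[n++] = (char)r;
    } else if (r < 0x800) {
        out[n++] = (char)(0xC0 | (r >> 6));
        out[n++] = (char)(0x80 | (r & 0x3F));
    } else if (r < 0x10000) {
        out[n++] = (char)(0xE0 | (r >> 12));
        out[n++] = (char)(0x80 | ((r >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (r & 0x3F));
    } else {
        out[n++] = (char)(0xF0 | (r >> 18));
        out[n++] = (char)(0x80 | ((r >> 12) & 0x3F));
        out[n++] = (char)(0x80 | ((r >> 6) & 0x3F));
        out[n++] = (char)(0x80 | (r & 0x3F));
    }
    out[n] = 0;
    return n;
}
```

Printing the buffer to a window that displays UTF shows the character itself,
ready to be picked up with the mouse.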
.PP
These methods
are clearly unsatisfactory when working in a non-English language.
In the native country of such a language
the appropriate keyboard is likely to be at hand.
But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
work in a language foreign to the keyboard.
.PP
For alphabetic languages such as Greek or Russian, it is
straightforward to construct a program that does phonetic substitution,
so that, for example, typing a Latin `a' yields the Greek `α'.
Within Plan 9, such a program can be inserted transparently
between the real keyboard and a program such as the window system,
providing a manageable input device for such languages.
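.PP
The substitution itself is a table lookup, as this sketch suggests.
Only the pairing of Latin a with Greek alpha comes from the text; the other table entries,
the names Pair and translit, and the decision to pass unknown characters through unchanged
are assumptions for illustration.

```c
#include <stddef.h>

typedef unsigned long Rune;   /* assumption: holds 21-bit values */

typedef struct {
    char latin;
    Rune greek;
} Pair;

/* Illustrative table; only a -> alpha is taken from the text. */
static const Pair greek[] = {
    { 'a', 0x03B1 },   /* alpha */
    { 'b', 0x03B2 },   /* beta */
    { 'g', 0x03B3 },   /* gamma */
    { 'd', 0x03B4 },   /* delta */
};

/* Map a typed Latin character to a Greek rune, or return it unchanged. */
Rune translit(int c)
{
    size_t i;

    for (i = 0; i < sizeof greek / sizeof greek[0]; i++)
        if (greek[i].latin == c)
            return greek[i].greek;
    return (Rune)c;
}
```

A filter built around such a function would read keystrokes, translate them,
and write the UTF encoding of each result to the program downstream.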
.PP
For ideographic languages such as Chinese or Japanese the problem is harder.
Native users of such languages have adopted methods for dealing with
Latin keyboards that involve a hybrid technique based on phonetics
to generate a list of possible symbols followed by menu selection to
choose the desired one.
Such methods can be
effective, but their design must be rooted in information about
the language unknown to non-native speakers.
.CW Cxterm , (
a Chinese terminal emulator built by and for
Chinese programmers,
employs such a technique
[Pong and Zhang].)
Although the technical problem of implementing such a device
is easy in Plan 9\(emit is just an elaboration of the technique for
alphabetic languages\(emour lack of familiarity with such languages
has restrained our enthusiasm for building one.
.PP
The input problem is technically the least interesting but perhaps
emotionally the most important of the problems of converting a system
to an international character set.
Beyond that remain the deeper problems of internationalization
such as multi-lingual error messages and command names,
problems we are not qualified to solve.
With the ability to treat text of most languages on an equal
footing, though, we can begin down that path.
Perhaps people in non-English speaking countries will
consider adopting Plan 9, solving the input problem locally\(emperhaps
just by plugging in their local terminals\(emand begin to use
a system with at least the capacity to be international.
.SH
Acknowledgements
.PP
Dennis Ritchie provided consultation and encouragement.
Bob Flandrena converted most of the standard tools to UTF.
Brian Kernighan suffered cheerfully with several
inadequate implementations and converted
.CW troff
to UTF.
Rich Drechsler converted his PostScript driver to UTF.
John Hobby built the PostScript ☺.
We thank them all.
.SH
References
.LP
[ANSIC] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[ISO10646]
ISO/IEC DIS 10646-1:1993
\f2Information technology \-
Universal Multiple-Octet Coded Character Set (UCS) \(em
Part 1: Architecture and Basic Multilingual Plane\fP.
.LP
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990.
.LP
[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
Conf. Proc., Nashville, 1991, reprinted in this volume.
.LP
[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
.LP
[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
A Chinese Terminal Emulator for the X Window System'',
.I
Software\(emPractice and Experience,
.R
Vol 22(10), 809-826, October 1992.
.LP
[Unicode]
\f2The Unicode Standard,
Worldwide Character Encoding,
Version 1.0, Volume 1\f1,
The Unicode Consortium,
Addison Wesley,
New York,
1991.