.HTML "How to Use the Plan 9 C Compiler
.TL
How to Use the Plan 9 C Compiler*
.AU
Rob Pike
rob@plan9.bell-labs.com
.SH
Introduction
.FS
* This paper has been revised to reflect the move to 21-bit Unicode.
.FE
.PP
The C compiler on Plan 9 is a wholly new program; in fact
it was the first piece of software written for what would
eventually become Plan 9 from Bell Labs.
Programmers familiar with existing C compilers will find
a number of differences both in the language the Plan 9 compiler
accepts and in how the compiler is used.
.PP
The compiler is really a set of compilers, one for each
architecture \(em MIPS, SPARC, Intel 386, Power PC, ARM, etc. \(em
that accept a dialect of ANSI C and efficiently produce
fairly good code for the target machine.
There is a packaging of the compiler that accepts strict ANSI C for
a POSIX environment, but this document focuses on the
native Plan 9 environment, that in which all the system source and
almost all the utilities are written.
.SH
Source
.PP
The language accepted by the compilers is the core 1989 ANSI C language
with some modest extensions,
a greatly simplified preprocessor,
a smaller library that includes system calls and related facilities,
and a completely different structure for include files.
.PP
Official ANSI C accepts the old (K&R) style of declarations for
functions; the Plan 9 compilers
are more demanding.
Without an explicit run-time flag
.CW -B ) (
whose use is discouraged, the compilers insist
on new-style function declarations, that is, prototypes for
function arguments.
The function declarations in the libraries' include files are
all in the new style so the interfaces are checked at compile time.
For C programmers who have not yet switched to function prototypes
the clumsy syntax may seem repellent but the payoff in stronger typing
is substantial.
Those who wish to import existing software to Plan 9 are urged
to use the opportunity to update their code.
.PP
The compilers include an integrated preprocessor that accepts the familiar
.CW #include ,
.CW #define
for macros both with and without arguments,
.CW #undef ,
.CW #line ,
.CW #ifdef ,
.CW #ifndef ,
and
.CW #endif .
It
supports neither
.CW #if
nor
.CW ## ,
although it does
honor a few
.CW #pragmas .
The
.CW #if
directive was omitted because it greatly complicates the
preprocessor, is never necessary, and is usually abused.
Conditional compilation in general makes code hard to understand;
the Plan 9 source uses it sparingly.
Also, because the compilers remove dead code, regular
.CW if
statements with constant conditions are more readable equivalents to many
.CW #ifs .
To compile imported code ineluctably fouled by
.CW #if
there is a separate command,
.CW /bin/cpp ,
that implements the complete ANSI C preprocessor specification.
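.PP
As a minimal sketch of replacing
.CW #if
with an ordinary
.CW if
on a constant, consider the function below.
The names
.CW Debug
and
.CW bufsize
are hypothetical and do not come from the Plan 9 source; the code is plain ANSI C so it can be tried with any compiler.

```c
enum { Debug = 0 };	/* hypothetical compile-time switch */

/*
 * Return the buffer size to use.  Both branches are parsed and
 * type-checked whatever the value of Debug.
 */
int
bufsize(void)
{
	if(Debug)
		return 64;	/* small buffers, convenient when debugging */
	return 8192;
}
```

.PP
Because the condition is a constant, a compiler that removes dead code can discard the unused branch, yet that branch is still compiled and so cannot quietly fall out of date.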
.PP
Include files fall into two groups: machine-dependent and machine-independent.
The machine-independent files occupy the directory
.CW /sys/include ;
the others are placed in a directory appropriate to the machine, such as
.CW /mips/include .
The compiler searches for include files
first in the machine-dependent directory and then
in the machine-independent directory.
At the time of writing there are thirty-one machine-independent include
files and two (per machine) machine-dependent ones:
.CW <ureg.h>
and
.CW <u.h> .
The first describes the layout of registers on the system stack,
for use by the debugger.
The second defines some
architecture-dependent types such as
.CW jmp_buf
for
.CW setjmp
and the
.CW va_arg
and
.CW va_list
macros for handling arguments to variadic functions,
as well as a set of
.CW typedef
abbreviations for
.CW unsigned
.CW short
and so on.
.PP
Here is an excerpt from
.CW /386/include/u.h :
.P1
#define nil		((void*)0)
typedef	unsigned short	ushort;
typedef	unsigned char	uchar;
typedef unsigned long	ulong;
typedef unsigned int	uint;
typedef   signed char	schar;
typedef	long long       vlong;

typedef long	jmp_buf[2];
#define	JMPBUFSP	0
#define	JMPBUFPC	1
#define	JMPBUFDPC	0
.P2
Plan 9 programs use
.CW nil
for the name of the zero-valued pointer.
The type
.CW vlong
is the largest integer type available; on most architectures it
is a 64-bit value.
A couple of other types in
.CW <u.h>
are
.CW u32int ,
which is guaranteed to have exactly 32 bits (a possibility on all the supported architectures) and
.CW mpdigit ,
which is used by the multiprecision math package
.CW <mp.h> .
The
.CW #define
constants permit an architecture-independent (but compiler-dependent)
implementation of stack-switching using
.CW setjmp
and
.CW longjmp .
.PP
Every Plan 9 C program begins
.P1
#include <u.h>
.P2
because all the other installed header files use the
.CW typedefs
declared in
.CW <u.h> .
.PP
In strict ANSI C, include files are grouped to collect related functions
in a single file: one for string functions, one for memory functions,
one for I/O, and none for system calls.
Each include file is protected by an
.CW #ifdef
to guarantee its contents are seen by the compiler only once.
Plan 9 takes a different approach.  Other than a few include
files that define external formats such as archives, the files in
.CW /sys/include
correspond to
.I libraries.
If a program is using a library, it includes the corresponding header.
The default C library comprises string functions, memory functions, and
so on, largely as in ANSI C, some formatted I/O routines,
plus all the system calls and related functions.
To use these functions, one must
.CW #include
the file
.CW <libc.h> ,
which in turn must follow
.CW <u.h> ,
to define their prototypes for the compiler.
Here is the complete source to the traditional first C program:
.P1
#include <u.h>
#include <libc.h>

void
main(void)
{
	print("hello world\en");
	exits(0);
}
.P2
The
.CW print
routine and its relatives
.CW fprint
and
.CW sprint
resemble the similarly-named functions in Standard I/O but are not
attached to a specific I/O library.
In Plan 9
.CW main
is not integer-valued; it should call
.CW exits ,
which takes a string argument (or null; here ANSI C promotes the 0 to a
.CW char* ).
All these functions are, of course, documented in the Programmer's Manual.
.PP
To use
.CW printf ,
.CW <stdio.h>
must be included to define the function prototype for
.CW printf :
.P1
#include <u.h>
#include <libc.h>
#include <stdio.h>

void
main(int argc, char *argv[])
{
	printf("%s: hello world; argc = %d\en", argv[0], argc);
	exits(0);
}
.P2
In practice, Standard I/O is not used much in Plan 9.  I/O libraries are
discussed in a later section of this document.
.PP
There are libraries for handling regular expressions, raster graphics,
windows, and so on, and each has an associated include file.
The manual for each library states which include files are needed.
The files are not protected against multiple inclusion and themselves
contain no nested
.CW #includes .
Instead the
programmer is expected to sort out the requirements
and to
.CW #include
the necessary files once at the top of each source file.  In practice this is
trivial: this way of handling include files is so straightforward
that it is rare for a source file to contain more than half a dozen
.CW #includes .
.PP
The compilers do their own register allocation so the
.CW register
keyword is ignored.
For different reasons,
.CW volatile
and
.CW const
are also ignored.
.PP
To make it easier to share code with other systems, Plan 9 has a version
of the compiler,
.CW pcc ,
that provides the standard ANSI C preprocessor, headers, and libraries
with POSIX extensions.
.CW Pcc
is recommended only
when broad external portability is mandated.  It compiles slower,
produces slower code (it takes extra work to simulate POSIX on Plan 9),
eliminates those parts of the Plan 9 interface
not related to POSIX, and illustrates the clumsiness of an environment
designed by committee.
.CW Pcc
is described in more detail in
.I
APE\(emThe ANSI/POSIX Environment,
.R
by Howard Trickey.
.SH
Process
.PP
Each CPU architecture supported by Plan 9 is identified by a single,
arbitrary, alphanumeric character:
.CW k
for SPARC,
.CW q
for 32-bit Power PC,
.CW v
for MIPS,
.CW 0
for little-endian MIPS,
.CW 5
for ARM v5 and later 32-bit architectures,
.CW 6
for AMD64,
.CW 8
for Intel 386, and
.CW 9
for 64-bit Power PC.
The character labels the support tools and files for that architecture.
For instance, for the 386 the compiler is
.CW 8c ,
the assembler is
.CW 8a ,
the link editor/loader is
.CW 8l ,
the object files are suffixed
.CW \&.8 ,
and the default name for an executable file is
.CW 8.out .
Before we can use the compiler we therefore need to know which
machine we are compiling for.
The next section explains how this decision is made; for the moment
assume we are building 386 binaries and make the mental substitution for
.CW 8
appropriate to the machine you are actually using.
.PP
To convert source to an executable binary is a two-step process.
First run the compiler,
.CW 8c ,
on the source, say
.CW file.c ,
to generate an object file
.CW file.8 .
Then run the loader,
.CW 8l ,
to generate an executable
.CW 8.out
that may be run (on a 386 machine):
.P1
8c file.c
8l file.8
8.out
.P2
The loader automatically links with whatever libraries the program
needs, usually including the standard C library as defined by
.CW <libc.h> .
Of course the compiler and loader have lots of options, both familiar and new;
see the manual for details.
The compiler does not generate an executable automatically;
the output of the compiler must be given to the loader.
Since most compilation is done under the control of
.CW mk
(see below), this is rarely an inconvenience.
.PP
The distribution of work between the compiler and loader is unusual.
The compiler integrates preprocessing, parsing, register allocation,
code generation and some assembly.
Combining these tasks in a single program is part of the reason for
the compiler's efficiency.
The loader does instruction selection, branch folding,
instruction scheduling,
and writes the final executable.
There is no separate C preprocessor and no assembler in the usual pipeline.
Instead the intermediate object file
(here a
.CW \&.8
file) is a type of binary assembly language.
The instructions in the intermediate format are not exactly those in
the machine.  For example, on the 68020 the object file may specify
a MOVE instruction but the loader will decide just which variant of
the MOVE instruction \(em MOVE immediate, MOVE quick, MOVE address,
etc. \(em is most efficient.
.PP
The assembler,
.CW 8a ,
is just a translator between the textual and binary
representations of the object file format.
It is not an assembler in the traditional sense.  It has limited
macro capabilities (the same as the integral C preprocessor in the compiler),
clumsy syntax, and minimal error checking.  For instance, the assembler
will accept an instruction (such as memory-to-memory MOVE on the MIPS) that the
machine does not actually support; only when the output of the assembler
is passed to the loader will the error be discovered.
The assembler is intended only for writing things that need access to instructions
invisible from C,
such as the machine-dependent
part of an operating system;
very little code in Plan 9 is in assembly language.
.PP
The compilers take an option
.CW -S
that causes them to print on their standard output the generated code
in a format acceptable as input to the assemblers.
This is of course merely a formatting of the
data in the object file; therefore the assembler is just
an
ASCII-to-binary converter for this format.
Other than the specific instructions, the input to the assemblers
is largely architecture-independent; see
``A Manual for the Plan 9 Assembler'',
by Rob Pike,
for more information.
.PP
The loader is an integral part of the compilation process.
Each library header file contains a
.CW #pragma
that tells the loader the name of the associated archive; it is
not necessary to tell the loader which libraries a program uses.
The C run-time startup is found, by default, in the C library.
The loader starts with an undefined
symbol,
.CW _main ,
that is resolved by pulling in the run-time startup code from the library.
(The loader undefines
.CW _mainp
when profiling is enabled, to force loading of the profiling start-up
instead.)
.PP
Unlike its counterpart on other systems, the Plan 9 loader rearranges
data to optimize access.  This means the order of variables in the
loaded program is unrelated to their order in the source.
Most programs don't care, but some assume that, for example, the
variables declared by
.P1
int a;
int b;
.P2
will appear at adjacent addresses in memory.  On Plan 9, they won't.
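.PP
A program that really does need two values to sit side by side can say so in the language rather than rely on declaration order.
The sketch below, with the hypothetical names
.CW pair
and
.CW gap ,
puts them in one array, whose elements C guarantees to be contiguous whatever the loader does with separate variables.

```c
#include <stddef.h>

int pair[2];	/* hypothetical: one object, so its elements are contiguous */

/* Distance, in ints, between the two elements; always 1 for an array. */
ptrdiff_t
gap(void)
{
	return &pair[1] - &pair[0];
}
```

.PP
Members of a single structure likewise keep their declared order, although padding may separate them.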
.SH
Heterogeneity
.PP
When the system starts or a user logs in, the environment is configured
so the appropriate binaries are available in
.CW /bin .
The configuration process is controlled by an environment variable,
.CW $cputype ,
with a value such as
.CW mips ,
.CW 386 ,
.CW arm ,
or
.CW sparc .
For each architecture there is a directory in the root,
with the appropriate name,
that holds the binary and library files for that architecture.
Thus
.CW /mips/lib
contains the object code libraries for MIPS programs,
.CW /mips/include
holds MIPS-specific include files, and
.CW /mips/bin
has the MIPS binaries.
These binaries are attached to
.CW /bin
at boot time by binding
.CW /$cputype/bin
to
.CW /bin ,
so
.CW /bin
always contains the correct files.
.PP
The MIPS compiler,
.CW vc ,
by definition
produces object files for the MIPS architecture,
regardless of the architecture of the machine on which the compiler is running.
There is a version of
.CW vc
compiled for each architecture:
.CW /mips/bin/vc ,
.CW /arm/bin/vc ,
.CW /sparc/bin/vc ,
and so on,
each capable of producing MIPS object files regardless of the native
instruction set.
If one is running on a SPARC,
.CW /sparc/bin/vc
will compile programs for the MIPS;
if one is running on machine
.CW $cputype ,
.CW /$cputype/bin/vc
will compile programs for the MIPS.
.PP
Because of the bindings that assemble
.CW /bin ,
the shell always looks for a command, say
.CW date ,
in
.CW /bin
and automatically finds the file
.CW /$cputype/bin/date .
Therefore the MIPS compiler is known as just
.CW vc ;
the shell will invoke
.CW /bin/vc
and that is guaranteed to be the version of the MIPS compiler
appropriate for the machine running the command.
Regardless of the architecture of the compiling machine,
.CW /bin/vc
is
.I always
the MIPS compiler.
.PP
Also, the output of
.CW vc
and
.CW vl
is completely independent of the machine type on which they are executed:
.CW \&.v
files compiled (with
.CW vc )
on a SPARC may be linked (with
.CW vl )
on a 386.
(The resulting
.CW v.out
will run, of course, only on a MIPS.)
Similarly, the MIPS libraries in
.CW /mips/lib
are suitable for loading with
.CW vl
on any machine; there is only one set of MIPS libraries, not one
set for each architecture that supports the MIPS compiler.
.SH
Heterogeneity and \f(CWmk\fP
.PP
Most software on Plan 9 is compiled under the control of
.CW mk ,
a descendant of
.CW make
that is documented in the Programmer's Manual.
A convention used throughout the
.CW mkfiles
makes it easy to compile the source into binary suitable for any architecture.
.PP
The variable
.CW $cputype
is advisory: it reports the architecture of the current environment, and should
not be modified.  A second variable,
.CW $objtype ,
is used to set which architecture is being
.I compiled
for.
The value of
.CW $objtype
can be used by a
.CW mkfile
to configure the compilation environment.
.PP
In each machine's root directory there is a short
.CW mkfile
that defines a set of macros for the compiler, loader, etc.
Here is
.CW /mips/mkfile :
.P1
</sys/src/mkfile.proto

CC=vc
LD=vl
O=v
AS=va
.P2
The line
.P1
</sys/src/mkfile.proto
.P2
causes
.CW mk
to include the file
.CW /sys/src/mkfile.proto ,
which contains general definitions:
.P1
#
# common mkfile parameters shared by all architectures
#

OS=5689qv
CPUS=arm amd64 386 power mips
CFLAGS=-FTVw
LEX=lex
YACC=yacc
MK=/bin/mk
.P2
.CW CC
is obviously the compiler,
.CW AS
the assembler, and
.CW LD
the loader.
.CW O
is the suffix for the object files and
.CW CPUS
and
.CW OS
are used in special rules described below.
.PP
Here is a
.CW mkfile
to build the installed source for
.CW sam :
.P1
</$objtype/mkfile
OBJ=sam.$O address.$O buffer.$O cmd.$O disc.$O error.$O \e
	file.$O io.$O list.$O mesg.$O moveto.$O multi.$O \e
	plan9.$O rasp.$O regexp.$O string.$O sys.$O xec.$O

$O.out:	$OBJ
	$LD $OBJ

install:	$O.out
	cp $O.out /$objtype/bin/sam

installall:
	for(objtype in $CPUS) mk install

%.$O:	%.c
	$CC $CFLAGS $stem.c

$OBJ:	sam.h errors.h mesg.h
address.$O cmd.$O parse.$O xec.$O unix.$O:	parse.h

clean:V:
	rm -f [$OS].out *.[$OS] y.tab.?
.P2
(The actual
.CW mkfile
imports most of its rules from other secondary files, but
this example works and is not misleading.)
The first line causes
.CW mk
to include the contents of
.CW /$objtype/mkfile
in the current
.CW mkfile .
If
.CW $objtype
is
.CW mips ,
this inserts the MIPS macro definitions into the
.CW mkfile .
In this case the rule for
.CW $O.out
uses the MIPS tools to build
.CW v.out .
The
.CW %.$O
rule in the file uses
.CW mk 's
pattern matching facilities to convert the source files to the object
files through the compiler.
(The text of the rules is passed directly to the shell,
.CW rc ,
without further translation.
See the
.CW mk
manual if any of this is unfamiliar.)
Because the default rule builds
.CW $O.out
rather than
.CW sam ,
it is possible to maintain binaries for multiple machines in the
same source directory without conflict.
This is also, of course, why the output files from the various
compilers and loaders
have distinct names.
.PP
The rest of the
.CW mkfile
should be easy to follow; notice how the rules for
.CW clean
and
.CW installall
(that is, install versions for all architectures) use other macros
defined in
.CW /$objtype/mkfile .
In Plan 9,
.CW mkfiles
for commands conventionally contain rules to
.CW install
(compile and install the version for
.CW $objtype ),
.CW installall
(compile and install for all
.CW $objtypes ),
and
.CW clean
(remove all object files, binaries, etc.).
.PP
The
.CW mkfile
is easy to use.  To build a MIPS binary,
.CW v.out :
.P1
% objtype=mips
% mk
.P2
To build and install a MIPS binary:
.P1
% objtype=mips
% mk install
.P2
To build and install all versions:
.P1
% mk installall
.P2
These conventions make cross-compilation as easy to manage
as traditional native compilation.
Plan 9 programs compile and run without change on machines from
large multiprocessors to laptops.  For more information about this process, see
``Plan 9 Mkfiles'',
by Bob Flandrena.
.SH
Portability
.PP
Within Plan 9, it is painless to write portable programs, programs whose
source is independent of the machine on which they execute.
The operating system is fixed and the compiler, headers and libraries
are constant so most of the stumbling blocks to portability are removed.
Attention to a few details can avoid those that remain.
.PP
Plan 9 is a heterogeneous environment, so programs must
.I expect
that external files will be written by programs on machines of different
architectures.
The compilers, for instance, must handle without confusion
object files written by other machines.
The traditional approach to this problem is to pepper the source with
.CW #ifdefs
to turn byte-swapping on and off.
Plan 9 takes a different approach: of the handful of machine-dependent
.CW #ifdefs
in all the source, almost all are deep in the libraries.
Instead programs read and write files in a defined format,
either (for low volume applications) as formatted text, or
(for high volume applications) as binary in a known byte order.
If the external data were written with the most significant
byte first, the following code reads a 4-byte integer correctly
regardless of the architecture of the executing machine (assuming
an unsigned long holds 4 bytes):
.P1
ulong
getlong(void)
{
	ulong l;

	l = (getchar()&0xFF)<<24;
	l |= (getchar()&0xFF)<<16;
	l |= (getchar()&0xFF)<<8;
	l |= (getchar()&0xFF)<<0;
	return l;
}
.P2
Note that this code does not `swap' the bytes; instead it just reads
them in the correct order.
Variations of this code will handle any binary format
and also avoid problems
involving how structures are padded, how words are aligned,
and other impediments to portability.
Be aware, though, that extra care is needed to handle floating point data.
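.PP
Two such variations might look like the sketch below: one stores a value most significant byte first, the other reads a value that was stored least significant byte first.
The names
.CW putlong
and
.CW getlonglittle
are hypothetical, the functions work on a byte buffer rather than on standard input, and the code is portable ANSI C rather than the Plan 9 dialect.

```c
/* Store the low 32 bits of v in buf, most significant byte first. */
void
putlong(unsigned char *buf, unsigned long v)
{
	buf[0] = (v>>24) & 0xFF;
	buf[1] = (v>>16) & 0xFF;
	buf[2] = (v>>8) & 0xFF;
	buf[3] = v & 0xFF;
}

/* Read a 4-byte integer that was stored least significant byte first. */
unsigned long
getlonglittle(unsigned char *buf)
{
	unsigned long l;

	l = (unsigned long)buf[0];
	l |= (unsigned long)buf[1]<<8;
	l |= (unsigned long)buf[2]<<16;
	l |= (unsigned long)buf[3]<<24;
	return l;
}
```

.PP
Neither function asks what byte order the executing machine uses; each simply names the bytes in the order the external format defines.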
.PP
Efficiency hounds will argue that this method is unnecessarily slow and clumsy
when the executing machine has the same byte order (and padding and alignment)
as the data.
The CPU cost of I/O processing
is rarely the bottleneck for an application, however,
and the gain in simplicity of porting and maintaining the code greatly outweighs
the minor speed loss from handling data in this general way.
This method is how the Plan 9 compilers, the window system, and even the file
servers transmit data between programs.
.PP
To port programs beyond Plan 9, where the system interface is more variable,
it is probably necessary to use
.CW pcc
and hope that the target machine supports ANSI C and POSIX.
.SH
I/O
.PP
The default C library, defined by the include file
.CW <libc.h> ,
contains no buffered I/O package.
It does have several entry points for printing formatted text:
.CW print
outputs text to the standard output,
.CW fprint
outputs text to a specified integer file descriptor, and
.CW sprint
places text in a character array.
To access library routines for buffered I/O, a program must
explicitly include the header file associated with an appropriate library.
.PP
The recommended I/O library, used by most Plan 9 utilities, is
.CW bio
(buffered I/O), defined by
.CW <bio.h> .
There also exists an implementation of ANSI Standard I/O,
.CW stdio .
.PP
.CW Bio
is small and efficient, particularly for buffer-at-a-time or
line-at-a-time I/O.
Even for character-at-a-time I/O, however, it is significantly faster than
the Standard I/O library,
.CW stdio .
Its interface is compact and regular, although it lacks a few conveniences.
The most noticeable is that one must explicitly define buffers for standard
input and output;
.CW bio
does not predefine them.  Here is a program to copy input to output a byte
at a time using
.CW bio :
.P1
#include <u.h>
#include <libc.h>
#include <bio.h>

Biobuf	bin;
Biobuf	bout;

void
main(void)
{
	int c;

	Binit(&bin, 0, OREAD);
	Binit(&bout, 1, OWRITE);

	while((c=Bgetc(&bin)) != Beof)
		Bputc(&bout, c);
	exits(0);
}
.P2
For peak performance, we could replace
.CW Bgetc
and
.CW Bputc
by their equivalent in-line macros
.CW BGETC
and
.CW BPUTC
but
the performance gain would be modest.
For more information on
.CW bio ,
see the Programmer's Manual.
.PP
Perhaps the most dramatic difference in the I/O interface of Plan 9 from other
systems' is that text is not ASCII.
The format for
text in Plan 9 is a byte-stream encoding of 21-bit characters.
The character set is based on the Unicode Standard and is backward compatible with
ASCII:
characters with value 0 through 127 are the same in both sets.
The 21-bit characters, called
.I runes
in Plan 9, are encoded using a representation called
UTF,
an encoding that is becoming accepted as a standard.
(ISO calls it UTF-8;
throughout Plan 9 it's just called
UTF.)
UTF
defines multibyte sequences to
represent character values from 0 to 1,114,111.
In
UTF,
character values up to 127 decimal, 7F hexadecimal, represent themselves,
so straight
ASCII
files are also valid
UTF.
Also,
UTF
guarantees that bytes with values 0 to 127 (NUL to DEL, inclusive)
will appear only when they represent themselves, so programs that read bytes
looking for plain ASCII characters will continue to work.
Any program that expects a one-to-one correspondence between bytes and
characters will, however, need to be modified.
An example is parsing file names.
File names, like all text, are in
UTF,
so it is incorrect to search for a character in a string by
.CW strchr(filename,
.CW c)
because the character might have a multi-byte encoding.
The correct method is to call
.CW utfrune(filename,
.CW c) ,
defined in
.I rune (2),
which interprets the file name as a sequence of encoded characters
rather than bytes.
In fact, even when you know the character is a single byte
that can represent only itself,
it is safer to use
.CW utfrune
because that assumes nothing about the character set
and its representation.
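.PP
The failure of a byte search can be seen in the following sketch, which is portable C and uses the standard
.CW strstr
on the encoded sequence as a stand-in, since
.CW utfrune
exists only in the Plan 9 library.
The file name and the function names
.CW bytesearch
and
.CW seqsearch
are hypothetical.

```c
#include <string.h>

/* Hypothetical file name "cafe.txt" with e-acute (value E9 hex), encoded in UTF as C3 A9. */
char *filename = "caf\xC3\xA9.txt";

/* Byte search: looks for the single byte E9, which never occurs in the name. */
char*
bytesearch(void)
{
	return strchr(filename, 0xE9);
}

/* Sequence search: looks for the encoded form of the character. */
char*
seqsearch(void)
{
	return strstr(filename, "\xC3\xA9");
}
```

.PP
The byte search finds nothing because the character occupies two bytes, neither of which equals its value; the sequence search finds it at offset 3.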
.PP
The library defines several symbols relevant to the representation of characters.
Any byte with unsigned value less than
.CW Runesync
will not appear in any multi-byte encoding of a character.
.CW Utfrune
compares the character being searched for against
.CW Runesync
to see if it is sufficient to call
.CW strchr
or if the byte stream must be interpreted.
Any character with value less than
.CW Runeself
is represented by a single byte with the same value.
Finally, when errors are encountered converting
to runes from a byte stream, the library returns the rune value
.CW Runeerror
and advances a single byte.  This permits programs to find runes
embedded in binary data.
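.PP
The error rule can be sketched in portable C as follows.
This is not the library's code: the function
.CW decode
and the constants
.CW Self
and
.CW Error
are hypothetical, their values (80 and FFFD hexadecimal) are assumptions standing in for
.CW Runeself
and
.CW Runeerror ,
and the sketch does not reject overlong encodings or values above 1,114,111 as a complete decoder should.

```c
enum {
	Self = 0x80,	/* assumed: values below this encode as themselves */
	Error = 0xFFFD	/* assumed: value reported for malformed input */
};

/*
 * Decode one character from the NUL-terminated string s into *r and
 * return the number of bytes consumed.  Malformed input yields Error
 * and consumes exactly one byte.
 */
int
decode(unsigned long *r, unsigned char *s)
{
	int i, n;
	unsigned long c;

	c = s[0];
	if(c < Self){
		*r = c;
		return 1;
	}
	if((c & 0xE0) == 0xC0){
		n = 2;
		c &= 0x1F;
	}else if((c & 0xF0) == 0xE0){
		n = 3;
		c &= 0x0F;
	}else if((c & 0xF8) == 0xF0){
		n = 4;
		c &= 0x07;
	}else
		goto bad;
	for(i = 1; i < n; i++){
		if((s[i] & 0xC0) != 0x80)
			goto bad;
		c = (c<<6) | (s[i] & 0x3F);
	}
	*r = c;
	return n;
bad:
	*r = Error;
	return 1;
}
```

.PP
Advancing by one byte after an error lets a caller resynchronize at the next byte that can begin a character, which is what makes scanning binary data for embedded text possible.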
.PP
.CW Bio
includes routines
.CW Bgetrune
and
.CW Bputrune
to transform the external byte stream
UTF
format to and from
internal 21-bit runes.
Also, the
.CW %s
format to
.CW print
accepts
UTF;
.CW %c
prints a character after narrowing it to 8 bits.
The
.CW %S
format prints a null-terminated sequence of runes;
.CW %C
prints a character after narrowing it to 21 bits.
For more information, see the Programmer's Manual, in particular
.I utf (6)
and
.I rune (2),
and the paper,
``Hello world, or
Καλημέρα κόσμε, or\ 
\f(Jpこんにちは 世界\f1'',
by Rob Pike and
Ken Thompson;
there is not room for the full story here.
.PP
These issues affect the compiler in several ways.
First, the C source is in
UTF.
ANSI says C variables are formed from
ASCII
alphanumerics, but comments and literal strings may contain any characters
encoded in the native encoding, here
UTF.
The declaration
.P1
char *cp = "abcÿ";
.P2
initializes the variable
.CW cp
to point to an array of bytes holding the
UTF
representation of the characters
.CW abcÿ .
The type
.CW Rune
is defined in
.CW <u.h>
to be
.CW ulong ,
which is also the `wide character' type in the compiler.
Therefore the declaration
.P1
Rune *rp = L"abcÿ";
.P2
initializes the variable
.CW rp
to point to an array of unsigned long integers holding the 21-bit
values of the characters
.CW abcÿ .
Note that in both these declarations the characters in the source
that represent
.CW "abcÿ"
are the same; what changes is how those characters are represented
in memory in the program.
The following two lines:
.P1
print("%s\en", "abcÿ");
print("%S\en", L"abcÿ");
.P2
produce the same
UTF
string on their output, the first by copying the bytes, the second
by converting from runes to bytes.
.PP
In C, character constants are integers but narrowed through the
.CW char
type.
The Unicode character
.CW ÿ
has value 255, so if the
.CW char
type is signed,
the constant
.CW 'ÿ'
has value \-1 (which is equal to EOF).
On the other hand,
.CW L'ÿ'
narrows through the wide character type,
.CW ulong ,
and therefore has value 255.
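.PP
The arithmetic can be checked with the portable sketch below, which narrows the value 255 explicitly instead of writing a character constant.
The function names are hypothetical, and the signed result assumes a two's-complement machine with 8-bit characters.

```c
/* Narrow the value 255 through a signed 8-bit type, as a signed char would. */
int
narrowsigned(void)
{
	return (signed char)255;	/* -1 on two's-complement machines */
}

/* Narrow the same value through a wider unsigned type. */
int
narrowwide(void)
{
	return (unsigned short)255;	/* still 255 */
}
```

.PP
Any unsigned type of 16 bits or more leaves the value untouched, which is why the wide form of the constant is safe to compare against input.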
1008
.PP
1009
Finally, although it's not ANSI C, the Plan 9 C compilers
1010
assume any character with value above
1011
.CW Runeself
1012
is an alphanumeric,
1013
so α is a legal, if non-portable, variable name.
1014
.SH
Arguments
.PP
Some macros are defined
in
.CW <libc.h>
for parsing the arguments to
.CW main() .
They are described in
.I ARG (2)
but are fairly self-explanatory.
There are four macros:
.CW ARGBEGIN
and
.CW ARGEND
are used to bracket a hidden
.CW switch
statement within which
.CW ARGC
returns the current option character (rune) being processed and
.CW ARGF
returns the argument to the option, as in the loader option
.CW -o
.CW file .
Here, for example, is the code at the beginning of
.CW main()
in
.CW ramfs.c
(see
.I ramfs (1))
that cracks its arguments:
.P1
void
main(int argc, char *argv[])
{
	char *defmnt;
	int p[2];
	int mfd[2];
	int stdio = 0;

	defmnt = "/tmp";
	ARGBEGIN{
	case 'i':
		defmnt = 0;
		stdio = 1;
		mfd[0] = 0;
		mfd[1] = 1;
		break;
	case 's':
		defmnt = 0;
		break;
	case 'm':
		defmnt = ARGF();
		break;
	default:
		usage();
	}ARGEND
.P2
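.PP
The macros hide a loop of roughly the following shape.
This is a portable sketch written for illustration, not the definition from
.CW <libc.h> :
the names
.CW Opts
and
.CW parseopts
are hypothetical, option characters are treated as single bytes rather than runes,
and a missing argument to
.CW -m
is reported as an error instead of yielding a nil pointer.

```c
#include <stddef.h>
#include <string.h>

typedef struct Opts Opts;
struct Opts
{
	const char *defmnt;	/* mount point, or NULL for none */
	int stdio;		/* nonzero when -i was given */
};

/* Parse ramfs-style options; returns 0 on success, -1 on a usage error. */
int
parseopts(int argc, char *argv[], Opts *o)
{
	int i;
	const char *p;

	o->defmnt = "/tmp";
	o->stdio = 0;
	for(i = 1; i < argc && argv[i][0] == '-' && argv[i][1] != 0; i++){
		if(argv[i][1] == '-' && argv[i][2] == 0)
			break;	/* "--" ends the options */
		/* several option letters may share one word, as in -is */
		for(p = argv[i] + 1; *p != 0; p++){
			switch(*p){
			case 'i':
				o->defmnt = NULL;
				o->stdio = 1;
				break;
			case 's':
				o->defmnt = NULL;
				break;
			case 'm':
				/* the argument is the rest of this word, else the next word */
				if(p[1] != 0){
					o->defmnt = p + 1;
					p += strlen(p) - 1;	/* consume the rest of the word */
				}else if(i + 1 < argc)
					o->defmnt = argv[++i];
				else
					return -1;
				break;
			default:
				return -1;
			}
		}
	}
	return 0;
}
```

Both
.CW "-m /n/ram"
and
.CW -m/n/ram
set the mount point in this sketch, which is the behavior the text describes for
.CW ARGF .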
.SH
Extensions
.PP
The compiler has several extensions to 1989 ANSI C, all of which are used
extensively in the system source.
Some of these have been adopted in later ANSI C standards.
First,
.I structure
.I displays
permit
.CW struct
expressions to be formed dynamically.
Given these declarations:
.P1
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

Point	p, q, add(Point, Point);
Rectangle r;
int	x, y;
.P2
this assignment may appear anywhere an assignment is legal:
.P1
r = (Rectangle){add(p, q), (Point){x, y+3}};
.P2
The syntax is the same as for initializing a structure but with
a leading cast.
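.PP
Structure displays correspond to what the 1999 C standard calls compound
literals, so the assignment can be tried with any C99 compiler.
In the sketch below the body of
.CW add
and the wrapper
.CW mkrect
are assumptions made for the example.

```c
typedef struct Point Point;
typedef struct Rectangle Rectangle;

struct Point
{
	int x, y;
};

struct Rectangle
{
	Point min, max;
};

/* Assumed meaning of add: component-wise sum, built with a display. */
Point
add(Point a, Point b)
{
	return (Point){a.x + b.x, a.y + b.y};
}

/* Hypothetical wrapper around the assignment shown in the text. */
Rectangle
mkrect(Point p, Point q, int x, int y)
{
	Rectangle r;

	r = (Rectangle){add(p, q), (Point){x, y+3}};
	return r;
}
```

The display is an ordinary expression, so it can also appear directly as a
function argument or a return value, as in
.CW add .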
.PP
If an
.I anonymous
.I structure
or
.I union
is declared within another structure or union, the members of the internal
structure or union are addressable without prefix in the outer structure.
This feature eliminates the clumsy naming of nested structures and,
particularly, unions.
For example, after these declarations,
.P1
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union */
	struct Lock;	/* anonymous structure */
} *node;

void	lock(struct Lock*);
.P2
one may refer to
.CW node->type ,
.CW node->dval ,
.CW node->fval ,
.CW node->lval ,
and
.CW node->locked .
Moreover, the address of a
.CW struct
.CW Node
may be used without a cast anywhere that the address of a
.CW struct
.CW Lock
is used, such as in argument lists.
The compiler automatically promotes the type and adjusts the address.
Thus one may invoke
.CW lock(node) .
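.PP
For comparison, here is a sketch of the nearest equivalent in standard C.
The 2011 C standard accepts the anonymous union, but not an unnamed member
declared by structure tag, so the lock is given a name
.CW lk ) (
and its address is taken explicitly.
The member name, the body of
.CW lock ,
and the function
.CW locknode
are assumptions made for the example.

```c
struct Lock
{
	int	locked;
};

struct Node
{
	int	type;
	union{
		double  dval;
		float   fval;
		long    lval;
	};		/* anonymous union: standard since C11 */
	struct Lock lk;	/* named member: no automatic promotion in standard C */
};

/* Assumed body, so the example can be run. */
void
lock(struct Lock *l)
{
	l->locked = 1;
}

/* What lock(node) does in Plan 9 C must be spelled out here. */
void
locknode(struct Node *node)
{
	lock(&node->lk);
}
```

The explicit
.CW &node->lk
is the address adjustment that the Plan 9 compiler performs by itself.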
.PP
Anonymous structures and unions may be accessed by type name
if (and only if) they are declared using a
.CW typedef
name.
For example, using the above declaration for
.CW Point ,
one may declare
.P1
struct
{
	int	type;
	Point;
} p;
.P2
and refer to
.CW p.Point .
.PP
In the initialization of arrays, a number in square brackets before an
element sets the index for the initialization.  For example, to initialize
some elements in
a table of function pointers indexed by
ASCII
character,
.P1
void	percent(void), slash(void);

void	(*func[128])(void) =
{
	['%']	percent,
	['/']	slash,
};
.P2
.LP
A similar syntax allows one to initialize structure elements:
.P1
Point p =
{
	.y 100,
	.x 200
};
.P2
These initialization syntaxes were later added to ANSI C, with the addition of an
equals sign between the index or tag and the value.
The Plan 9 compiler accepts either form.
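.PP
Written with the equals sign, the same initializations compile under any
C99 compiler.
In this sketch the bodies of
.CW percent
and
.CW slash
and the variable
.CW called
are made up so that the table can be exercised.

```c
typedef struct Point Point;

struct Point
{
	int x, y;
};

int	called;	/* records which handler ran last (illustrative only) */

void
percent(void)
{
	called = '%';
}

void
slash(void)
{
	called = '/';
}

/* Elements not named in the initializer are zero (nil pointers). */
void	(*func[128])(void) =
{
	['%'] = percent,
	['/'] = slash,
};

/* Members may be listed in any order. */
Point p =
{
	.y = 100,
	.x = 200
};
```

A caller would test
.CW func[c]
against zero before calling through it, since only two of the 128 entries are set.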
.PP
Finally, the declaration
.P1
extern register reg;
.P2
.I this "" (
appearance of the register keyword is not ignored)
allocates a global register to hold the variable
.CW reg .
External registers must be used carefully: they need to be declared in
.I all
source files and libraries in the program to guarantee the register
is not allocated temporarily for other purposes.
Especially on machines with few registers, such as the i386,
it is easy to link accidentally with code that has already usurped
the global registers and there is no diagnostic when this happens.
Used wisely, though, external registers are powerful.
The Plan 9 operating system uses them to access per-process and
per-machine data structures on a multiprocessor.  The storage class they provide
is hard to create in other ways.
.SH
The compile-time environment
.PP
The code generated by the compilers is `optimized' by default:
variables are placed in registers and peephole optimizations are
performed.
The compiler flag
.CW -N
disables these optimizations.
Registerization is done locally rather than throughout a function:
whether a variable occupies a register or
the memory location identified in the symbol
table depends on the activity of the variable and may change
throughout the life of the variable.
The
.CW -N
flag is rarely needed;
its main use is to simplify debugging.
There is no information in the symbol table to identify the
registerization of a variable, so
.CW -N
guarantees the variable is always where the symbol table says it is.
.PP
Another flag,
.CW -w ,
turns
.I on
warnings about portability and problems detected in flow analysis.
Most code in Plan 9 is compiled with warnings enabled;
these warnings plus the type checking offered by function prototypes
provide most of the support of the Unix tool
.CW lint
more accurately and with less chatter.
Two of the warnings,
`used and not set' and `set and not used', are almost always accurate but
may be triggered spuriously by code with invisible control flow,
such as in routines that call
.CW longjmp .
The compiler statements
.P1
SET(v1);
USED(v2);
.P2
decorate the flow graph to silence the compiler.
Either statement accepts a comma-separated list of variables.
Use them carefully: they may silence real errors.
For the common case of unused parameters to a function,
leaving the name off the declaration silences the warnings.
That is, listing the type of a parameter but giving it no
associated variable name does the trick.
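.PP
Outside Plan 9 the usual stand-in for
.CW USED
is a cast to
.CW void .
The sketch below is a portable analogue, not the Plan 9 definition: the macro
.CW UNUSED
and the function
.CW handler
are hypothetical, and the parameter keeps its name because C standards before
C23 require one in a function definition.

```c
/* Portable stand-in: evaluates v and discards it, marking it as used. */
#define UNUSED(v)	((void)(v))

/* A callback whose second parameter is required by the interface but not needed. */
int
handler(int fd, void *aux)
{
	UNUSED(aux);
	return fd + 1;
}
```

As with
.CW USED ,
the cast can hide a real mistake, so it is best kept for parameters that an
interface forces on the function.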
.SH
Debugging
.PP
There are two debuggers available on Plan 9.
The first, and older, is
.CW db ,
a revision of Unix
.CW adb .
The other,
.CW acid ,
is a source-level debugger whose commands are statements in
a true programming language.
.CW Acid
is the preferred debugger, but since it
borrows some elements of
.CW db ,
notably the formats for displaying values, it is worth knowing a little bit about
.CW db .
.PP
Both debuggers support multiple architectures in a single program; that is,
the programs are
.CW db
and
.CW acid ,
not for example
.CW vdb
and
.CW vacid .
They also support cross-architecture debugging comfortably:
one may debug a 386 binary on a MIPS.
.PP
Imagine a program has crashed mysteriously:
.P1
% X11/X
Fatal server bug!
failed to create default stipple
X 106: suicide: sys: trap: fault read addr=0x0 pc=0x00105fb8
% 
.P2
When a process dies on Plan 9 it hangs in the `broken' state
for debugging.
Attach a debugger to the process by naming its process id:
.P1
% acid 106
/proc/106/text:mips plan 9 executable

/sys/lib/acid/port
/sys/lib/acid/mips
acid: 
.P2
The
.CW acid
function
.CW stk()
reports the stack traceback:
.P1
acid: stk()
At pc:0x105fb8:abort+0x24 /sys/src/ape/lib/ap/stdio/abort.c:6
abort() /sys/src/ape/lib/ap/stdio/abort.c:4
	called from FatalError+#4e
		/sys/src/X/mit/server/dix/misc.c:421
FatalError(s9=#e02, s8=#4901d200, s7=#2, s6=#72701, s5=#1,
    s4=#7270d, s3=#6, s2=#12, s1=#ff37f1c, s0=#6, f=#7270f)
    /sys/src/X/mit/server/dix/misc.c:416
	called from gnotscreeninit+#4ce
		/sys/src/X/mit/server/ddx/gnot/gnot.c:792
gnotscreeninit(snum=#0, sc=#80db0)
    /sys/src/X/mit/server/ddx/gnot/gnot.c:766
	called from AddScreen+#16e
		/n/bootes/sys/src/X/mit/server/dix/main.c:610
AddScreen(pfnInit=0x0000129c,argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:530
	called from InitOutput+0x80
		/sys/src/X/mit/server/ddx/brazil/brddx.c:522
InitOutput(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/ddx/brazil/brddx.c:511
	called from main+0x294
		/sys/src/X/mit/server/dix/main.c:225
main(argc=0x00000001,argv=0x7fffffe4)
    /sys/src/X/mit/server/dix/main.c:136
	called from _main+0x24
		/sys/src/ape/lib/ap/mips/main9.s:8
.P2
The function
.CW lstk()
is similar but
also reports the values of local variables.
Note that the traceback includes full file names; this is a boon to debugging,
although it makes the output much noisier.
.PP
To use
.CW acid
well you will need to learn its input language; see the
``Acid Manual'',
by Phil Winterbottom,
for details.  For simple debugging, however, the information in the manual page is
sufficient.  In particular, it describes the most useful functions
for examining a process.
.PP
The compiler does not place
information describing the types of variables in the executable,
but a compile-time flag provides crude support for symbolic debugging.
The
.CW -a
flag to the compiler suppresses code generation
and instead emits source text in the
.CW acid
language to format and display data structure types defined in the program.
The easiest way to use this feature is to put a rule in the
.CW mkfile :
.P1
syms:   main.$O
        $CC -a main.c > syms
.P2
Then from within
.CW acid ,
.P1
acid: include("sourcedirectory/syms")
.P2
to read in the relevant definitions.
(For multi-file source, you need to be a little fancier;
see
.I 8c (1)).
This text includes, for each defined compound
type, a function with that name that may be called with the address of a structure
of that type to display its contents.
For example, if
.CW rect
is a global variable of type
.CW Rectangle ,
one may execute
.P1
Rectangle(*rect)
.P2
to display it.
The
.CW *
(indirection) operator is necessary because
of the way
.CW acid
works: each global symbol in the program is defined as a variable by
.CW acid ,
with value equal to the
.I address
of the symbol.
.PP
Another common technique is to write by hand special
.CW acid
code to define functions to aid debugging, initialize the debugger, and so on.
Conventionally, this is placed in a file called
.CW acid
in the source directory; it has a line
.P1
include("sourcedirectory/syms");
.P2
to load the compiler-produced symbols.  One may edit the compiler output directly but
it is wiser to keep the hand-generated
.CW acid
separate from the machine-generated.
.PP
To make things simple, the default rules in the system
.CW mkfiles
include entries to make
.CW foo.acid
from
.CW foo.c ,
so one may use
.CW mk
to automate the production of
.CW acid
definitions for a given C source file.
.PP
There is much more to say here.  See the
.CW acid
manual page, the reference manual, or the paper
``Acid: A Debugger Built From A Language'',
also by Phil Winterbottom.