.HTML "Plan 9 C Compilers
2
.TL
3
Plan 9 C Compilers
4
.AU
5
Ken Thompson
6
ken@plan9.bell-labs.com
7
.AB
8
.FS
9
Originally appeared, in a different form, in
10
.I
11
Proceedings of the Summer 1990 UKUUG Conference,
12
.R
13
pp. 41-51,
14
London, 1990.
15
.FE
16
This paper describes the overall structure and function of the Plan 9 C compilers.
17
A more detailed implementation document
18
for any one of the compilers
19
is yet to be written.
20
.AE
21
.NH
22
Introduction
23
.LP
24
There are many compilers in the series.
25
Six of the compilers
26
(Intel 386, AMD64, PowerPC, PowerPC 64-bit, ARM, MIPS R3000)
27
are considered active and are used to compile
28
current versions of Plan 9.
29
One of the compilers (SPARC)
30
is maintained but is for older machines
31
for which we have no current ports of Plan 9;
32
we are unlikely to port to any SPARC machines.
33
The DEC Alpha and Motorola 68020 compilers have been retired.
34
Several others (Motorola 68000, Intel 960, AMD 29000)
35
have had only limited use, such as
36
to program peripherals or experimental devices.
37
Any of the disused compilers could be restored if needed.
38
.NH
39
Structure
40
.LP
41
The compiler is a single program that produces an
42
object file.
43
Combined in the compiler are the traditional
44
roles of preprocessor, lexical analyzer, parser, code generator,
45
local optimizer,
46
and first half of the assembler.
47
The object files are binary forms of assembly
48
language,
49
similar to what might be passed between
50
the first and second passes of an assembler.
51
.LP
52
Object files and libraries
53
are combined by a loader
54
program to produce the executable binary.
55
The loader combines the roles of second half
56
of the assembler, global optimizer, and loader.
57
The names of the compilers, loaders, and assemblers
58
are as follows:
59
.DS
60
.ta 1.5i
61
.de Ta
62
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
63
..
64
.Ta SPARC kc kl ka
65
.Ta PowerPC qc ql qa
66
.Ta MIPS vc vl va
67
.Ta MIPS\ little-endian 0c 0l 0a
68
.Ta ARM 5c 5l 5a
69
.Ta AMD64 6c 6l 6a
70
.Ta Intel\ 386 8c 8l 8a
71
.Ta PowerPC\ 64-bit 9c 9l 9a
72
.DE
73
There is a further breakdown
74
in the source of the compilers into
75
object-independent and
76
object-dependent
77
parts.
78
All of the object-independent parts
79
are combined into source files in the
80
directory
81
.CW /sys/src/cmd/cc .
82
The object-dependent parts are collected
83
in a separate directory for each compiler,
84
for example
85
.CW /sys/src/cmd/vc .
86
All of the code,
87
both object-independent and
88
object-dependent,
89
is machine-independent
90
and may be cross-compiled and executed on any
91
of the architectures.
92
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
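.LP
For example, the following function (purely illustrative; it is not taken
from any Plan 9 source) is accepted under the rules above.
The only diagnostic it can draw is a warning on the assignment to
.CW max ,
and only when the
.CW -w
option is given.
.P1
void
f(volatile int *ticks, const int max)
{
	register int r;		/* register is accepted but ignored */

	r = *ticks + max;	/* *ticks is never cached in a register */
	max = r;		/* non-constant use of max: warning under -w only */
	*ticks = r;
}
.P2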
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
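.LP
Conditional compilation therefore has to be phrased in terms of
.CW #ifdef .
A small illustration (the header name is hypothetical):
.P1
/* not accepted by the built-in preprocessor:
 *	#if defined(DEBUG) && DEBUG > 1
 */
#ifdef DEBUG
#include "debug.h"
#endif
.P2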
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;
 
	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;
 
	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows references to elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
for references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structure members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
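.LP
For example, the following declaration (an illustration of the syntax only,
not code from Plan 9) initializes the members by name and out of order:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	struct
	{
		int	x;
		int	y;
	} pt =
	{
		.y	2,
		.x	1,
	};
.DE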
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
nor one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
.CW U
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
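.LP
A sketch of what such declarations look like; the type names below are
illustrative rather than the kernel's exact definitions,
and the same two lines would have to appear identically in every module
loaded into the kernel.
.P1
extern register User *u;	/* the currently running process */
extern register Mach *m;	/* this processor's private data */
.P2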
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On some of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW $O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW $M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /$M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
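.LP
For example, a library's header might carry a line such as the following
(the library name here is only illustrative);
a program that includes the header is then linked against
.CW /$M/lib/libbio.a
automatically.
.P1
#pragma lib "libbio.a"
.P2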
.LP
They also accept
.CW #pragma
.CW packed
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW packed
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
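.LP
For instance, a structure that must match an externally imposed byte-level
layout might be declared as follows (illustrative only); the
.CW length
field then sits at offset 1 rather than at its naturally aligned offset,
and dereferencing it may fault on machines that require alignment.
.P1
#pragma packed on
typedef
struct Hdr
{
	unsigned char	type;
	unsigned long	length;
} Hdr;
#pragma packed off
.P2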
.LP
Similarly,
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when
profiling is enabled for the rest of the program.
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
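.LP
The first pragma above would typically accompany a declaration along the
lines of the following (the signature is hypothetical):
.P1
void error(int level, char *fmt, ...);
.P2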
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the address at which to store the
structure result as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
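.LP
As an illustration of where such subtrees come from, consider the following
hypothetical declarations on a machine with 4-byte ints and no padding:
.P1
typedef
struct T
{
	int	count;
	int	val[7];
} T;

T tab[100];

/* The address of tab[i].val[j] is   tab + 32*i + 4 + 4*j, */
/* which factoring rewrites as       tab + 4 + 4*(8*i + j). */
.P2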
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
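.LP
The rule is small enough to state directly in code.
The following is a sketch only; the node structure and the names used here
are invented for the illustration, not taken from the compiler source.
.P1
enum { OFUNC = 1, INF = 1000 };	/* INF: effectively infinite */

typedef struct Node Node;
struct Node
{
	int	op;		/* OFUNC, a unary, or a binary operator */
	int	addable;	/* set by the addressability walk */
	int	su;		/* Sethi-Ullman complexity */
	Node	*left;
	Node	*right;		/* zero for unary operators */
};

void
sucalc(Node *n)
{
	int l, r;

	if(n->addable){			/* addressable: no register needed */
		n->su = 0;
		return;
	}
	if(n->op == OFUNC){		/* function call */
		n->su = INF;
		return;
	}
	sucalc(n->left);
	l = n->left->su;
	if(n->right == 0){		/* unary operator */
		n->su = l>1? l: 1;
		return;
	}
	sucalc(n->right);
	r = n->right->su;
	if(l == r)
		n->su = l+1;		/* equal marks: one more register */
	else
		n->su = l>r? l: r;	/* unequal marks: the larger one */
}
.P2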
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
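.LP
A minimal sketch of the loop weighting, assuming an instruction list in
which each instruction carries the loop-nesting depth found by the loop
detector (the structure and names are invented here):
the cost of a variable is the sum of this weight over the instructions
that use or set it, and the highest-cost variables get the available
registers first.
.P1
typedef struct Prog Prog;
struct Prog
{
	Prog	*link;
	int	loopdepth;	/* loop-nesting level of this instruction */
	/* ... */
};

long
refweight(Prog *p)
{
	long w;
	int i;

	w = 1;
	for(i = 0; i < p->loopdepth; i++)
		w *= 3;		/* times three per level of nesting */
	return w;
}
.P2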
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
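.LP
The iteration can be sketched as follows; the data structure and the helper
.CW brsize
(which returns the smallest encoding that reaches the target from the
instruction's current location) are invented for the illustration.
Sizes only ever grow, so the loop terminates.
.P1
typedef struct Inst Inst;
struct Inst
{
	Inst	*link;
	Inst	*target;	/* nonzero for branch instructions */
	long	pc;		/* current guess at location */
	int	size;		/* current guess at size, initially minimal */
};

int	brsize(Inst*);		/* smallest encoding reaching the target */

void
span(Inst *first)
{
	Inst *i;
	long pc;
	int again;

	do{
		again = 0;
		pc = 0;
		for(i = first; i != 0; i = i->link){
			i->pc = pc;
			if(i->target != 0 && brsize(i) > i->size){
				i->size = brsize(i);	/* expand, never shrink */
				again = 1;
			}
			pc += i->size;
		}
	}while(again);
}
.P2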
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files are also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source
 
	\0545	machine-dependent headers
	6532	machine-dependent C source
 
	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time
 
	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time
 
	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time
 
	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time
 
	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.