.HTML "The Organization of Networks in Plan 9
2
.TL
3
The Organization of Networks in Plan 9
4
.AU
5
Dave Presotto
6
Phil Winterbottom
7
.sp
8
presotto,philw@plan9.bell-labs.com
9
.AB
10
.FS
11
Originally appeared in
12
.I
13
Proc. of the Winter 1993 USENIX Conf.,
14
.R
15
pp. 271-280,
16
San Diego, CA
17
.FE
18
In a distributed system networks are of paramount importance. This
19
paper describes the implementation, design philosophy, and organization
20
of network support in Plan 9. Topics include network requirements
21
for distributed systems, our kernel implementation, network naming, user interfaces,
22
and performance. We also observe that much of this organization is relevant to
23
current systems.
24
.AE
25
.NH
26
Introduction
27
.PP
28
Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system
29
implemented on a variety of computers and networks.
30
What distinguishes Plan 9 is its organization.
31
The goals of this organization were to
32
reduce administration
33
and to promote resource sharing. One of the keys to its success as a distributed
34
system is the organization and management of its networks.
35
.PP
36
A Plan 9 system comprises file servers, CPU servers and terminals.
37
The file servers and CPU servers are typically centrally
38
located multiprocessor machines with large memories and
39
high speed interconnects.
40
A variety of workstation-class machines
41
serve as terminals
42
connected to the central servers using several networks and protocols.
43
The architecture of the system demands a hierarchy of network
44
speeds matching the needs of the components.
45
Connections between file servers and CPU servers are high-bandwidth point-to-point
46
fiber links.
47
Connections from the servers fan out to local terminals
48
using medium speed networks
49
such as Ethernet [Met80] and Datakit [Fra80].
50
Low speed connections via the Internet and
51
the AT&T backbone serve users in Oregon and Illinois.
52
Basic Rate ISDN data service and 9600 baud serial lines provide slow
53
links to users at home.
54
.PP
55
Since CPU servers and terminals use the same kernel,
56
users may choose to run programs locally on
57
their terminals or remotely on CPU servers.
58
The organization of Plan 9 hides the details of system connectivity
59
allowing both users and administrators to configure their environment
60
to be as distributed or centralized as they wish.
61
Simple commands support the
62
construction of a locally represented name space
63
spanning many machines and networks.
64
At work, users tend to use their terminals like workstations,
65
running interactive programs locally and
66
reserving the CPU servers for data or compute intensive jobs
67
such as compiling and computing chess endgames.
68
At home or when connected over
69
a slow network, users tend to do most work on the CPU server to minimize
70
traffic on the slow links.
71
The goal of the network organization is to provide the same
72
environment to the user wherever resources are used.
73
.NH 
74
Kernel Network Support
75
.PP
76
Networks play a central role in any distributed system. This is particularly
77
true in Plan 9 where most resources are provided by servers external to the kernel.
78
The importance of the networking code within the kernel
79
is reflected by its size;
80
of 25,000 lines of kernel code, 12,500 are network and protocol related.
81
Networks are continually being added and the fraction of code
82
devoted to communications
83
is growing.
84
Moreover, the network code is complex.
85
Protocol implementations consist almost entirely of
86
synchronization and dynamic memory management, areas demanding 
87
subtle error recovery
88
strategies.
89
The kernel currently supports Datakit, point-to-point fiber links,
90
an Internet (IP) protocol suite and ISDN data service.
91
The variety of networks and machines
92
has raised issues not addressed by other systems running on commercial
93
hardware supporting only Ethernet or FDDI.
94
.NH 2
95
The File System protocol
96
.PP
97
A central idea in Plan 9 is the representation of a resource as a hierarchical
98
file system.
99
Each process assembles a view of the system by building a
100
.I "name space
101
[Needham] connecting its resources.
102
File systems need not represent disc files; in fact, most Plan 9 file systems have no
103
permanent storage.
104
A typical file system dynamically represents
105
some resource like a set of network connections or the process table.
106
Communication between the kernel, device drivers, and local or remote file servers uses a
107
protocol called 9P. The protocol consists of 17 messages
108
describing operations on files and directories.
109
Kernel resident device and protocol drivers use a procedural version
110
of the protocol while external file servers use an RPC form.
111
Nearly all traffic between Plan 9 systems consists
112
of 9P messages.
113
9P relies on several properties of the underlying transport protocol.
114
It assumes messages arrive reliably and in sequence and
115
that delimiters between messages
116
are preserved.
117
When a protocol does not meet these
118
requirements (for example, TCP does not preserve delimiters)
119
we provide mechanisms to marshal messages before handing them
120
to the system.
121
.PP
122
A kernel data structure, the
123
.I channel ,
124
is a handle to a file server.
125
Operations on a channel generate the following 9P messages.
126
The
127
.CW session
128
and
129
.CW attach
130
messages authenticate a connection, established by means external to 9P,
131
and validate its user.
132
The result is an authenticated
133
channel
134
referencing the root of the
135
server.
136
The
137
.CW clone
138
message makes a new channel identical to an existing channel, much like
139
the
140
.CW dup
141
system call.
142
A
143
channel
144
may be moved to a file on the server using a
145
.CW walk
146
message to descend each level in the hierarchy.
147
The
148
.CW stat
149
and
150
.CW wstat
151
messages read and write the attributes of the file referenced by a channel.
152
The
153
.CW open
154
message prepares a channel for subsequent
155
.CW read
156
and
157
.CW write
158
messages to access the contents of the file.
159
.CW Create
160
and
161
.CW remove
162
perform the actions implied by their names on the file
163
referenced by the channel.
164
The
165
.CW clunk
166
message discards a channel without affecting the file.
167
.PP
168
A kernel resident file server called the
169
.I "mount driver"
170
converts the procedural version of 9P into RPCs.
171
The
172
.I mount
173
system call provides a file descriptor, which can be
174
a pipe to a user process or a network connection to a remote machine, to
175
be associated with the mount point.
176
After a mount, operations
177
on the file tree below the mount point are sent as messages to the file server.
178
The
179
mount
180
driver manages buffers, packs and unpacks parameters from
181
messages, and demultiplexes among processes using the file server.
182
.NH 2
183
Kernel Organization
184
.PP
185
The network code in the kernel is divided into three layers: hardware interface,
186
protocol processing, and program interface.
187
A device driver typically uses streams to connect the two interface layers.
188
Additional stream modules may be pushed on
189
a device to process protocols.
190
Each device driver is a kernel-resident file system.
191
Simple device drivers serve a single level
192
directory containing just a few files;
193
for example, we represent each UART
194
by a data and a control file.
195
.P1
cpu% cd /dev
cpu% ls -l eia*
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
--rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
cpu%
.P2
204
The control file is used to control the device;
205
writing the string
206
.CW b1200
207
to
208
.CW /dev/eia1ctl
209
sets the line to 1200 baud.
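.PP
For example, a program can set the speed with two system calls.
The following fragment is a minimal sketch using the control string above;
it is illustrative rather than a routine taken from the system:
.P1
void
setspeed(void)
{
	int fd;

	fd = open("/dev/eia1ctl", OWRITE);
	if(fd < 0)
		return;
	write(fd, "b1200", 5);	/* the same string a user could write from the shell */
	close(fd);
}
.P2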
.PP
211
Multiplexed devices present
212
a more complex interface structure.
213
For example, the LANCE Ethernet driver
214
serves a two level file tree (Figure 1)
215
providing
216
.IP \(bu
217
device control and configuration
218
.IP \(bu
219
user-level protocols like ARP
220
.IP \(bu
221
diagnostic interfaces for snooping software.
222
.LP
223
The top directory contains a
224
.CW clone
225
file and a directory for each connection, numbered
226
.CW 1
227
to
228
.CW n .
229
Each connection directory corresponds to an Ethernet packet type.
230
Opening the
231
.CW clone
232
file finds an unused connection directory
233
and opens its
234
.CW ctl
235
file.
236
Reading the control file returns the ASCII connection number; the user
237
process can use this value to construct the name of the proper 
238
connection directory.
239
In each connection directory files named
240
.CW ctl , 
241
.CW data , 
242
.CW stats ,
243
and 
244
.CW type
245
provide access to the connection.
246
Writing the string
247
.CW "connect 2048"
248
to the
249
.CW ctl
250
file sets the packet type to 2048
251
and
252
configures the connection to receive
253
all IP packets sent to the machine.
254
Subsequent reads of the file
255
.CW type
256
yield the string
257
.CW 2048 .
258
The
259
.CW data
260
file accesses the media;
261
reading it
262
returns the
263
next packet of the selected type.
264
Writing the file
265
queues a packet for transmission after
266
appending a packet header containing the source address and packet type.
267
The
268
.CW stats
269
file returns ASCII text containing the interface address,
270
packet input/output counts, error statistics, and general information
271
about the state of the interface.
272
.so tree.pout
273
.PP
274
If several connections on an interface
275
are configured for a particular packet type, each receives a
276
copy of the incoming packets.
277
The special packet type
278
.CW -1
279
selects all packets.
280
Writing the strings
281
.CW promiscuous
282
and
283
.CW connect
284
.CW -1
285
to the
286
.CW ctl
287
file
288
configures a conversation to receive all packets on the Ethernet.
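.PP
In terms of system calls this is just two writes on the connection's
.CW ctl
file.
The fragment below is a sketch; it assumes the connection directory was
already obtained by opening the
.CW clone
file:
.P1
write(cfd, "connect -1", 10);
write(cfd, "promiscuous", 11);
.P2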
.PP
290
Although the driver interface may seem elaborate,
291
the representation of a device as a set of files using ASCII strings for
292
communication has several advantages.
293
Any mechanism supporting remote access to files immediately
294
allows a remote machine to use our interfaces as gateways.
295
Using ASCII strings to control the interface avoids byte order problems and
296
ensures a uniform representation for
297
devices on the same machine and even allows devices to be accessed remotely.
298
Representing dissimilar devices by the same set of files allows common tools
299
to serve
300
several networks or interfaces.
301
Programs like
302
.CW stty
303
are replaced by
304
.CW echo
305
and shell redirection.
306
.NH 2
307
Protocol devices
308
.PP
309
Network connections are represented as pseudo-devices called protocol devices.
310
Protocol device drivers exist for the Datakit URP protocol and for each of the
311
Internet IP protocols TCP, UDP, and IL.
312
IL, described below, is a new communication protocol used by Plan 9 for
313
transmitting file system RPCs.
314
All protocol devices look identical so user programs contain no
315
network-specific code.
316
.PP
317
Each protocol device driver serves a directory structure
318
similar to that of the Ethernet driver.
319
The top directory contains a
320
.CW clone
321
file and a directory for each connection numbered
322
.CW 0
323
to
324
.CW n .
325
Each connection directory contains files to control one
326
connection and to send and receive information.
327
A TCP connection directory looks like this:
328
.P1
cpu% cd /net/tcp/2
cpu% ls -l
--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 ctl
--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 data
--rw-rw---- I 0 ehg    bootes 0 Jul 13 21:14 listen
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
--r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
cpu% cat local remote status
135.104.9.31 5012
135.104.53.11 564
tcp/2 1 Established connect
cpu%
.P2
343
The files
344
.CW local ,
345
.CW remote ,
346
and
347
.CW status
348
supply information about the state of the connection.
349
The
350
.CW data
351
and
352
.CW ctl
353
files
354
provide access to the process end of the stream implementing the protocol.
355
The
356
.CW listen
357
file is used to accept incoming calls from the network.
358
.PP
359
The following steps establish a connection; a C sketch of the complete sequence appears at the end of this subsection.
360
.IP 1)
361
The clone device of the
362
appropriate protocol directory is opened to reserve an unused connection.
363
.IP 2)
364
The file descriptor returned by the open points to the
365
.CW ctl
366
file of the new connection.
367
Reading that file descriptor returns an ASCII string containing
368
the connection number.
369
.IP 3)
370
A protocol/network specific ASCII address string is written to the
371
.CW ctl
372
file.
373
.IP 4)
374
The path of the
375
.CW data
376
file is constructed using the connection number.
377
When the
378
.CW data
379
file is opened the connection is established.
380
.LP
381
A process can read and write this file descriptor
382
to send and receive messages from the network.
383
If the process opens the
384
.CW listen
385
file it blocks until an incoming call is received.
386
An address string written to the
387
.CW ctl
388
file before the listen selects the
389
ports or services the process is prepared to accept.
390
When an incoming call is received, the open completes
391
and returns a file descriptor
392
pointing to the
393
.CW ctl
394
file of the new connection.
395
Reading the
396
.CW ctl
397
file yields a connection number used to construct the path of the
398
.CW data
399
file.
400
A connection remains established while any of the files in the connection directory
401
are referenced or until a close is received from the network.
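.PP
The following fragment sketches this sequence for TCP.
It is illustrative rather than an excerpt from the system,
with only minimal error handling; the
.CW dial
library routine described later packages the same steps.
.P1
int
calltcp(char *addr)	/* network-specific address string for the ctl file */
{
	int cfd, dfd, n;
	char num[32], name[64];

	cfd = open("/net/tcp/clone", ORDWR);	/* 1: reserve a connection */
	if(cfd < 0)
		return -1;
	n = read(cfd, num, sizeof(num)-1);	/* 2: read the connection number */
	if(n <= 0){
		close(cfd);
		return -1;
	}
	num[n] = 0;
	write(cfd, addr, strlen(addr));		/* 3: write the address string */
	snprint(name, sizeof(name), "/net/tcp/%s/data", num);
	dfd = open(name, ORDWR);		/* 4: opening data establishes the call */
	close(cfd);
	return dfd;
}
.P2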
.NH 2
403
Streams
404
.PP
405
A
406
.I stream 
407
[Rit84a][Presotto] is a bidirectional channel connecting a
408
physical or pseudo-device to user processes.
409
The user processes insert and remove data at one end of the stream.
410
Kernel processes acting on behalf of a device insert data at
411
the other end.
412
Asynchronous communications channels such as pipes,
413
TCP conversations, Datakit conversations, and RS232 lines are implemented using
414
streams.
415
.PP
416
A stream comprises a linear list of
417
.I "processing modules" .
418
Each module has both an upstream (toward the process) and
419
downstream (toward the device)
420
.I "put routine" .
421
Calling the put routine of the module on either end of the stream
422
inserts data into the stream.
423
Each module calls the succeeding one to send data up or down the stream.
424
.PP
425
An instance of a processing module is represented by a pair of
426
.I queues ,
427
one for each direction.
428
The queues point to the put procedures and can be used
429
to queue information traveling along the stream.
430
Some put routines queue data locally and send it along the stream at some
431
later time, either due to a subsequent call or an asynchronous
432
event such as a retransmission timer or a device interrupt.
433
Processing modules create helper kernel processes to
434
provide a context for handling asynchronous events.
435
For example, a helper kernel process awakens periodically
436
to perform any necessary TCP retransmissions.
437
The use of kernel processes instead of serialized run-to-completion service routines
438
differs from the implementation of Unix streams.
439
Unix service routines cannot
440
use any blocking kernel resource and they lack a local long-lived state.
441
Helper kernel processes solve these problems and simplify the stream code.
442
.PP
443
There is no implicit synchronization in our streams.
444
Each processing module must ensure that concurrent processes using the stream
445
are synchronized.
446
This maximizes concurrency but introduces the
447
possibility of deadlock.
448
However, deadlocks are easily avoided by careful programming; to
449
date they have not caused us problems.
450
.PP
451
Information is represented by linked lists of kernel structures called
452
.I blocks .
453
Each block contains a type, some state flags, and pointers to
454
an optional buffer.
455
Block buffers can hold either data or control information, i.e., directives
456
to the processing modules.
457
Blocks and block buffers are dynamically allocated from kernel memory.
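.PP
Schematically, a block might be declared as follows.
The field names here are illustrative and are not the kernel's exact declarations:
.P1
typedef struct Block Block;
struct Block
{
	Block	*next;	/* next block in the list */
	int	type;	/* data or control */
	int	flags;	/* state flags, e.g. the delimiter mark */
	uchar	*base;	/* start of the optional buffer */
	uchar	*rptr;	/* first byte not yet consumed */
	uchar	*wptr;	/* first free byte */
};
.P2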
.NH 3
459
User Interface
460
.PP
461
A stream is represented at user level as two files, 
462
.CW ctl
463
and
464
.CW data .
465
The actual names can be changed by the device driver using the stream,
466
as we saw earlier in the example of the UART driver.
467
The first process to open either file creates the stream automatically.
468
The last close destroys it.
469
Writing to the
470
.CW data
471
file copies the data into kernel blocks
472
and passes them to the downstream put routine of the first processing module.
473
A write of less than 32K is guaranteed to be contained by a single block.
474
Concurrent writes to the same stream are not synchronized, although the
475
32K block size assures atomic writes for most protocols.
476
The last block written is flagged with a delimiter
477
to alert downstream modules that care about write boundaries.
478
In most cases the first put routine calls the second, the second
479
calls the third, and so on until the data is output.
480
As a consequence, most data is output without context switching.
481
.PP
482
Reading from the
483
.CW data
484
file returns data queued at the top of the stream.
485
The read terminates when the read count is reached
486
or when the end of a delimited block is encountered.
487
A per stream read lock ensures only one process
488
can read from a stream at a time and guarantees
489
that the bytes read were contiguous bytes from the
490
stream.
491
.PP
492
Like UNIX streams [Rit84a],
493
Plan 9 streams can be dynamically configured.
494
The stream system intercepts and interprets
495
the following control blocks:
496
.IP "\f(CWpush\fP \fIname\fR" 15
497
adds an instance of the processing module 
498
.I name
499
to the top of the stream.
500
.IP \f(CWpop\fP 15
501
removes the top module of the stream.
502
.IP \f(CWhangup\fP 15
503
sends a hangup message
504
up the stream from the device end.
505
.LP
506
Other control blocks are module-specific and are interpreted by each
507
processing module
508
as they pass.
509
.PP
510
The convoluted syntax and semantics of the UNIX
511
.CW ioctl
512
system call convinced us to leave it out of Plan 9.
513
Instead,
514
.CW ioctl
515
is replaced by the
516
.CW ctl
517
file.
518
Writing to the
519
.CW ctl
520
file
521
is identical to writing to a
522
.CW data
523
file except the blocks are of type
524
.I control .
525
A processing module parses each control block it sees.
526
Commands in control blocks are ASCII strings, so
527
byte ordering is not an issue when one system
528
controls streams in a name space implemented on another processor.
529
The time to parse control blocks is not important, since control
530
operations are rare.
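.PP
Pushing a module is therefore nothing more than a string written to the stream's
.CW ctl
file.
In the sketch below both the path and the module name are illustrative:
.P1
int cfd;

cfd = open("/net/dk/3/ctl", OWRITE);
write(cfd, "push compress", 13);
.P2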
.NH 3
532
Device Interface
533
.PP
534
The module at the downstream end of the stream is part of a device interface.
535
The particulars of the interface vary with the device.
536
Most device interfaces consist of an interrupt routine, an output
537
put routine, and a kernel process.
538
The output put routine stages data for the
539
device and starts the device if it is stopped.
540
The interrupt routine wakes up the kernel process whenever
541
the device has input to be processed or needs more output staged.
542
The kernel process puts information up the stream or stages more data for output.
543
The division of labor among the different pieces varies depending on
544
how much must be done at interrupt level.
545
However, the interrupt routine may not allocate blocks or call
546
a put routine since both actions require a process context.
547
.NH 3
548
Multiplexing
549
.PP
550
The conversations using a protocol device must be
551
multiplexed onto a single physical wire.
552
We push a multiplexer processing module
553
onto the physical device stream to group the conversations.
554
The device end modules on the conversations add the necessary header
555
onto downstream messages and then put them to the module downstream
556
of the multiplexer.
557
The multiplexing module looks at each message moving up its stream and
558
puts it to the correct conversation stream after stripping
559
the header controlling the demultiplexing.
560
.PP
561
This is similar to the Unix implementation of multiplexer streams.
562
The major difference is that we have no general structure that
563
corresponds to a multiplexer.
564
Each attempt to produce a generalized multiplexer created a more complicated
565
structure and underlined the basic difficulty of generalizing this mechanism.
566
We now code each multiplexer from scratch and favor simplicity over
567
generality.
568
.NH 3
569
Reflections
570
.PP
571
Despite five years' experience and the efforts of many programmers,
572
we remain dissatisfied with the stream mechanism.
573
Performance is not an issue;
574
the time to process protocols and drive
575
device interfaces continues to dwarf the
576
time spent allocating, freeing, and moving blocks
577
of data.
578
However, the mechanism remains inordinately
579
complex.
580
Much of the complexity results from our efforts
581
to make streams dynamically configurable, to
582
reuse processing modules on different devices
583
and to provide kernel synchronization
584
to ensure data structures
585
don't disappear under foot.
586
This is particularly irritating since we seldom use these properties.
587
.PP
588
Streams remain in our kernel because we are unable to
589
devise a better alternative.
590
Larry Peterson's X-kernel [Pet89a]
591
is the closest contender but
592
doesn't offer enough advantage to switch.
593
If we were to rewrite the streams code, we would probably statically
594
allocate resources for a large fixed number of conversations and burn
595
memory in favor of less complexity.
596
.NH
597
The IL Protocol
598
.PP
599
None of the standard IP protocols is suitable for transmission of
600
9P messages over an Ethernet or the Internet.
601
TCP has a high overhead and does not preserve delimiters.
602
UDP, while cheap, does not provide reliable sequenced delivery.
603
Early versions of the system used a custom protocol that was
604
efficient but unsatisfactory for internetwork transmission.
605
When we implemented IP, TCP, and UDP we looked around for a suitable
606
replacement with the following properties:
607
.IP \(bu
608
Reliable datagram service with sequenced delivery
609
.IP \(bu
610
Runs over IP
611
.IP \(bu
612
Low complexity, high performance
613
.IP \(bu
614
Adaptive timeouts
615
.LP
616
None met our needs so a new protocol was designed.
617
IL is a lightweight protocol designed to be encapsulated by IP.
618
It is a connection-based protocol
619
providing reliable transmission of sequenced messages between machines.
620
No provision is made for flow control since the protocol is designed to transport RPC
621
messages between client and server.
622
A small outstanding message window prevents too
623
many incoming messages from being buffered;
624
messages outside the window are discarded
625
and must be retransmitted.
626
Connection setup uses a two way handshake to generate
627
initial sequence numbers at each end of the connection;
628
subsequent data messages increment the
629
sequence numbers allowing
630
the receiver to resequence out of order messages. 
631
In contrast to other protocols, IL does not do blind retransmission.
632
If a message is lost and a timeout occurs, a query message is sent.
633
The query message is a small control message containing the current
634
sequence numbers as seen by the sender.
635
The receiver responds to a query by retransmitting missing messages.
636
This allows the protocol to behave well in congested networks,
637
where blind retransmission would cause further
638
congestion.
639
Like TCP, IL has adaptive timeouts.
640
A round-trip timer is used
641
to calculate acknowledge and retransmission times in terms of the network speed.
642
This allows the protocol to perform well on both the Internet and on local Ethernets.
643
.PP
644
In keeping with the minimalist design of the rest of the kernel, IL is small.
645
The entire protocol is 847 lines of code, compared to 2200 lines for TCP.
646
IL is our protocol of choice.
647
.NH
648
Network Addressing
649
.PP
650
A uniform interface to protocols and devices is not sufficient to
651
support the transparency we require.
652
Since each network uses a different
653
addressing scheme,
654
the ASCII strings written to a control file have no common format.
655
As a result, every tool must know the specifics of the networks it
656
is capable of addressing.
657
Moreover, since each machine supplies a subset
658
of the available networks, each user must be aware of the networks supported
659
by every terminal and server machine.
660
This is obviously unacceptable.
661
.PP
662
Several possible solutions were considered and rejected; one deserves
663
more discussion.
664
We could have used a user-level file server
665
to represent the network name space as a Plan 9 file tree. 
666
This global naming scheme has been implemented in other distributed systems.
667
The file hierarchy provides paths to
668
directories representing network domains.
669
Each directory contains
670
files representing the names of the machines in that domain;
671
an example might be the path
672
.CW /net/name/usa/edu/mit/ai .
673
Each machine file contains information like the IP address of the machine.
674
We rejected this representation for several reasons.
675
First, it is hard to devise a hierarchy encompassing all representations
676
of the various network addressing schemes in a uniform manner.
677
Datakit and Ethernet address strings have nothing in common.
678
Second, the address of a machine is
679
often only a small part of the information required to connect to a service on
680
the machine.
681
For example, the IP protocols require symbolic service names to be mapped into
682
numeric port numbers, some of which are privileged and hence special.
683
Information of this sort is hard to represent in terms of file operations.
684
Finally, the size and number of the networks being represented burdens users with
685
an unacceptably large amount of information about the organization of the network
686
and its connectivity.
687
In this case the Plan 9 representation of a
688
resource as a file is not appropriate.
689
.PP
690
If tools are to be network independent, a third-party server must resolve
691
network names.
692
A server on each machine, with local knowledge, can select the best network
693
for any particular destination machine or service.
694
Since the network devices present a common interface,
695
the only operation which differs between networks is name resolution.
696
A symbolic name must be translated to
697
the path of the clone file of a protocol
698
device and an ASCII address string to write to the
699
.CW ctl
700
file.
701
A connection server (CS) provides this service.
702
.NH 2
703
Network Database
704
.PP
705
On most systems several
706
files such as
707
.CW /etc/hosts ,
708
.CW /etc/networks ,
709
.CW /etc/services ,
710
.CW /etc/hosts.equiv ,
711
.CW /etc/bootptab ,
712
and
713
.CW /etc/named.d
714
hold network information.
715
Much time and effort is spent
716
administering these files and keeping
717
them mutually consistent.
718
Tools attempt to
719
automatically derive one or more of the files from
720
information in other files but maintenance continues to be
721
difficult and error prone.
722
.PP
723
Since we were writing an entirely new system, we were free to
724
try a simpler approach.
725
One database on a shared server contains all the information
726
needed for network administration.
727
Two ASCII files comprise the main database:
728
.CW /lib/ndb/local
729
contains locally administered information and
730
.CW /lib/ndb/global
731
contains information imported from elsewhere.
732
The files contain sets of attribute/value pairs of the form
733
.I attr\f(CW=\fPvalue ,
734
where
735
.I attr
736
and
737
.I value
738
are alphanumeric strings.
739
Systems are described by multi-line entries;
740
a header line at the left margin begins each entry followed by zero or more
741
indented attribute/value pairs specifying
742
names, addresses, properties, etc.
743
For example, the entry for our CPU server
744
specifies a domain name, an IP address, an Ethernet address,
745
a Datakit address, a boot file, and supported protocols.
746
.P1
sys=helix
	dom=helix.research.bell-labs.com
	bootf=/mips/9power
	ip=135.104.9.31 ether=0800690222f0
	dk=nj/astro/helix
	proto=il flavor=9cpu
.P2
754
If several systems share entries such as
755
network mask and gateway, we specify that information
756
with the network or subnetwork instead of the system.
757
The following entries define a Class B IP network and 
758
a few subnets derived from it.
759
The entry for the network specifies the IP mask,
760
file system, and authentication server for all systems
761
on the network.
762
Each subnetwork specifies its default IP gateway.
763
.P1
ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
	fs=bootes.research.bell-labs.com
	auth=1127auth
ipnet=unix-room ip=135.104.117.0
	ipgw=135.104.117.1
ipnet=third-floor ip=135.104.51.0
	ipgw=135.104.51.1
ipnet=fourth-floor ip=135.104.52.0
	ipgw=135.104.52.1
.P2
774
Database entries also define the mapping of service names
775
to port numbers for TCP, UDP, and IL.
776
.P1
tcp=echo	port=7
tcp=discard	port=9
tcp=systat	port=11
tcp=daytime	port=13
.P2
782
.PP
783
All programs read the database directly so
784
consistency problems are rare.
785
However, the database files can become large.
786
Our global file, containing all information about
787
both Datakit and Internet systems in AT&T, has 43,000
788
lines.
789
To speed searches, we build hash table files for each
790
attribute we expect to search often.
791
The hash file entries point to entries
792
in the master files.
793
Every hash file contains the modification time of its master
794
file so we can avoid using an out-of-date hash table.
795
Searches for attributes that aren't hashed or whose hash table
796
is out of date still work; they just take longer.
797
.NH 2
798
Connection Server
799
.PP
800
On each system a user level connection server process, CS, translates
801
symbolic names to addresses.
802
CS uses information about available networks, the network database, and
803
other servers (such as DNS) to translate names.
804
CS is a file server serving a single file,
805
.CW /net/cs .
806
A client writes a symbolic name to
807
.CW /net/cs
808
then reads one line for each matching destination reachable
809
from this system.
810
The lines are of the form
811
.I "filename message",
812
where
813
.I filename
814
is the path of the clone file to open for a new connection and
815
.I message
816
is the string to write to it to make the connection.
817
The following example illustrates this.
818
.CW Ndb/csquery
819
is a program that prompts for strings to write to
820
.CW /net/cs
821
and prints the replies.
822
.P1
% ndb/csquery
> net!helix!9fs
/net/il/clone 135.104.9.31!17008
/net/dk/clone nj/astro/helix!9fs
.P2
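.PP
A program can make the same query directly.
The following sketch writes the symbolic name to
.CW /net/cs
and copies back one line per matching destination, as described above:
.P1
void
query(void)
{
	char buf[128];
	int fd, n;

	fd = open("/net/cs", ORDWR);
	if(fd < 0)
		return;
	write(fd, "net!helix!9fs", 13);
	while((n = read(fd, buf, sizeof(buf))) > 0)
		write(1, buf, n);	/* e.g. /net/il/clone 135.104.9.31!17008 */
	close(fd);
}
.P2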
.PP
829
CS provides meta-name translation to perform complicated
830
searches.
831
The special network name
832
.CW net
833
selects any network in common between source and
834
destination supporting the specified service.
835
A host name of the form \f(CW$\fIattr\f1
836
is the name of an attribute in the network database.
837
The database search returns the value
838
of the matching attribute/value pair
839
most closely associated with the source host.
840
``Most closely associated'' is defined on a per-network basis.
841
For example, the symbolic name
842
.CW tcp!$auth!rexauth
843
causes CS to search for the
844
.CW auth
845
attribute in the database entry for the source system, then its
846
subnetwork (if there is one) and then its network.
847
.P1
% ndb/csquery
> net!$auth!rexauth
/net/il/clone 135.104.9.34!17021
/net/dk/clone nj/astro/p9auth!rexauth
/net/il/clone 135.104.9.6!17021
/net/dk/clone nj/astro/musca!rexauth
.P2
855
.PP
856
Normally CS derives naming information from its database files.
857
For domain names, however, CS first consults another user level
858
process, the domain name server (DNS).
859
If no DNS is reachable, CS relies on its own tables.
860
.PP
861
Like CS, the domain name server is a user level process providing
862
one file,
863
.CW /net/dns .
864
A client writes a request of the form
865
.I "domain-name type" ,
866
where
867
.I type
868
is a domain name service resource record type.
869
DNS performs a recursive query through the
870
Internet domain name system producing one line
871
per resource record found.  The client reads
872
.CW /net/dns 
873
to retrieve the records.
874
Like other domain name servers, DNS caches information
875
learned from the network.
876
DNS is implemented as a multi-process shared memory application
877
with separate processes listening for network and local requests.
878
.NH
879
Library routines
880
.PP
881
The section on protocol devices described the details
882
of making and receiving connections across a network.
883
The dance is straightforward but tedious.
884
Library routines are provided to relieve
885
the programmer of the details.
886
.NH 2
887
Connecting
888
.PP
889
The
890
.CW dial
891
library call establishes a connection to a remote destination.
892
It
893
returns an open file descriptor for the
894
.CW data
895
file in the connection directory.
896
.P1
int  dial(char *dest, char *local, char *dir, int *cfdp)
.P2
899
.IP \f(CWdest\fP 10
900
is the symbolic name/address of the destination.
901
.IP \f(CWlocal\fP 10
902
is the local address.
903
Since most networks do not support this, it is
904
usually zero.
905
.IP \f(CWdir\fP 10
906
is a pointer to a buffer to hold the path name of the protocol directory
907
representing this connection.
908
.CW Dial
909
fills this buffer if the pointer is non-zero.
910
.IP \f(CWcfdp\fP 10
911
is a pointer to a file descriptor for the
912
.CW ctl
913
file of the connection.
914
If the pointer is non-zero,
915
.CW dial
916
opens the control file and tucks the file descriptor here.
917
.LP
918
Most programs call
919
.CW dial
920
with a destination name and all other arguments zero.
921
.CW Dial
922
uses CS to
923
translate the symbolic name to all possible destination addresses
924
and attempts to connect to each in turn until one works.
925
Specifying the special name
926
.CW net
927
in the network portion of the destination
928
allows CS to pick a network/protocol in common
929
with the destination for which the requested service is valid.
930
For example, assume the system
931
.CW research.bell-labs.com
932
has the Datakit address
933
.CW nj/astro/research
934
and IP addresses
935
.CW 135.104.117.5
936
and
937
.CW 129.11.4.1 .
938
The call
939
.P1
fd = dial("net!research.bell-labs.com!login", 0, 0, 0);
.P2
942
tries in succession to connect to
943
.CW nj/astro/research!login
944
on the Datakit and both
945
.CW 135.104.117.5!513
946
and
947
.CW 129.11.4.1!513
948
across the Internet.
949
.PP
950
.CW Dial
951
accepts addresses instead of symbolic names.
952
For example, the destinations
953
.CW tcp!135.104.117.5!513
954
and
955
.CW tcp!research.bell-labs.com!login
956
are equivalent
957
references to the same machine.
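.PP
A complete client is correspondingly short.
The following sketch, not an excerpt from the manual, connects to the
.CW discard
service listed earlier in the network database and sends it a greeting:
.P1
void
greet(void)
{
	int fd;

	fd = dial("net!helix!discard", 0, 0, 0);
	if(fd < 0)
		return;
	write(fd, "hello", 5);
	close(fd);
}
.P2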
.NH 2
959
Listening
960
.PP
961
A program uses
962
four routines to listen for incoming connections.
963
It first
964
.CW announce() s
965
its intention to receive connections,
966
then
967
.CW listen() s
968
for calls and finally
969
.CW accept() s
970
or
971
.CW reject() s
972
them.
973
.CW Announce
974
returns an open file descriptor for the
975
.CW ctl
976
file of a connection and fills
977
.CW dir
978
with the
979
path of the protocol directory
980
for the announcement.
981
.P1
int  announce(char *addr, char *dir)
.P2
984
.CW Addr
985
is the symbolic name/address announced;
986
if it does not contain a service, the announcement is for
987
all services not explicitly announced.
988
Thus, one can easily write the equivalent of the
989
.CW inetd
990
program without
991
having to announce each separate service.
992
An announcement remains in force until the control file is
993
closed.
994
.LP
995
.CW Listen
996
returns an open file descriptor for the
997
.CW ctl
998
file and fills
999
.CW ldir
1000
with the path
1001
of the protocol directory
1002
for the received connection.
1003
It is passed
1004
.CW dir
1005
from the announcement.
1006
.P1
int  listen(char *dir, char *ldir)
.P2
1009
.LP
1010
.CW Accept
1011
and
1012
.CW reject
1013
are called with the control file descriptor and
1014
.CW ldir
1015
returned by
1016
.CW listen .
1017
Some networks such as Datakit accept a reason for a rejection;
1018
networks such as IP ignore the third argument.
1019
.P1
int  accept(int ctl, char *ldir)
int  reject(int ctl, char *ldir, char *reason)
.P2
1023
.PP
1024
The following code implements a typical TCP listener.
1025
It announces itself, listens for connections, and forks a new
1026
process for each.
1027
The new process echoes data on the connection until the
1028
remote end closes it.
1029
The "*" in the symbolic name means the announcement is valid for
1030
any addresses bound to the machine the program is run on.
1031
.P1
.ta 8n 16n 24n 32n 40n 48n 56n 64n
int
echo_server(void)
{
	int afd, dfd, lcfd;
	char adir[40], ldir[40];
	int n;
	char buf[256];

	afd = announce("tcp!*!echo", adir);
	if(afd < 0)
		return -1;

	for(;;){
		/* listen for a call */
		lcfd = listen(adir, ldir);
		if(lcfd < 0)
			return -1;

		/* fork a process to echo */
		switch(fork()){
		case 0:
			/* accept the call and open the data file */
			dfd = accept(lcfd, ldir);
			if(dfd < 0)
				return -1;

			/* echo until EOF */
			while((n = read(dfd, buf, sizeof(buf))) > 0)
				write(dfd, buf, n);
			exits(0);
		case -1:
			perror("forking");
		default:
			close(lcfd);
			break;
		}
	}
}
.P2
1073
.NH
1074
User Level
1075
.PP
1076
Communication between Plan 9 machines is done almost exclusively in
1077
terms of 9P messages. Only the two services
1078
.CW cpu
1079
and
1080
.CW exportfs
1081
are used.
1082
The
1083
.CW cpu
1084
service is analogous to
1085
.CW rlogin .
1086
However, rather than emulating a terminal session
1087
across the network,
1088
.CW cpu
1089
creates a process on the remote machine whose name space is an analogue of the window
1090
in which it was invoked.
1091
.CW Exportfs
1092
is a user level file server which allows a piece of name space to be
1093
exported from machine to machine across a network. It is used by the
1094
.CW cpu
1095
command to serve the files in the terminal's name space when they are
1096
accessed from the
1097
cpu server.
1098
.PP
1099
By convention, the protocol and device driver file systems are mounted in a
1100
directory called
1101
.CW /net .
1102
Although the per-process name space allows users to configure an
1103
arbitrary view of the system, in practice their profiles build
1104
a conventional name space.
1105
.NH 2
1106
Exportfs
1107
.PP
1108
.CW Exportfs
1109
is invoked by an incoming network call.
1110
The
1111
.I listener
1112
(the Plan 9 equivalent of
1113
.CW inetd )
1114
runs the profile of the user
1115
requesting the service to construct a name space before starting
1116
.CW exportfs .
1117
After an initial protocol
1118
establishes the root of the file tree being
1119
exported,
1120
the remote process mounts the connection,
1121
allowing
1122
.CW exportfs
1123
to act as a relay file server. Operations in the imported file tree
1124
are executed on the remote server and the results returned.
1125
As a result
1126
the name space of the remote machine appears to be exported into a
1127
local file tree.
1128
.PP
1129
The
1130
.CW import
1131
command calls
1132
.CW exportfs
1133
on a remote machine, mounts the result in the local name space,
1134
and
1135
exits.
1136
No local process is required to serve mounts;
1137
9P messages are generated by the kernel's mount driver and sent
1138
directly over the network.
1139
.PP
1140
.CW Exportfs
1141
must be multithreaded since the system calls
1142
.CW open,
1143
.CW read
1144
and
1145
.CW write
1146
may block.
1147
Plan 9 does not implement the 
1148
.CW select
1149
system call but does allow processes to share file descriptors,
1150
memory and other resources.
1151
.CW Exportfs
1152
and the configurable name space
1153
provide a means of sharing resources between machines.
1154
It is a building block for constructing complex name spaces
1155
served from many machines.
1156
.PP
1157
The simplicity of the interfaces encourages naive users to exploit the potential
1158
of a richly connected environment.
1159
Using these tools it is easy to gateway between networks.
1160
For example a terminal with only a Datakit connection can import from the server
1161
.CW helix :
1162
.P1
import -a helix /net
telnet ai.mit.edu
.P2
1166
The
1167
.CW import
1168
command makes a Datakit connection to the machine
1169
.CW helix
1170
where
1171
it starts an instance of
1172
.CW exportfs
1173
to serve
1174
.CW /net .
1175
The
1176
.CW import
1177
command mounts the remote
1178
.CW /net
1179
directory after (the
1180
.CW -a
1181
option to
1182
.CW import )
1183
the existing contents
1184
of the local
1185
.CW /net
1186
directory.
1187
The directory contains the union of the local and remote contents of
1188
.CW /net .
1189
Local entries supersede remote ones of the same name so
1190
networks on the local machine are chosen in preference
1191
to those supplied remotely.
1192
However, unique entries in the remote directory are now visible in the local
1193
.CW /net 
1194
directory.
1195
All the networks connected to
1196
.CW helix ,
1197
not just Datakit,
1198
are now available in the terminal. The effect on the name space is shown by the following
1199
example:
1200
.P1
philw-gnot% ls /net
/net/cs
/net/dk
philw-gnot% import -a musca /net
philw-gnot% ls /net
/net/cs
/net/cs
/net/dk
/net/dk
/net/dns
/net/ether
/net/il
/net/tcp
/net/udp
.P2
1216
.NH 2
1217
Ftpfs
1218
.PP
1219
We decided to make our interface to FTP
1220
a file system rather than the traditional command.
1221
Our command,
1222
.I ftpfs,
1223
dials the FTP port of a remote system, prompts for login and password, sets image mode,
1224
and mounts the remote file system onto
1225
.CW /n/ftp .
1226
Files and directories are cached to reduce traffic.
1227
The cache is updated whenever a file is created.
1228
Ftpfs works with TOPS-20, VMS, and various Unix flavors
1229
as the remote system.
1230
.NH
1231
Cyclone Fiber Links
1232
.PP
1233
The file servers and CPU servers are connected by
1234
high-bandwidth
1235
point-to-point links.
1236
A link consists of two VME cards connected by a pair of optical
1237
fibers.
1238
The VME cards use 33MHz Intel 960 processors and AMD's TAXI
1239
fiber transmitter/receivers to drive the lines at 125 Mbit/sec.
1240
Software in the VME card reduces latency by copying messages from system memory
1241
to fiber without intermediate buffering.
1242
.NH
1243
Performance
1244
.PP
1245
We measured both latency and throughput
1246
of reading and writing bytes between two processes
1247
for a number of different paths.
1248
Measurements were made on two- and four-CPU SGI Power Series processors.
1249
The CPUs are 25 MHz MIPS 3000s.
1250
The latency is measured as the round trip time
1251
for a byte sent from one process to another and
1252
back again.
1253
Throughput is measured using 16k writes from
1254
one process to another.
1255
.DS C
.TS
box, tab(:);
c s s
c | c | c
l | n | n.
Table 1 - Performance
_
test:throughput:latency
:MBytes/sec:millisec
_
pipes:8.15:0.255
_
IL/ether:1.02:1.42
_
URP/Datakit:0.22:1.75
_
Cyclone:3.2:0.375
.TE
.DE
1275
.NH
1276
Conclusion
1277
.PP
1278
The representation of all resources as file systems
1279
coupled with an ASCII interface has proved more powerful
1280
than we had originally imagined.
1281
Resources can be used by any computer in our networks
1282
independent of byte ordering or CPU type.
1283
The connection server provides an elegant means
1284
of decoupling tools from the networks they use.
1285
Users successfully use Plan 9 without knowing the
1286
topology of the system or the networks they use.
1287
More information about 9P can be found in Section 5 of the Plan 9 Programmer's
1288
Manual, Volume I.
1289
.NH
1290
References
1291
.LP
1292
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
1293
``Plan 9 from Bell Labs'',
1294
.I
1295
UKUUG Proc. of the Summer 1990 Conf. ,
1296
.R
London, England,
1297
1990.
1298
.LP
1299
[Needham] R. Needham, ``Names'', in
1300
.I
1301
Distributed systems,
1302
.R
1303
S. Mullender, ed.,
1304
Addison Wesley, 1989.
1305
.LP
1306
[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
1307
.I
1308
UKUUG Proc. of the Summer 1990 Conf. ,
1309
.R
1310
London, England, 1990.
1311
.LP
1312
[Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taft and J. Hupp, ``The
1313
Ethernet Local Network: Three reports'',
1314
.I
1315
CSL-80-2,
1316
.R
1317
XEROX Palo Alto Research Center, February 1980.
1318
.LP
1319
[Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous
1320
and Asynchronous Traffic'', 
1321
.I
1322
Proc. Int'l Conf. on Communication,
1323
.R
1324
Boston, June 1980.
1325
.LP
1326
[Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating new Design Techniques'',
1327
.I
1328
Proc. Twelfth Symp. on Op. Sys. Princ.,
1329
.R
1330
Litchfield Park, AZ, December 1989.
1331
.LP
1332
[Rit84a] D. M. Ritchie, ``A Stream Input-Output System'',
1333
.I
1334
AT&T Bell Laboratories Technical Journal, 68(8),
1335
.R
1336
October 1984.