Subversion Repositories planix.SVN

Rev

Details | Last modification | View Log | RSS feed

Rev Author Line No. Line
2 - 1
.HTML "Fossil, an Archival File Server
2
... .FP times
3
... .fp 1 R R.nomath
4
... .fp 5 CW LucidaSansCW83
5
.TL
6
Fossil, an Archival File Server
7
.AU
8
Sean Quinlan
9
Jim McKie
10
Russ Cox
11
jmk,rsc@plan9.bell-labs.com
12
.AB
13
This paper describes the internals and 
14
operation of Fossil, an archival file server built for Plan 9.
15
Fossil has not yet replaced the current Plan 9 file server
16
and
17
.CW kfs ,
18
but that is our eventual intent.
19
Both fossil and this documentation are
20
works in progress.  Comments on either are most welcome.
21
.AE
22
.de HP
23
.LP
24
..
25
.NH 1
26
Introduction
27
.HP
28
Fossil is an archival file server built for Plan 9.
29
In a typical configuration, it maintains a traditional file system
30
in a local disk partition and periodically archives snapshots of the file system
31
to a Venti server.  These archives are made available through
32
a file system interface.
33
Fossil can also be run without a Venti server, in which case the
34
snapshots (if any) occupy local disk space.
35
.PP
36
The bulk of this paper explains the underlying data structures:
37
Venti trees, the Venti archival file system format, and finally Fossil's
38
file system format.
39
The end of the paper discusses the architecture of the Fossil server.
40
.PP
41
The presentation of the data structures is very detailed, perhaps
42
too detailed for most readers.
43
The intent is to record all the details necessary to make structural
44
changes to the file system format.
45
Feel free to jump ahead when boredom strikes.
46
.NH 1
47
Venti trees and directory hierarchies
48
.HP
49
Venti [3] is an archival block storage server.
50
Once a block is stored, it can be retrieved by presenting the 20-byte
51
SHA1 hash of its contents, called a
52
.I score .
53
Blocks on Venti have a maximum length of about 56 kilobytes,
54
though in practice smaller blocks are used.
55
To store a byte stream of arbitrary length, Venti uses a hash tree.
56
Conceptually, the data stream is broken into fixed-size (say,
57
.I dsize -byte)
58
chunks, which are stored on the Venti server.
59
The resulting scores are concatenated into a new pointer stream, which is
60
broken into fixed size (say,
61
.I psize -byte)
62
chunks, which are stored on the Venti server.
63
.I Psize "" (
64
is different from
65
.I dsize
66
so that we can ensure that each pointer block holds an
67
integral number of pointers.)
68
This yields a new pointer stream, and so on, until there is a single block
69
and finally a single score describing the entire tree.
70
The resulting structure looks like:
71
.PS
72
.ps 8
73
.vs 10
74
boxht=0.1
75
boxwid=0.1
76
 
77
B0: box invis wid 1 "\f(CWVtDataType\fP"
78
move right 0.1
79
L0a: box wid 0.2
80
move right 0.1
81
L0b: box wid 0.2
82
move right 0.1
83
L0c: box invis wid 0.2 "..."
84
move right 0.1
85
 
86
L0d: box wid 0.2
87
move right 0.1
88
L0e: box wid 0.2
89
move right 0.2
90
L0f: box invis wid 0.2 "..."
91
move right 0.2
92
 
93
L0g: box wid 0.2
94
move right 0.1
95
L0h: box wid 0.2
96
move right 0.1
97
L0i: box invis wid 0.2 "..."
98
move right 0.1
99
 
100
L0j: box wid 0.2
101
move right 0.1
102
L0k: box wid 0.2
103
move right 0.1
104
L0l: box invis wid 0.2 "..."
105
move right 0.1
106
L0m: box wid 0.2
107
 
108
define boxppddd {
109
	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
110
	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
111
	X: box invis at 0.1<$1.nw,$1.ne>
112
	Y: box invis at 0.1<$1.sw,$1.se>
113
	line -> from 0.5<X,Y> to $2.nw
114
	X: box invis at 0.3<$1.nw,$1.ne>
115
	Y: box invis at 0.3<$1.sw,$1.se>
116
	line -> from 0.5<X,Y> to $3.nw
117
	"..." at 0.7<$1.w,$1.e>
118
}
119
 
120
define boxppdddp {
121
	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
122
	line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
123
	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
124
	X: box invis at 0.1<$1.nw,$1.ne>
125
	Y: box invis at 0.1<$1.sw,$1.se>
126
	line -> from 0.5<X,Y> to $2.nw
127
	X: box invis at 0.3<$1.nw,$1.ne>
128
	Y: box invis at 0.3<$1.sw,$1.se>
129
	line -> from 0.5<X,Y> to $3.nw
130
	"..." at 0.6<$1.w,$1.e>
131
	X: box invis at 0.9<$1.nw,$1.ne>
132
	Y: box invis at 0.9<$1.sw,$1.se>
133
	line -> from 0.5<X,Y> to $4.nw
134
}
135
 
136
define boxpdddp {
137
	line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
138
	line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
139
	X: box invis at 0.1<$1.nw,$1.ne>
140
	Y: box invis at 0.1<$1.sw,$1.se>
141
	line -> from 0.5<X,Y> to $2.nw
142
	"..." at 0.5<$1.w,$1.e>
143
	X: box invis at 0.9<$1.nw,$1.ne>
144
	Y: box invis at 0.9<$1.sw,$1.se>
145
	line -> from 0.5<X,Y> to $3.nw
146
}
147
 
148
bhd=0.4
149
L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
150
boxppddd(L1abc, L0a, L0b)
151
L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
152
boxppddd(L1def, L0d, L0e)
153
L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
154
boxppddd(L1ghi, L0g, L0h)
155
L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
156
boxppdddp(L1jklm, L0j, L0k, L0m)
157
B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
158
 
159
L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
160
boxppddd(L2abcdef, L1abc, L1def)
161
L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
162
boxpdddp(L2ghijklm, L1ghi, L1jklm)
163
B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
164
 
165
L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
166
boxpdddp(L3atom, L2abcdef, L2ghijklm)
167
B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
168
.PE
169
.LP
170
The leaves are the original data stream.  Those blocks have type
171
.CW VtDataType .
172
The first pointer stream has type
173
.CW VtPointerType0 ,
174
the next has type
175
.CW VtPointerType1 ,
176
and so on.
177
The figure ends with a single block of type
178
.CW VtPointerType2 ,
179
but in general trees can have height up to
180
.CW VtPointerType6 .
181
For a
182
.I dsize
183
of 8192 bytes
184
and
185
.I psize
186
of 8180 bytes (409 pointers),
187
this gives a maximum stream size of approximately 10 zettabytes
188
(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
189
.PP
190
Data blocks are truncated to remove trailing runs of zeros before
191
storage to Venti; they are zero-filled back to
192
.I dsize
193
bytes after retrieval from Venti.
194
Similarly, trailing runs of pointers to zero-length blocks are
195
removed from and added back to pointer blocks.
196
These simple rules happen to make it particularly efficient to store
197
large runs of zeros, as might occur in a data stream with ``holes:''
198
the zero-length block itself can be interpreted as a tree of any
199
depth encoding an all-zero data stream.
200
.PP
201
Reconstructing the data stream requires the score and type of the
202
topmost block in the tree, the data chunk size, the pointer chunk size,
203
and the data stream size.
204
(From the data stream size and the chunk sizes we could derive the
205
depth of the tree and thus the type of the topmost block, but it is convenient
206
to allow trees that are deeper than necessary.)
207
This information is kept in a 40-byte structure called a
208
.CW VtEntry :
209
.P1
210
VtEntry:
211
.ta +\w'    'u +\w'            'u
212
	gen[4]	\fRgeneration number\fP
213
	psize[2]	\fRsize of pointer blocks\fP
214
	dsize[2]	\fRsize of data blocks\fP
215
	flags[1]
216
	zero[5]
217
	size[6]	\fRlength of file\fP
218
	score[20]	\fRscore of root block in tree\fP
219
.P2
220
(In this notation,
221
.CW name[sz]
222
indicates a
223
.CW sz -byte
224
field called
225
.CW name .
226
Integers are stored in big-endian order.
227
.CW Size
228
really is a 48-bit field.)
229
.CW Flags
230
is made up of the following bit fields.
231
.P1
232
.ta +\w'      'u +\w'                      'u
233
0x01	VtEntryActive	\fRentry is allocated\fP
234
0x02	VtEntryDir	\fRentry describes a Venti directory (q.v.)\fP
235
0x1C	VtEntryDepthMask	\fRmask for tree depth\fP
236
0x20	VtEntryLocal	\fRreserved (q.v.)\fP
237
.P2
238
.LP
239
The depth of the described tree is stored in the 3 bits indicated:
240
a tree with a topmost node of type
241
.CW VtPointerType3
242
has depth 4.
243
.PP
244
With
245
.CW VtEntry
246
we can build more complicated data structures,
247
ones with multiple or nested data streams.
248
A data stream consisting of
249
.CW VtEntry
250
structures is called a Venti directory.
251
It is identical in structure to the Venti data stream
252
we described earlier except that the bottom-level type is
253
.CW VtDirType ,
254
and
255
the
256
.CW VtEntry
257
describing a Venti directory has the
258
.CW VtEntryDir
259
flag bit set.
260
The
261
.I dsize
262
for a Venti directory
263
is a multiple of 40 so that each data chunk holds
264
an integer number of
265
.CW VtEntry
266
structures.
267
By analogy with Venti directories,
268
we call the original data stream a
269
Venti file.
270
Note that Venti files are assumed
271
.I not
272
to contain pointers to other Venti blocks.
273
The only pointers to Venti blocks occur in 
274
.CW VtEntry
275
structures in
276
Venti directories
277
(and in the internal hash tree structure of the
278
individual files and directories).
279
Note also that these directories are nothing more than pointer lists.
280
In particular, there are no names or metadata like in a file system.
281
.PP
282
To make it easier to pass hierarchies between applications,
283
the root of a hierarchy is described in a 300-byte structure
284
called a
285
.CW VtRoot :
286
.P1
287
VtRoot:
288
.ta +\w'    'u +\w'                'u
289
	version[2]	\f(CW2\fP
290
	name[128]	\fRname of structure (just a comment)\fP
291
	type[128]	\fRstring describing structure (\f(CWvac\fR)\f(CW
292
	score[20]	\fRpointer to \f(CWVtDirType\fP block\f(CW
293
	blockSize[2]	\fRmaximum block size in structure\fP
294
	prev[20]	\fRprevious \f(CWVtRoot\fP in chain, if any\fP
295
.P2
296
.LP
297
This structure is stored to Venti and its score is passed
298
between applications, typically in the form
299
``\fItype\f(CW:\fIrootscore\fR,''
300
where
301
.I type
302
is the type field from the
303
.CW VtRoot
304
structure, and
305
.I rootscore
306
is the score of the
307
.CW VtRoot
308
block.
309
.CW VtRoot
310
structures can be chained together using the
311
.I prev
312
field to encode an archival history
313
of the data structure.
314
.PP
315
For example, a small Venti hierarchy might look like:
316
.PS
317
.ps 8
318
.vs 10
319
boxwid=0.1
320
boxht=0.1
321
f=0.9
322
mb=0.16
323
 
324
VtRoot: [
325
	right
326
	B1: box
327
	move right 0.1
328
	"\f(CWVtRoot\fP" ljust
329
]
330
 
331
Root: [
332
	right
333
	B1: box fill f
334
	B2: box fill f
335
	B3: box fill f
336
	move right 0.1
337
] with .nw at VtRoot.sw+(0.2,-.1)
338
Level1: [
339
	RootMeta: [
340
		box wid mb
341
	]
342
	MetaSource: [
343
		right
344
		B1: box wid 5*mb
345
	] with .nw at RootMeta.sw+(0,-.1)
346
 
347
	Source: [
348
		right
349
		B1: box fill f
350
		B2: box fill f
351
		B3: box fill f
352
		B4: box fill f
353
		B5: box fill f
354
		B6: box fill f
355
		B7: box fill f
356
		B8: box fill f
357
	] with .nw at MetaSource.sw+(0,-.1)
358
	SB1: box invis at Source.B1
359
	SB2: box invis at Source.B2
360
	SB3: box invis at Source.B3
361
] with .nw at Root.sw+(0.4,-.1)
362
Level2: [
363
	MetaSource: [
364
		right
365
		B1: box wid 5*mb
366
	] 
367
	Source: [
368
		right
369
		B1: box fill f
370
		B2: box fill f
371
		B3: box fill f
372
		B4: box fill f
373
		B5: box fill f
374
		B6: box fill f
375
		B7: box fill f
376
		B8: box fill f
377
	] with .nw at MetaSource.sw+(0,-.1)
378
	File: box wid 0.8 with .nw at Source.sw+(0,-.1)
379
] with .nw at Level1.sw+(0.6,-.1)
380
 
381
line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
382
line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
383
line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
384
line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
385
 
386
line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
387
line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
388
line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
389
 
390
[
391
	KEY: box wid 1.5 invis "Key"
392
	line from KEY.sw to KEY.se
393
	k = -.1
394
	kk=0.5
395
	A: [
396
		box wid 4*boxwid
397
		"Venti file" ljust with .w at last box .w+(kk,0)
398
	] with .nw at KEY.sw+(0,2*k)
399
	B: [
400
		box fill f
401
		"Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
402
	] with .nw at A.sw+(0,k)
403
	C: [
404
		right
405
		CC: box fill f
406
		box fill f
407
		box fill f
408
		box fill f
409
		"Venti directory" ljust with .w at CC.w+(kk,0)
410
	] with .nw at B.sw+(0,k)
411
	D: [
412
		line -> right 3*boxwid
413
		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
414
	] with .nw at C.sw+(0,k)
415
] with .nw at VtRoot.nw+(3,0)
416
.PE
417
.LP
418
Venti files are shown as white boxes, while directories are shown
419
as shaded boxes.  Each shaded square represents a
420
.CW VtEntry .
421
Arrows represent pointers from
422
.CW VtEntry
423
structures to other
424
Venti files or directories.
425
.PP
426
The hierarchical structure provided by Venti files and directories
427
can be used as the base for more complicated data structures.
428
Because this structure captures all the information
429
about pointers to other blocks, tools written to traverse
430
Venti hierarchies can traverse the more complicated
431
data structures as well.
432
For example,
433
.I venti/copy
434
(see
435
.I venti (1))
436
copies a Venti hierarchy from one Venti server to another,
437
given the root
438
.CW VtEntry .
439
Because the traditional file system described in later sections is
440
layered on a Venti hierarchy, 
441
.I venti/copy
442
can copy it without fully understanding its structure.
443
.NH 1
444
Vac file system format
445
.HP
446
The Venti archive format
447
.I vac
448
builds a traditional file system using a Venti hierarchy.
449
Each vac file is implemented as a Venti file;
450
each vac directory is implemented as a Venti
451
directory and a Venti file to provide traditional file system metadata.
452
The metadata is stored in a structure called a
453
.CW DirEntry :
454
.P1
455
DirEntry:
456
.ta +\w'    'u +\w'            'u
457
	magic[4]	\f(CW0x1c4d9072\fP (DirMagic)\fP
458
	version[2]	\f(CW9\fP
459
	elem[s]	\fRname (final path element only)\fP
460
	entry[4]	\fRentry number for Venti file or directory\fP
461
	gen[4]	\fRgeneration number\fP
462
	mentry[4]	\fRentry number for Venti file holding metadata\fP
463
	mgen[4]	\fRgeneration number\fP
464
	qid[8]	\fRunique file serial number\fP
465
	uid[s]	\fRowner\fP
466
	gid[s]	\fRgroup\fP
467
	mid[s]	\fRlast modified by\fP
468
	mtime[4]	\fRlast modification time\fP
469
	ctime[4]	\fRcreation time\fP
470
	atime[4]	\fRlast access time\fP
471
	mode[4]	\fRmode bits\fP
472
.P2
473
The notation
474
.CW name[s]
475
denotes a string stored as a two-byte length
476
and then that many bytes.
477
The above describes Version 9 of the 
478
.CW DirEntry
479
format.  Versions 7 and 8 are very similar; they can be
480
read by the current
481
.I vac
482
source code but are not written.
483
Earlier versions were not widespread.
484
A
485
.CW DirEntry
486
may be followed by optional extension sections, though none
487
are currently used.
488
The
489
.CW mode
490
bits include bits commonly used by
491
Unix and Windows, in addition to those used by Plan 9.
492
.PP
493
The
494
.CW entry
495
field is an index into the parallel Venti directory.
496
The
497
.CW gen
498
field must match the
499
.CW gen 
500
field in the corresponding
501
.CW VtEntry
502
in the directory;
503
it is used to detect
504
stale indices.
505
Similarly,
506
.CW mentry
507
and
508
.CW mgen
509
are the index and generation number
510
for the metadata Venti file,
511
if the
512
.CW DirEntry
513
describes a vac directory.
514
.PP
515
The relation between Venti files and directories and
516
vac files and directories can be seen in this figure:
517
.PS
518
.ps 8
519
.vs 10
520
boxwid=0.1
521
boxht=0.1
522
f=0.9
523
mb=0.16
524
 
525
VtRoot: [
526
	right
527
	B1: box
528
	move right 0.1
529
	"\f(CWVtRoot\fP" ljust
530
]
531
 
532
SuperRoot: [
533
	right
534
	B1: box fill f
535
	move right 0.1
536
	"fs root block" ljust
537
] with .nw at VtRoot.sw + (0.2, -.2)
538
Root: [
539
	right
540
	B1: box fill f
541
	B2: box fill f
542
	B3: box fill f
543
	move right 0.1
544
	"root directory info block" ljust
545
] with .nw at SuperRoot.sw+(0.2, -.2)
546
Level1: [
547
	RootMeta: [
548
		box wid mb
549
		move right 0.1
550
		"root metadata" ljust
551
	]
552
	MetaSource: [
553
		right
554
		B1: box wid mb
555
		B2: box wid mb
556
		B3: box wid mb
557
		B4: box wid mb
558
		B5: box wid mb
559
	] with .nw at RootMeta.sw+(0,-.2)
560
	MB1: box wid mb invis at MetaSource.B1
561
	MB2: box wid mb invis at MetaSource.B2
562
	MB3: box wid mb invis at MetaSource.B3
563
	MB4: box wid mb invis at MetaSource.B4
564
	MB5: box wid mb invis at MetaSource.B5
565
 
566
	Source: [
567
		right
568
		B1: box fill f
569
		B2: box fill f
570
		B3: box fill f
571
		B4: box fill f
572
		B5: box fill f
573
		B6: box fill f
574
		B7: box fill f
575
		B8: box fill f
576
	] with .nw at MetaSource.sw+(0,-.1)
577
	SB1: box invis at Source.B1
578
	SB2: box invis at Source.B2
579
	SB3: box invis at Source.B3
580
	SB4: box invis at Source.B4
581
	SB5: box invis at Source.B5
582
	SB6: box invis at Source.B6
583
	SB7: box invis at Source.B7
584
	SB8: box invis at Source.B8
585
] with .nw at Root.sw+(0.4,-.2)
586
Level2: [
587
	MetaSource: [
588
		right
589
		B1: box wid mb
590
		B2: box wid mb
591
		B3: box wid mb
592
		B4: box wid mb
593
		B5: box wid mb
594
	] 
595
	Source: [
596
		right
597
		B1: box fill f
598
		B2: box fill f
599
		B3: box fill f
600
		B4: box fill f
601
		B5: box fill f
602
		B6: box fill f
603
		B7: box fill f
604
		B8: box fill f
605
	] with .nw at MetaSource.sw+(0,-.1)
606
	File: box wid 0.8 with .nw at Source.sw+(0,-.2)
607
] with .nw at Level1.sw+(0.6,-.2)
608
 
609
line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
610
line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
611
line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
612
line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
613
line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
614
 
615
line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
616
line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
617
line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
618
 
619
arrowwid = arrowwid/2
620
arrowht = arrowht/2
621
line -> from Level1.MB1 to Level1.SB1.n
622
line -> from Level1.MB2 to Level1.SB2.n
623
line -> from Level1.MB2 to Level1.SB3.n
624
line -> from Level1.MB4 to Level1.SB7.n
625
line -> from Level1.MB5 to Level1.SB5.n
626
arrowwid = arrowwid * 2
627
arrowht = arrowht * 2
628
 
629
box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
630
box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
631
box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
632
 
633
[
634
	KEY: box wid 1.5 invis "Key"
635
	line from KEY.sw to KEY.se
636
	k = -.1
637
	kk=0.5
638
	A: [
639
		box wid 4*boxwid
640
		"Venti file" ljust with .w at last box .w+(kk,0)
641
	] with .nw at KEY.sw+(0,2*k)
642
	B: [
643
		box fill f
644
		"Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
645
	] with .nw at A.sw+(0,k)
646
	C: [
647
		right
648
		CC: box fill f
649
		box fill f
650
		box fill f
651
		box fill f
652
		"Venti directory" ljust with .w at CC.w+(kk,0)
653
	] with .nw at B.sw+(0,k)
654
	D: [
655
		line -> right 3*boxwid
656
		"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
657
	] with .nw at C.sw+(0,k)
658
	DD: [
659
		box dotted wid 4*boxwid
660
		"Vac file" ljust with .w at last box .w+(kk,0)
661
	] with .nw at D.sw+(0,k)
662
	E: [
663
		box wid mb
664
		"Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
665
	] with .nw at DD.sw+(0,k)
666
	G: [
667
		box dashed wid 4*boxwid
668
		"Vac directory" ljust with .w at last box .w+(kk,0)
669
	] with .nw at E.sw+(0,k)
670
	H: [
671
		arrowwid = arrowwid/2
672
		arrowht = arrowht/2
673
		line -> right 1.5*boxwid
674
		"Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
675
		arrowwid = arrowwid * 2
676
		arrowht = arrowht * 2
677
	] with .nw at G.sw+(0,k)
678
] with .nw at VtRoot.nw+(3,0)
679
.PE
680
.LP
681
In reality, the story is slightly more complicated.
682
The metadata file in a Vac directory
683
is not just the concatenation of
684
.CW DirEntry
685
structures.
686
Instead, it is the concatenation of
687
.CW MetaBlocks .
688
A
689
.CW MetaBlock
690
contains some number of
691
.CW DirEntry
692
structures along with a sorted index to make it easy
693
to look for a particular
694
.CW DirEntry
695
by its
696
.CW elem 
697
field.
698
The details are in the source code.
699
.PP
700
As shown in the diagram,
701
the root directory of the file system is summarized by
702
three
703
.CW VtEntry
704
structures describing
705
the Venti directory for the children of the root,
706
the Venti file for the metadata describing the children of the root,
707
and a Venti file holding metadata for the root directory itself.
708
These
709
.CW VtEntry
710
structures are placed in a Venti directory of their own,
711
described by the single 
712
.CW VtEntry
713
in the
714
root block.
715
.NH 1
716
Fossil file system format
717
.HP
718
Fossil uses the vac format, with some small changes.
719
The changes only affect the data on the local disk; the data
720
archived to Venti is exactly in vac format.
721
.PP
722
Blocks stored on local disk may contain scores pointing at local disk
723
blocks or at Venti blocks. 
724
Local block addresses are stored as 20-byte scores in which the first 16 bytes
725
are all zero and the last 4 bytes specify a block number in the disk.
726
Before a block is archived, all the
727
blocks it points to must be archived, and the local scores in the block
728
must be changed to Venti scores.
729
Using block addresses rather than content hashes for local data
730
makes the local file system easier to manage: if a local block's contents
731
change, the pointer to the block does not need to change.
732
.NH 2
733
Snapshots
734
.HP
735
Fossil is an archival file server.
736
It takes periodic snapshots of the file system,
737
which are made accessible through the file system.
738
Specifically, the active file system is presented in
739
.CW /active .
740
Ephemeral snapshots (those that are kept on local disk and eventually deleted)
741
are presented in
742
\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
743
where
744
.I yyyy
745
is the full year,
746
.I mm
747
is the month number,
748
.I dd
749
is the day number,
750
.I hh
751
is the hour,
752
and
753
.I mm
754
is the minute.
755
Archival snapshots (those that are archived to Venti and persist forever)
756
are presented in
757
\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
758
where
759
.I yyyy ,
760
.I mm ,
761
and
762
.I dd
763
are year, month, and day as before,
764
and
765
.I s
766
is a sequence number if more than one
767
archival snapshot is done in a day.
768
For the first snapshot,
769
.I s
770
is null.
771
For the subsequent snapshots,
772
.I s
773
is
774
.CW .1 ,
775
.CW .2 ,
776
.CW .3 ,
777
etc.
778
.PP
779
To implement the snapshots, the file server maintains a
780
current
781
.I epoch
782
for the active file system.
783
Each local block has a label that records, among other things,
784
the epoch in which the block was allocated.
785
If a block was allocated in an epoch earlier than the current one,
786
it is immutable and treated as copy-on-write.
787
Taking a snapshot can be accomplished by
788
recording the address of the current root block and then 
789
incrementing the epoch number.
790
Notice that the copy-on-write method makes
791
snapshots both time efficient and space efficient.
792
The only time cost is waiting for all current file system
793
requests to finish and then incrementing a counter.
794
After a snapshot, blocks only get copied when they are
795
next modified, so the per-snapshot
796
space requirement is proportional
797
to the amount of new data rather than the total
798
size of the file system.
799
.PP
800
The blocks in the archival snapshots are moved to Venti,
801
but the blocks in the ephemeral snapshots take up space
802
in the local disk file.
803
To allow reclamation of this disk space, the file system
804
maintains a 
805
.I low
806
.I epoch ,
807
which is the epoch of the earliest ephemeral snapshot
808
still available.
809
Fossil only allows access to snapshots with epoch numbers
810
between the 
811
low epoch and the current epoch
812
(also called the high epoch).
813
Incrementing the low epoch thus makes old
814
snapshots inaccessible.
815
The space required to store those snapshots can then
816
be reclaimed, as described below.
817
.NH 2
818
Local blocks
819
.HP
820
The bulk of the local disk file is the local blocks.
821
Each block has a 14-byte label associated with it, of the format:
822
.P1
823
Label:
824
.ta +\w'    'u +\w'                'u
825
	state[1]	\fRblock state\fP
826
	type[1]	\fRblock type\fP
827
	epoch[4]	\fRallocation epoch\fP
828
	epochClose[4]	\fRclose epoch\fP
829
	tag[4]	\fRrandom tag\fP
830
.P2
831
.LP
832
The
833
.CW type
834
is an analogue of the block types described earlier,
835
though different names are used, to distinguish between
836
pointers blocks in a hash tree for a data stream
837
and pointer blocks for a directory stream.
838
The
839
.CW epoch
840
was mentioned in the last section.
841
The other fields are explained below.
842
.PP
843
There are two distinguished blocks states
844
.CW BsFree
845
.CW 0x00 ) (
846
and
847
.CW BsBad
848
.CW 0xFF ), (
849
which mark blocks that are available for allocation
850
and blocks that are bad and should be avoided.
851
If
852
.CW state
853
is not one of these values, it is a bitwise
854
.I or ' `
855
of the following flags:
856
.P1
857
.ta +\w'      'u +\w'                'u
858
0x01	BsAlloc	\fRblock is in use\fP
859
0x02	BsCopied	\fRblock has been copied\fP
860
0x04	BsVenti	\fRblock has been stored on Venti\fP
861
0x08	BsClosed	\fRblock has been unlinked from active file system\fP
862
.P2
863
.LP
864
The flags are explained as they arise in the discussions below.
865
.PP
866
It is convenient to store some extra fields in the
867
.CW VtEntry
868
structure when it describes a Venti file or directory
869
stored on local disk.
870
Specifically, we set the
871
.CW VtEntryLocal
872
flag bit
873
and then use the bytes 7-16 of the score (which would
874
otherwise be zero, since it is a local score) to hold these fields:
875
.P1
876
.ta +\w'    'u +\w'                'u
877
	archive[1]	\fRboolean: this is an archival snapshot\fP
878
	snap[4]	\fRepoch number if root of snapshot\fP
879
	tag[4]	\fRrandom tag\fP
880
.P2
881
.LP
882
The extended
883
.CW VtEntry
884
structure is called an
885
.CW Entry .
886
The
887
.CW tag
888
field
889
in the
890
.CW Label
891
and the
892
.CW Entry
893
is used to identify dangling pointers or other file system corruption:
894
all the local blocks in a hash tree must
895
have tags matching the tag in the
896
.CW Entry .
897
If this
898
.CW Entry
899
points at the root of a snapshot,
900
the
901
.CW snap
902
field is the epoch of the snapshot.
903
If the snapshot is intended to be archived to Venti,
904
the
905
.CW archive
906
field is non-zero.
907
.NH 2
908
Block reclamation
909
.HP
910
The blocks in the active file system form a tree: each
911
block has only one parent.
912
Once a copy-on-write block 
913
.I b
914
is replaced by its copy, it is no longer
915
needed by the active file system.
916
At this point,
917
.I b
918
is unlinked from the active file system.
919
We say that
920
.I b
921
is now
922
.I closed :
923
it is needed only for snapshots.
924
When a block is closed, the
925
.CW BsClosed
926
bit is set in its state, and the current epoch (called the block's closing epoch)
927
is stored in the
928
.CW epochClose
929
label field.
930
(Open blocks have an
931
.CW epochClose
932
of
933
.CW ~0 ).
934
.PP
935
A block is referenced by snapshots with epochs
936
between the block's allocation epoch and its closing epoch.
937
Once the file system's low epoch grows to be greater than or equal to the block's
938
closing epoch, the block is no longer needed for any snapshots
939
and can be reused.
940
.PP
941
In a typical configuration, where nightly archival snapshots
942
are taken and written to Venti, it is desirable to reclaim
943
the space occupied by now-archived blocks if possible.
944
To do this, Fossil keeps track of whether the pointers
945
in each block are unique to that block.
946
When a block
947
.I bb
948
is allocated, a pointer to
949
.I bb
950
is written into exactly one active block (say,
951
.I b ).
952
In the absence of snapshots, the pointer to
953
.I bb
954
will remain unique to
955
.I b ,
956
so that if the pointer is zeroed,
957
.I bb
958
can be immediately reused.
959
Snapshots complicate this invariant:
960
when
961
.I b
962
is copied-on-write, all its pointers
963
are no longer unique to it.
964
At time of the copy, the
965
.CW BsCopied
966
state bit in the block's label
967
is set to note the duplication of the pointers contained within.
968
.NH 2
969
Disk layout
970
.HP
971
The file system header describes the file system layout and has this format:
972
.P1
973
.ta +\w'    'u +\w'                'u
974
Header:
975
	magic[4]	\fR0x3776AE89 (HeaderMagic)\fP
976
	version[2]	\fR1 (HeaderVersion)\fP
977
	blockSize[2]	\fIfile system block size\fP
978
	super[4]	\fRblock offset of super block\fP
979
	label[4]	\fRblock offset of labels\fP
980
	data[4]	\fRdata blocks\fP
981
	end[4]	\fRend of file system\fP
982
.P2
983
.LP
984
The corresponding file system layout is:
985
.PS
986
.ps 8
987
.vs 9
988
boxwid=0.75
989
boxht=0.15
990
Empty: box "empty" ht 0.25
991
Header: box "header" with .n at Empty.s
992
Empty2: box "empty" with .n at Header.s
993
Super: box "super block" with .n at Empty2.s
994
Label: box "label" "blocks" with .n at Super.s ht 0.25
995
Data: box "data" "blocks" with .n at Label.s ht 0.3
996
"  0" ljust at Empty.ne
997
"  128kB" ljust at Header.ne
998
"  \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
999
"  \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
1000
"  \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
1001
"  \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
1002
"" at (-1,0)
1003
"" at (6,0)
1004
.PE
1005
.LP
1006
The numbers to the right of the blocks are byte offsets
1007
of the boundaries.
1008
.LP
1009
The super block describes the file system itself and looks like:
1010
.P1
1011
.ta +\w'    'u +\w'                'u
1012
Super:
1013
	magic[4]	\fR0x2340A3B1 (SuperMagic)\fP
1014
	version[2]	\fR1 (SuperVersion)\fP
1015
	epochLow[4]	\fRfile system low epoch\fP
1016
	epochHigh[4]	\fRfile system high (active) epoch\fP
1017
	qid[8]	\fRnext qid to allocate\fP
1018
	active[4]	\fRdata block number: root of active file system\fP
1019
	next[4]	\fRdata block number: root of next file system to archive\fP
1020
	current[4]	\fRdata block number: root of file system currently being archived\fP
1021
	last[20]	\fRVenti score of last successful archive\fP
1022
	name[128]	\fRname of file system (just a comment)\fP
1023
.P2
1024
.LP
1025
.NH 1
1026
Fossil server
1027
.HP
1028
The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
1029
.NH 2
1030
Process structure
1031
.PP
1032
The file server is structured as a set of processes synchronizing
1033
mostly through message passing along queues.
1034
The processes are given names, which can be seen in the output of
1035
.CW ps 
1036
.CW -a .
1037
.PP
1038
.CW Listen
1039
processes announce on various network addresses.
1040
A
1041
.CW con
1042
process handles each incoming connection, reading 9P requests
1043
and adding them to a central message queue.
1044
.CW Msg
1045
processes remove 9P requests from the queue,
1046
handle them, and write the responses to the appropriate
1047
file descriptors.
1048
.PP
1049
The
1050
.CW disk
1051
process handles disk I/O requests made by the other processes.
1052
The
1053
.CW flush
1054
process writes dirty blocks from the in-memory block cache to disk.
1055
The
1056
.CW unlink
1057
process frees previously linked blocks once the blocks that point at them
1058
have been written to disk.
1059
.PP
1060
A
1061
.CW consI
1062
reads from each console file (typically a pipe posted in
1063
.CW /srv ),
1064
adding the typed characters to the input queue.
1065
The
1066
.CW cons
1067
process echoes input and runs the commands, saving
1068
output in a ring buffer.
1069
Because there is only one
1070
.CW cons
1071
process, only one console command may be executing at a time.
1072
A
1073
.CW consO
1074
process copies this ring buffer to each console file.
1075
.PP
1076
The
1077
.CW periodic
1078
process runs periodic events, like
1079
flushing the root metadata to disk or
1080
taking snapshots of the file system.
1081
.NH 2
1082
Block cache
1083
.HP
1084
Fossil maintains an in-memory block cache which 
1085
holds both local disk blocks and Venti blocks.
1086
Cache eviction follows a least recently used policy.
1087
Dirty blocks are restricted to at most half the cache.
1088
This can be changed by editing
1089
.CW DirtyPercentage
1090
in 
1091
.CW dat.h .
1092
.PP
1093
The block cache uses soft updates [1] to ensure that the on-disk
1094
file system is always self-consistent.
1095
Thus there is no
1096
.I halt
1097
console command
1098
and no need to check a file system 
1099
that was shut down without halting.
1100
.NH 2
1101
Archiving
1102
.HP
1103
A background process writes blocks in archival snapshots to Venti.
1104
Although
1105
.CW /archive/\fIyyyy\fP/\fImmdds\fR
1106
is a copy of only
1107
.CW /active
1108
at the time of the snapshot,
1109
the archival process archives the
1110
entire file tree rather than just
1111
the subtree rooted at
1112
.CW /active .
1113
The snapshots
1114
.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
1115
are stored as empty directories.
1116
Once all the blocks have been archived,
1117
a 
1118
.CW VtRoot
1119
header for the file system is archived.
1120
The score of that header is recorded in
1121
.CW super.score
1122
and also printed on the file server console.
1123
The score can used by
1124
.I flfmt
1125
to restore a file system (see
1126
.I fossil (4)).
1127
.NH 2
1128
Contrast with the old file server
1129
.HP
1130
The most obvious difference between Fossil and the 
1131
old Plan 9 file server [2] is that Fossil uses a Venti server as 
1132
its archival storage in place of a WORM juke box.
1133
There are a few other architectural differences to be 
1134
aware of.
1135
.PP
1136
Fossil is a user-level program run on a standard kernel.
1137
.PP
1138
Fossil does not have any way to concatenate, stripe, or
1139
mirror disk files.  For functionality similar to the old file server's
1140
configuration strings, use the experimental file stack device 
1141
(see
1142
.I fs (3)).
1143
.PP
1144
Fossil speaks only 9P2000.  Old 9P (aka 9P1) is not supported.
1145
.PP
1146
... XXX words about converting an old file system to fossil?
1147
.NH 1
1148
References
1149
.LP
1150
[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
1151
and Yale N. Patt.
1152
``Soft Updates: A Solution to the Metadata Update Problem
1153
in File Systems,''
1154
.I "ACM Transactions on Computer Systems" ,
1155
Vol 18., No. 2, May 2000, pp. 127\-153.
1156
.LP
1157
[2] Sean Quinlan, ``A Cached WORM File System,''
1158
.I "Software\(emPractice and Experience" ,
1159
Vol 21., No 12., December 1991, pp. 1289\-1299.
1160
.LP
1161
[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
1162
.I "Usenix Conference on File and Storage Technologies" ,
1163
2002.