.TH VENTI 8
.SH NAME
venti \- archival storage server
.SH SYNOPSIS
.in +0.25i
.ti -0.25i
.B venti/venti
[
.B -Ldrs
]
[
.B -a
.I address
]
[
.B -B
.I blockcachesize
]
[
.B -c
.I config
]
[
.B -C
.I lumpcachesize
]
[
.B -h
.I httpaddress
]
[
.B -I
.I indexcachesize
]
[
.B -m
.I free-memory-percent
]
[
.B -W
.I webroot
]
.SH DESCRIPTION
.I Venti
is a SHA1-addressed archival storage server.
See
.IR venti (6)
for a full introduction to the system.
This page documents the structure and operation of the server.
.PP
A venti server requires multiple disks or disk partitions,
each of which must be properly formatted before the server
can be run.
.SS Disk
The venti server maintains three disk structures, typically
stored on raw disk partitions:
the append-only
.IR "data log" ,
which holds, in sequential order,
the contents of every block written to the server;
the
.IR index ,
which helps locate a block in the data log given its score;
and optionally the
.IR "bloom filter" ,
a concise summary of which scores are present in the index.
The data log is the primary storage.
To improve robustness, it should be stored on
a device that provides RAID functionality.
The index and the bloom filter are optimizations
employed to access the data log efficiently and can be rebuilt
if lost or damaged.
.PP
The data log is logically split into sections called
.IR arenas ,
typically sized for easy offline backup
(e.g., 500MB).
A data log may comprise many disks, each storing
one or more arenas.
Such disks are called
.IR "arena partitions" .
Arena partitions are filled in the order given in the configuration.
.PP
The index is logically split into block-sized pieces called
.IR buckets ,
each of which is responsible for a particular range of scores.
An index may be split across many disks, each storing many buckets.
Such disks are called
.IR "index sections" .
.PP
The index must be sized so that no bucket is full.
When a bucket fills, the server must be shut down and
the index made larger.
Since scores appear random, each bucket will contain
approximately the same number of entries.
Index entries are 40 bytes long.  Assuming that a typical block
being written to the server is 8192 bytes and compresses to 4096
bytes, the active index is expected to be about 1% of
the active data log.
Storing smaller blocks increases the relative index footprint;
storing larger blocks decreases it.
To allow for variation in both block size and the random distribution
of scores to buckets, the suggested index size is 5% of
the active data log.
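.PP
As a rough worked instance of the sizing rule above
(the 1TB data log is an illustrative assumption, not a recommendation):
.IP
.EX
entry size / compressed block size = 40 / 4096, or about 1%
active data log                    = 1TB
expected active index (1%)         = about 10GB
suggested index size (5%)          = about 50GB
.EE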
.PP
The (optional) bloom filter is a large bitmap that is stored on disk but
also kept completely in memory while the venti server runs.
It helps the venti server efficiently detect scores that are
.I not
already stored in the index.
The bloom filter starts out zeroed.
Each score recorded in the bloom filter is hashed to choose
.I nhash
bits to set in the bloom filter.
A score is definitely not stored in the index if any of its
.I nhash
bits are not set.
The bloom filter thus has two parameters:
.I nhash
(maximum 32)
and the total bitmap size
(maximum 512MB, 2\s-2\u32\d\s+2 bits).
.PP
The bloom filter should be sized so that
.I nhash
\(mu
.I nblock
\(<=
0.7 \(mu
.IR b ,
where
.I nblock
is the expected number of blocks stored on the server
and
.I b
is the bitmap size in bits.
The false positive rate of the bloom filter when sized
this way is approximately 2\s-2\u\-\fInhash\fR\d\s+2.
Values of
.I nhash
less than 10 are not very useful;
values greater than 24 are probably a waste of memory.
.I Fmtbloom
(see
.IR venti-fmt (8))
can be given either
.I nhash
or
.IR nblock ;
if given
.IR nblock ,
it will derive an appropriate
.IR nhash .
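.PP
A worked instance of the sizing rule, assuming the filter is
formatted at the maximum size and that
.I nhash
is chosen as 16 (both values are illustrative):
.IP
.EX
b      = 2^32 bits (512MB)
nhash  = 16
nblock \(<= 0.7 \(mu 2^32 / 16, or about 188 million blocks
.EE
.PP
At 8192 bytes per block before compression, 188 million blocks
correspond to roughly 1.5TB of data written to the server.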
.SS Memory
Venti can make effective use of large amounts of memory
for various caches.
.PP
The
.I "lump cache
holds recently-accessed venti data blocks, which the server refers to as
.IR lumps .
The lump cache should be at least 1MB but can profitably be much larger.
The lump cache can be thought of as the level-1 cache:
read requests handled by the lump cache can
be served instantly.
.PP
The
.I "block cache
holds recently-accessed
.I disk
blocks from the arena partitions.
The block cache needs to be able to simultaneously hold two blocks
from each arena plus four blocks for the currently-filling arena.
The block cache can be thought of as the level-2 cache:
read requests handled by the block cache are slower than those
handled by the lump cache, since the lump data must be extracted
from the raw disk blocks and possibly decompressed, but no
disk accesses are necessary.
.PP
The
.I "index cache
holds recently-accessed or prefetched
index entries.
The index cache needs to be able to hold index entries
for three or four arenas, at least, in order for prefetching
to work properly.  Each index entry is 50 bytes.
Assuming 500MB arenas of
128,000 blocks that are 4096 bytes each after compression,
each arena's entries occupy about 6.4MB, so holding three or four
arenas requires an index cache of about 19MB to 26MB.
The index cache can be thought of as the level-3 cache:
read requests handled by the index cache must still go
to disk to fetch the arena blocks, but the costly random
access to the index is avoided.
.PP
The size of the index cache determines how long venti
can sustain its `burst' write throughput, during which time
the only disk accesses on the critical path
are sequential writes to the arena partitions.
For example, if you want to be able to sustain 10MB/s
for an hour, you need enough index cache to hold entries
for 36GB of blocks.  Assuming 8192-byte blocks,
you need room for almost five million index entries.
Since index entries are 50 bytes each, you need 250MB
of index cache.
If the background index update process can make a single
pass through the index in an hour, which is possible,
then you can sustain the 10MB/s indefinitely (at least until
the arenas are all filled).
.PP
The
.I "bloom filter
requires memory equal to its size on disk,
as discussed above.
.PP
A reasonable starting allocation is to
divide memory equally (in thirds) between
the bloom filter, the index cache, and the lump and block caches;
the third of memory allocated to the lump and block caches
should be split unevenly, with more (say, two thirds)
going to the block cache.
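.PP
As a sketch of that split, assume 1536MB is set aside for venti and
the bloom filter was formatted at 512MB (both figures are
illustrative assumptions).
The remaining two thirds could then be expressed with the
.BR mem ,
.BR bcmem ,
and
.B icmem
configuration parameters described under
.B "Configuration File"
below:
.IP
.EX
# 512MB for the index cache
icmem 512m
# 512MB shared by the lump and block caches, split about 1:2
mem 170m
bcmem 342m
.EE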
.SS Network
The venti server announces two network services, one
(conventionally TCP port
.BR venti ,
17034) serving
the venti protocol as described in
.IR venti (6),
and one serving HTTP
(conventionally TCP port
.BR http ,
80).
.PP
The venti web server provides the following
URLs for accessing status information:
.TF "\fL/storage"
.PD
.TP
.B /index
A summary of the usage of the arenas and index sections.
.TP
.B /xindex
An XML version of
.BR /index .
.TP
.B /storage
Brief storage totals.
.TP
.BI /set/ variable
The current integer value of
.IR variable .
Variables are:
.BR compress ,
whether or not to compress blocks
(for debugging);
.BR logging ,
whether to write entries to the debugging logs;
.BR stats ,
whether to collect run-time statistics;
.BR icachesleeptime ,
the time in milliseconds between successive updates
of megabytes of the index cache;
.BR arenasumsleeptime ,
the time in milliseconds between reads while
checksumming an arena in the background.
The two sleep times should be (but are not) managed by venti;
they exist to provide more experience with their effects.
The other variables exist only for debugging and
performance measurement.
.TP
.BI /set/ variable / value
Set
.I variable
to
.IR value .
.TP
.BI /graph/ name / param / param / \fR...
A PNG image graphing the named run-time statistic over time.
The details of names and parameters are undocumented;
see
.B httpd.c
in the venti sources.
.TP
.B /log
A list of all debugging logs present in the server's memory.
.TP
.BI /log/ name
The contents of the debugging log with the given
.IR name .
.TP
.B /flushicache
Force venti to begin flushing the index cache to disk.
The request response will not be sent until the flush
has completed.
.TP
.B /flushdcache
Force venti to begin flushing the arena block cache to disk.
The request response will not be sent until the flush
has completed.
.PD
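.PP
For instance, assuming the HTTP service is reachable on the default port
of the host named by
.BR $sysname ,
and assuming that the value 1 means enabled,
the variables and flush requests might be exercised as follows
(output is omitted because its format is not described here):
.IP
.EX
% hget http://$sysname/set/logging
% hget http://$sysname/set/logging/1
% hget http://$sysname/flushicache
.EE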
.PP
Requests for other files are served by consulting a
directory named in the configuration file
(see
.B webroot
below).
.SS Configuration File
A venti configuration file
enumerates the various index sections and
arenas that constitute a venti system.
The components are indicated by the name of the file, typically
a disk partition, in which they reside.  The configuration
file is the only place where file names are used.  Internally,
venti uses the names assigned when the components were formatted
with
.I fmtarenas
or
.I fmtisect
(see
.IR venti-fmt (8)).
In particular, only the configuration needs to be
changed if a component is moved to a different file.
.PP
The configuration file consists of lines in the form described below.
Lines starting with
.B #
are comments.
.TF "\fLindex\fI name "
.PD
.TP
.BI index " name
Names the index for the system.
.TP
.BI arenas " file
.I File
is an arena partition, formatted using
.IR fmtarenas .
.TP
.BI isect " file
.I File
is an index section, formatted using
.IR fmtisect .
.TP
.BI bloom " file
.I File
is a bloom filter, formatted using
.IR fmtbloom .
.PD
.PP
After formatting a venti system using
.IR fmtindex ,
the order of arenas and index sections should not be changed.
Additional arenas can be appended to the configuration;
run
.I fmtindex
with the
.B -a
flag to update the index.
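.PP
As a sketch of adding an arena partition to the configuration shown under
.B EXAMPLE
below, where the file
.B /tmp/disks/arenas1
and the name prefix
.B arenas1.
are hypothetical:
.IP
.EX
% venti/fmtarenas arenas1. /tmp/disks/arenas1
% cat venti.conf
index main
isect /tmp/disks/isect0
isect /tmp/disks/isect1
arenas /tmp/disks/arenas
arenas /tmp/disks/arenas1
bloom /tmp/disks/bloom
% venti/fmtindex -a venti.conf
.EE
.PP
This page does not say whether the server must be stopped while the
index is updated; stopping it first is the cautious assumption.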
.PP
The configuration file also holds configuration parameters
for the venti server itself.
These are:
.TF "\fLhttpaddr\fI netaddr "
.TP
.BI mem " size
lump cache size
.TP
.BI bcmem " size
block cache size
.TP
.BI icmem " size
index cache size
.TP
.BI addr " netaddr
network address to announce venti service
(default
.BR tcp!*!venti )
.TP
.BI httpaddr " netaddr
network address to announce HTTP service
(default
.BR tcp!*!http )
.TP
.B queuewrites
queue writes in memory
(default is not to queue)
.TP
.BI webroot " dir
directory tree containing files for
.IR venti 's
internal HTTP server to consult for unrecognized URLs
.PD
.PP
The units for the various cache sizes above can be specified by appending a
.LR k ,
.LR m ,
or
.LR g
(case-insensitive)
to indicate kilobytes, megabytes, or gigabytes respectively.
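.PP
A configuration fragment using these parameters might look like the
following; the HTTP port number, the directory, and the cache sizes
are hypothetical:
.IP
.EX
addr tcp!*!venti
httpaddr tcp!*!8000
queuewrites
webroot /usr/web/venti
mem 64m
bcmem 128M
icmem 1g
.EE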
.PP
The
.I file
name in the configuration lines above can be of the form
.IB file : lo - hi
to specify a range of the file.
.I Lo
and
.I hi
are specified in bytes but can have the usual
.BR k ,
.BR m ,
or
.B g
suffixes.
Either
.I lo
or
.I hi
may be omitted.
This notation eliminates the need to
partition raw disks on non-Plan 9 systems.
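.PP
For example, a single unpartitioned disk might be divided as follows.
The device name
.B /dev/sdb
and the offsets are hypothetical, and the last line assumes that an omitted
.I hi
means the end of the file:
.IP
.EX
isect /dev/sdb:0-20g
bloom /dev/sdb:20g-21g
arenas /dev/sdb:21g-
.EE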
.SS Command Line
Many of the options to Venti duplicate parameters that
can be specified in the configuration file.
The command line options override those found in a
configuration file.
Additional options are:
.TF "\fL-c\fI config"
.PD
.TP
.BI -c " config
The server configuration file
(default
.BR venti.conf )
.TP
.B -d
Produce various debugging information on standard error.
Implies
.BR -s .
.TP
.B -L
Enable logging.  By default all logging is disabled.
Logging slows server operation considerably.
.TP
.B -m
Allocate
.I free-memory-percent
percent of the available free RAM, and partition it
per the guidelines in the
.B Memory
subsection.
This percentage should be large enough to include the entire bloom filter.
This overrides all other memory sizing parameters,
including those on the command line and in the configuration file.
25% is a reasonable choice.
.TP
.B -r
Allow only read access to the venti data.
.TP
.B -s
Do not run in the background.
Normally,
the foreground process will exit once the Venti server
is initialized and ready for connections.
.PD
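.PP
Two illustrative invocations that use only the options described above
(the configuration file name is the default one):
.IP
.EX
% venti/venti -c venti.conf -m 25
% venti/venti -r -s -c venti.conf
.EE
.PP
The first lets venti size its caches from 25% of free RAM;
the second serves the data read-only and stays in the foreground.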
.SH EXAMPLE
A simple configuration:
.IP
.EX
% cat venti.conf
index main
isect /tmp/disks/isect0
isect /tmp/disks/isect1
arenas /tmp/disks/arenas
bloom /tmp/disks/bloom
%
.EE
.PP
Format the index sections, the arena partition,
the bloom filter, and
finally the main index:
.IP
.EX
% venti/fmtisect isect0. /tmp/disks/isect0
% venti/fmtisect isect1. /tmp/disks/isect1
% venti/fmtarenas arenas0. /tmp/disks/arenas &
% venti/fmtbloom /tmp/disks/bloom &
% wait
% venti/fmtindex venti.conf
%
.EE
.PP
Start the server and check the storage statistics:
.IP
.EX
% venti/venti
% hget http://$sysname/storage
.EE
.SH SOURCE
.B /sys/src/cmd/venti/srv
.SH "SEE ALSO"
.IR venti (1),
.IR venti (2),
.IR venti (6),
.IR venti-backup (8),
.IR venti-fmt (8)
.br
Sean Quinlan and Sean Dorward,
``Venti: a new approach to archival storage'',
.I "Usenix Conference on File and Storage Technologies" ,
2002.
.SH BUGS
Setting up a venti server is too complicated.