.TH VENTI 6
.SH NAME
venti \- archival storage server
.SH DESCRIPTION
Venti is a block storage server intended for archival data.
In a Venti server, the SHA1 hash of a block's contents acts
as the block identifier for read and write operations.
This approach enforces a write-once policy, preventing
accidental or malicious destruction of data.  In addition,
duplicate copies of a block are coalesced, reducing the
consumption of storage and simplifying the implementation
of clients.
.PP
This manual page documents the basic concepts of
block storage using Venti as well as the Venti network protocol.
.PP
.IR Venti (1)
documents some simple clients.
.IR Vac (1)
and
.IR vacfs (4)
are more complex clients.
.PP
.IR Venti (2)
describes a C library interface for accessing
Venti servers and manipulating Venti data structures.
.PP
.IR Venti (8)
describes the programs used to run a Venti server.
.PP
.SS "Scores
The SHA1 hash that identifies a block is called its
.IR score .
The score of the zero-length block is called the
.IR "zero score" .
.PP
Scores may have an optional
.IB label :
prefix, typically used to
describe the format of the data.
For example,
.IR vac (1)
uses a
.B vac:
prefix.
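.PP
As an illustration only, the following Python sketch computes a score
and handles the optional label prefix.
The helper names are hypothetical and are not part of any Venti library;
the sketch assumes the textual form of a score is 40 hexadecimal digits.

```python
import hashlib

SCORE_SIZE = 20  # a SHA1 digest is 20 bytes


def score(block: bytes) -> bytes:
    """Return the score (SHA1 digest) of a block's contents."""
    return hashlib.sha1(block).digest()


# The zero score is the score of the zero-length block.
ZERO_SCORE = score(b"")


def format_score(s: bytes, label: str = "") -> str:
    """Render a score as hex, with an optional 'label:' prefix."""
    hexs = s.hex()
    return f"{label}:{hexs}" if label else hexs


def parse_score(text: str) -> tuple:
    """Split 'label:hex' (or plain 'hex') into (label, raw score)."""
    label, _, hexs = text.rpartition(":")
    raw = bytes.fromhex(hexs)
    if len(raw) != SCORE_SIZE:
        raise ValueError("a score must be 40 hexadecimal digits")
    return label, raw
```

For instance, format_score(ZERO_SCORE, "vac") yields the zero score
with a vac: prefix, and parse_score reverses the operation.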
.SS "Files and Directories
Venti accepts blocks up to 56 kilobytes in size.
By convention, Venti clients use hash trees of blocks to
represent arbitrary-size data
.IR files .
The data to be stored is split into fixed-size
blocks and written to the server, producing a list
of scores.
The resulting list of scores is split into fixed-size pointer
blocks (using only an integral number of scores per block)
and written to the server, producing a smaller list
of scores.
The process continues, eventually ending with the
score for the hash tree's top-most block.
Each file stored this way is summarized by
a
.B VtEntry
structure recording the top-most score, the depth
of the tree, the data block size, and the pointer block size.
One or more
.B VtEntry
structures can be concatenated
and stored as a special file called a
.IR directory .
In this
manner, arbitrary trees of files can be constructed
and stored.
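.PP
The hash tree construction described above can be sketched in Python
as follows.
This is a minimal model under stated assumptions, not the code used by
any Venti client: a dictionary stands in for the server, the function
names and default block sizes are hypothetical, block types and zero
truncation are omitted, and the depth returned simply counts the levels
of pointer blocks built.

```python
import hashlib

SCORE_SIZE = 20  # bytes in a SHA1 score


def write_block(store: dict, data: bytes) -> bytes:
    """Store a block under its score; duplicate blocks coalesce."""
    s = hashlib.sha1(data).digest()
    store[s] = data
    return s


def write_file(store: dict, data: bytes, dsize: int = 8192, psize: int = 8192):
    """Store data as a hash tree and return (top score, pointer levels)."""
    per_block = psize // SCORE_SIZE  # integral number of scores per block
    if dsize < 1 or per_block < 2:
        raise ValueError("block sizes too small to build a tree")
    # Level 0: fixed-size data blocks (an empty file is one empty block).
    scores = [write_block(store, data[i:i + dsize])
              for i in range(0, len(data), dsize)] or [write_block(store, b"")]
    depth = 0
    # Higher levels: pack scores into pointer blocks until one score remains.
    while len(scores) > 1:
        scores = [write_block(store, b"".join(scores[i:i + per_block]))
                  for i in range(0, len(scores), per_block)]
        depth += 1
    return scores[0], depth
```

Because blocks are addressed by content, writing a file made of
repeated identical blocks stores each distinct block only once.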
.PP
Scores passed between programs conventionally refer
to
.B VtRoot
blocks, which contain descriptive information
as well as the score of a directory block containing a small number
of directory entries.
.PP
Conventionally, programs do not mix data and directory entries
in the same file.  Instead, they keep two separate files, one with
directory entries and one with metadata referencing those
entries by position.
Keeping this parallel representation is a minor annoyance
but makes it possible for general programs like
.I venti/copy
(see
.IR venti (1))
to traverse the block tree without knowing the specific details
of any particular program's data.
.SS "Block Types
To allow programs to traverse these structures without
needing to understand their higher-level meanings,
Venti tags each block with a type.  The types are:
.PP
.nf
.ft L
    VtDataType     000  \f1data\fL
    VtDataType+1   001  \fRscores of \fPVtDataType\fR blocks\fL
    VtDataType+2   002  \fRscores of \fPVtDataType+1\fR blocks\fL
    \fR\&...\fL
    VtDirType      010  VtEntry\fR structures\fL
    VtDirType+1    011  \fRscores of \fLVtDirType\fR blocks\fL
    VtDirType+2    012  \fRscores of \fLVtDirType+1\fR blocks\fL
    \fR\&...\fL
    VtRootType     020  VtRoot\fR structure\fL
.fi
.PP
The octal numbers listed are the type numbers used
by the commands below.
(For historical reasons, the type numbers used on
disk and on the wire are different from the above.
They do not distinguish
.BI VtDataType+ n
blocks from
.BI VtDirType+ n
blocks.)
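.PP
To make the numbering in the table concrete, the hypothetical Python
helper below decodes one of the octal type numbers listed above.
It assumes, from the table's layout, that the low three bits give the
pointer depth; it does not model the different numbering used on disk
and on the wire.

```python
VT_DATA_TYPE = 0o00
VT_DIR_TYPE = 0o10
VT_ROOT_TYPE = 0o20


def describe_type(t: int) -> str:
    """Describe what a block of the given (table) type number contains."""
    if t == VT_ROOT_TYPE:
        return "VtRoot structure"
    base, depth = t & ~0o7, t & 0o7  # assumed: low three bits are the depth
    if base == VT_DATA_TYPE:
        leaf, name = "data", "VtDataType"
    elif base == VT_DIR_TYPE:
        leaf, name = "VtEntry structures", "VtDirType"
    else:
        raise ValueError(f"unknown block type {t:03o}")
    if depth == 0:
        return leaf
    # A block at depth n holds scores of blocks at depth n-1.
    below = name if depth == 1 else f"{name}+{depth - 1}"
    return f"scores of {below} blocks"
```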
.SS "Zero Truncation
To avoid storing the same short data blocks padded with
differing numbers of zeros, Venti clients working with fixed-size
blocks conventionally
`zero truncate' the blocks before writing them to the server.
For example, if a 1024-byte data block contains the
11-byte string
.RB ` hello " " world '
followed by 1013 zero bytes,
a client would store only the 11-byte block.
When the client later read the block from the server,
it would append zero bytes to the end as necessary to
reach the expected size.
.PP
When truncating pointer blocks
.RB ( VtDataType+ \fIn
and
.BI VtDirType+ n
blocks),
trailing zero scores are removed
instead of trailing zero bytes.
.PP
Because of the truncation convention,
any file consisting entirely of zero bytes,
no matter what its length, will be represented by the zero score:
the data blocks contain all zeros and are thus truncated
to the empty block, and the pointer blocks contain all zero scores
and are thus also truncated to the empty block,
and so on up the hash tree.
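.PP
The truncation convention can be sketched in Python as shown below.
The function names are hypothetical and the code is only a model of the
convention, not the library implementation described in venti(2).
It assumes a pointer block is a whole number of 20-byte scores and that
extension pads a pointer block with as many zero scores as fit.

```python
import hashlib

SCORE_SIZE = 20
ZERO_SCORE = hashlib.sha1(b"").digest()  # score of the zero-length block


def zero_truncate(block: bytes, is_pointer: bool) -> bytes:
    """Drop trailing zero bytes (data) or trailing zero scores (pointers)."""
    if not is_pointer:
        return block.rstrip(b"\x00")
    if len(block) % SCORE_SIZE:
        raise ValueError("pointer block is not a whole number of scores")
    n = len(block)
    while n >= SCORE_SIZE and block[n - SCORE_SIZE:n] == ZERO_SCORE:
        n -= SCORE_SIZE
    return block[:n]


def zero_extend(block: bytes, size: int, is_pointer: bool) -> bytes:
    """Undo zero_truncate after reading a block back from the server."""
    if not is_pointer:
        return block.ljust(size, b"\x00")
    missing = max(0, (size - len(block)) // SCORE_SIZE)
    return block + ZERO_SCORE * missing
```

With these helpers, a 1024-byte block holding hello world followed by
zeros truncates to 11 bytes, and a pointer block consisting only of
zero scores truncates to the empty block, whose score is again the
zero score.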
.SS "Network Protocol
A Venti session begins when a
.I client
connects to the network address served by a Venti
.IR server ;
the conventional address is
.BI tcp! server !venti
(the
.B venti
port is 17034).
Both client and server begin by sending a version
string of the form
.BI venti- versions - comment \en \fR.
The
.I versions
field is a list of acceptable versions separated by
colons.
The protocol described here is version
.BR 02 .
The client is responsible for choosing a common
version and sending it in the
.B VtThello
message, described below.
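.PP
A rough Python sketch of the version exchange follows.
The function names are hypothetical, and the rule for picking among
several common versions (the first of the client's own list that the
server also offers) is an assumption; the text above says only that the
client chooses a common version.

```python
def make_version_string(versions, comment):
    """Build 'venti-<versions>-<comment>\\n' with colon-separated versions."""
    return ("venti-" + ":".join(versions) + "-" + comment + "\n").encode("ascii")


def parse_version_string(line: bytes):
    """Return (list of versions, comment) from a peer's version string."""
    text = line.decode("ascii")
    if not text.startswith("venti-") or not text.endswith("\n"):
        raise ValueError("not a Venti version string")
    versions, sep, comment = text[len("venti-"):-1].partition("-")
    if not sep or not versions:
        raise ValueError("malformed Venti version string")
    return versions.split(":"), comment


def choose_version(ours, theirs):
    """Assumed policy: first of our versions that the peer also accepts."""
    for v in ours:
        if v in theirs:
            return v
    raise ValueError("no common protocol version")
```

The chosen version is what the client would then place in the
version field of its VtThello message.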
.PP
After the initial version exchange, the client transmits
.I requests
.RI ( T-messages )
to the server, which subsequently returns
.I replies
.RI ( R-messages )
to the client.
The combined act of transmitting (receiving) a request
of a particular type, and receiving (transmitting) its reply
is called a
.I transaction
of that type.
.PP
Each message consists of a sequence of bytes.
Two-byte fields hold unsigned integers represented
in big-endian order (most significant byte first).
Data items of variable lengths are represented by
a one-byte field specifying a count,
.IR n ,
followed by
.I n
bytes of data.
Text strings are represented similarly,
using a two-byte count with
the text itself stored as a UTF-encoded sequence
of Unicode characters (see
.IR utf (6)).
Text strings are not
.SM NUL\c
-terminated:
.I n
counts the bytes of UTF data, which include no final
zero byte.
The
.SM NUL
character is illegal in text strings in the Venti protocol.
The maximum string length in Venti is 1024 bytes.
.PP
Each Venti message begins with a two-byte size field
specifying the length in bytes of the message,
not including the length field itself.
The next byte is the message type, one of the constants
in the enumeration in the include file
.BR <venti.h> .
The next byte is an identifying
.IR tag ,
used to match responses to requests.
The remaining bytes are parameters of different sizes.
In the message descriptions, the number of bytes in a field
is given in brackets after the field name.
The notation
.IR parameter [ n ]
where
.I n
is not a constant represents a variable-length parameter:
.IR n [1]
followed by
.I n
bytes of data forming the
.IR parameter .
The notation
.IR string [ s ]
(using a literal
.I s
character)
is shorthand for
.IR s [2]
followed by
.I s
bytes of UTF-8 text.
The notation
.IR parameter []
where
.I parameter
is the last field in the message represents a
variable-length field that comprises all remaining
bytes in the message.
.PP
All Venti RPC messages are prefixed with a field
.IR size [2]
giving the length of the message that follows
(not including the
.I size
field itself).
The message bodies are:
.ta \w'\fLVtTgoodbye 'u
.IP
.ne 2v
.B VtThello
.IR tag [1]
.IR version [ s ]
.IR uid [ s ]
.IR strength [1]
.IR crypto [ n ]
.IR codec [ n ]
.br
.B VtRhello
.IR tag [1]
.IR sid [ s ]
.IR rcrypto [1]
.IR rcodec [1]
.IP
.ne 2v
.B VtTping
.IR tag [1]
.br
.B VtRping
.IR tag [1]
.IP
.ne 2v
.B VtTread
.IR tag [1]
.IR score [20]
.IR type [1]
.IR pad [1]
.IR count [2]
.br
.B VtRread
.IR tag [1]
.IR data []
.IP
.ne 2v
.B VtTwrite
.IR tag [1]
.IR type [1]
.IR pad [3]
.IR data []
.br
.B VtRwrite
.IR tag [1]
.IR score [20]
.IP
.ne 2v
.B VtTsync
.IR tag [1]
.br
.B VtRsync
.IR tag [1]
.IP
.ne 2v
.B VtRerror
.IR tag [1]
.IR error [ s ]
.IP
.ne 2v
.B VtTgoodbye
.IR tag [1]
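.PP
The layouts above can be illustrated with a short Python sketch that
packs a few T-messages.
The function names are hypothetical, and the numeric message types are
assumptions made for the example; the authoritative values are the
enumeration constants in <venti.h>.
The type argument to the read and write packers is the on-wire block
type, passed through unchanged.

```python
import struct

# Assumed message type numbers; consult <venti.h> for the real values.
VT_THELLO, VT_TREAD, VT_TWRITE = 4, 12, 14


def pack_string(s: str) -> bytes:
    """string[s]: two-byte big-endian count, then UTF-8 text without NUL."""
    raw = s.encode("utf-8")
    if b"\x00" in raw or len(raw) > 1024:
        raise ValueError("illegal Venti string")
    return struct.pack(">H", len(raw)) + raw


def pack_var(data: bytes) -> bytes:
    """parameter[n]: one-byte count, then n bytes of data."""
    if len(data) > 255:
        raise ValueError("variable-length parameter too long")
    return bytes([len(data)]) + data


def frame(msgtype: int, tag: int, body: bytes = b"") -> bytes:
    """Prefix size[2], which excludes the size field itself."""
    payload = bytes([msgtype, tag]) + body
    return struct.pack(">H", len(payload)) + payload


def pack_thello(tag, version, uid, strength=0, crypto=b"", codec=b""):
    body = (pack_string(version) + pack_string(uid) + bytes([strength])
            + pack_var(crypto) + pack_var(codec))
    return frame(VT_THELLO, tag, body)


def pack_tread(tag, score, blocktype, count):
    if len(score) != 20:
        raise ValueError("score must be 20 bytes")
    # score[20] type[1] pad[1] count[2]
    return frame(VT_TREAD, tag,
                 score + bytes([blocktype, 0]) + struct.pack(">H", count))


def pack_twrite(tag, blocktype, data):
    # type[1] pad[3] data[]
    return frame(VT_TWRITE, tag, bytes([blocktype, 0, 0, 0]) + data)
```

Under these assumptions a VtTread message is always 28 bytes long:
two bytes of size followed by a 26-byte payload.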
.PP
Each T-message has a one-byte
.I tag
field, chosen and used by the client to identify the message.
The server will echo the request's
.I tag
field in the reply.
Clients should arrange that no two outstanding
messages have the same tag field so that responses
can be distinguished.
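.PP
One simple way for a client to keep outstanding tags distinct is a
small pool, sketched here in Python.
The class is hypothetical, and treating all 256 values of the one-byte
field as usable is an assumption of the sketch.

```python
class TagPool:
    """Hand out one-byte tags so no two outstanding requests share one."""

    def __init__(self):
        self._free = list(range(255, -1, -1))  # pop() hands out 0 first
        self._pending = {}                     # tag -> request awaiting reply

    def acquire(self, request):
        """Reserve a tag for a request that is about to be sent."""
        if not self._free:
            raise RuntimeError("all 256 tags are outstanding")
        tag = self._free.pop()
        self._pending[tag] = request
        return tag

    def release(self, tag):
        """Match a reply's tag to its request and free the tag for reuse."""
        request = self._pending.pop(tag)  # KeyError: no such outstanding tag
        self._free.append(tag)
        return request
```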
.PP
The type of an R-message will either be one greater than
the type of the corresponding T-message or
.BR VtRerror ,
indicating that the request failed.
In the latter case, the
.I error
field contains a string describing the reason for failure.
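.PP
A client might check a reply against its request along the following
lines.
This Python sketch uses hypothetical names, and the numeric value of
VtRerror is an assumption; the real constant is defined in <venti.h>.

```python
import struct

VT_RERROR = 1  # assumed value; see <venti.h>


class VentiError(Exception):
    """The server answered a request with VtRerror."""


def parse_reply(msg: bytes, sent_type: int, sent_tag: int) -> bytes:
    """Validate one framed R-message and return its parameter bytes."""
    (size,) = struct.unpack(">H", msg[:2])
    payload = msg[2:2 + size]
    if size < 2 or len(payload) != size:
        raise ValueError("short or truncated message")
    rtype, tag, body = payload[0], payload[1], payload[2:]
    if tag != sent_tag:
        raise ValueError("reply tag does not match request")
    if rtype == VT_RERROR:
        # error[s]: two-byte count followed by UTF-8 text
        (n,) = struct.unpack(">H", body[:2])
        raise VentiError(body[2:2 + n].decode("utf-8"))
    if rtype != sent_type + 1:
        raise ValueError("reply type does not match request")
    return body
```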
.PP
Venti connections must begin with a
.B hello
transaction.
The
.B VtThello
message contains the protocol
.I version
that the client has chosen to use.
The fields
.IR strength ,
.IR crypto ,
and
.IR codec
could be used to add authentication, encryption,
and compression to the Venti session
but are currently ignored.
The
.I rcrypto
and
.I rcodec
fields in the
.B VtRhello
response are similarly ignored.
The
.IR uid
and
.IR sid
fields are intended to be the identity
of the client and server but, given the lack of
authentication, should be treated only as advisory.
The initial
.B hello
should be the only
.B hello
transaction during the session.
.PP
The
.B ping
message has no effect and
is used mainly for debugging.
Servers should respond immediately to pings.
.PP
The
.B read
message requests a block with the given
.I score
and
.IR type .
Use
.I vttodisktype
and
.I vtfromdisktype
(see
.IR venti (2))
to convert a block type enumeration value
.RB ( VtDataType ,
etc.)
to the
.I type
used on disk and in the protocol.
The
.I count
field specifies the maximum expected size
of the block.
The
.I data
in the reply is the block's contents.
.PP
The
.B write
message writes a new block of the given
.I type
with contents
.I data
to the server.
The response includes the
.I score
to use to read the block,
which should be the SHA1 hash of
.IR data .
.PP
The Venti server may buffer written blocks in memory,
waiting until after responding to the
.B write
message before writing them to
permanent storage.
The server will delay the response to a
.B sync
message until after all blocks in earlier
.B write
messages have been written to permanent storage.
.PP
The
.B goodbye
message ends a session.  There is no
.BR VtRgoodbye :
upon receiving the
.BR VtTgoodbye
message, the server closes the connection.
.SH SEE ALSO
.IR venti (1),
.IR venti (2),
.IR venti (8)
.br
Sean Quinlan and Sean Dorward,
``Venti: a new approach to archival storage'',
.I "Usenix Conference on File and Storage Technologies" ,
2002.