<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>

<head>

<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>Venti: a new approach to archival storage</title>
</head>

<body bgcolor="white">

<h1>Venti: a new approach to archival storage</h1>

<p>

Sean Quinlan and Sean Dorward
<br>

Bell Labs, Lucent Technologies
<p>

<h1>Abstract</h1>
<p>
This paper describes a network storage system, called Venti, intended
for archival data. In this system, a unique hash of a block's
contents acts as the block identifier for read and write operations.
This approach enforces a write-once policy, preventing accidental or
malicious destruction of data. In addition, duplicate copies of a
block can be coalesced, reducing the consumption of storage and
simplifying the implementation of clients. Venti is a building block
for constructing a variety of storage applications such as logical
backup, physical backup, and snapshot file systems.
<p>

We have built a prototype of the system and present some preliminary
performance results. The system uses magnetic disks as the storage
technology, resulting in an access time for archival data that is
comparable to non-archival data. The feasibility of the write-once
model for storage is demonstrated using data from over a decade's use
of two Plan 9 file systems.
<p>

<h1>1. Introduction</h1>
<p>

Archival storage is a second class citizen. Many computer
environments provide access to a few recent versions of the
information stored in file systems and databases, though this access
can be tedious and may require the assistance of a system
administrator. Less common is the ability for a user to examine data
from last month or last year or last decade. Such a feature may not
be needed frequently, but when it is needed it is often crucial.
<p>

The growth in capacity of storage technologies exceeds the ability of
many users to generate data, making it practical to archive data in
perpetuity. Plan 9, the computing environment that the authors use,
includes a file system that stores archival data to an optical jukebox
[16, 17]. Ken Thompson observed that, for our usage patterns, the
capacity of the jukebox could be considered infinite. In the time it
took for us to fill the jukebox, the improvement in technology would
allow us to upgrade to a new jukebox with twice the capacity.
<p>

Abundant storage suggests that an archival system impose a write-once
policy. Such a policy prohibits either a user or administrator from
deleting or modifying data once it is stored. This approach greatly
reduces the opportunities for accidental or malicious data loss and
simplifies the system's implementation.
<p>

Moreover, our experience with Plan 9 is that a write-once policy
changes the way one views storage. Obviously, some data is temporary,
derivative, or so large that it is either undesirable or impractical
to retain forever and should not be archived. However, once it is
decided that the data is worth keeping, the resources needed to store
the data have been consumed and cannot be reclaimed. This eliminates
the task of periodically "cleaning up" and deciding whether the data
is still worth keeping. More thought is required before storing the
data to a write-once archive, but as the cost of storage continues to
fall, this becomes an easy decision.
<p>

This paper describes the design and implementation of an archival
server, called Venti. The goal of Venti is to provide a write-once
archival repository that can be shared by multiple client machines and
applications. In addition, by using magnetic disks as the primary
storage technology, the performance of the system approaches that of
non-archival storage.
<p>
<h1>2. Background</h1>
<p>

A prevalent form of archival storage is the regular backup of data to
magnetic tape [15]. A typical scenario is to provide backup as a
central service for a number of client machines. Client software
interfaces with a database or file system and determines what data to
back up. The data is copied from the client to the tape device, often
over a network, and a record of what was copied is stored in a catalog
database.
<p>

Restoring data from a tape backup system can be tedious and error
prone. The backup system violates the access permission of the file
system, requiring a system administrator or privileged software to
perform the task. Since they are tedious, restore operations are
infrequent and problems with the process may go undetected. Potential
sources of error abound: tapes are mislabeled or reused or lost,
drives wander out of alignment and cannot read their old tapes,
technology becomes obsolete.
<p>

For tape backup systems, a tradeoff exists between the performance of
backup and restore operations [1]. A full backup simplifies the
process of restoring data since all the data is copied to a continuous
region on the tape media. For large file systems and databases,
incremental backups are more efficient to generate, but such backups
are not self-contained; the data for a restore operation is scattered
across multiple incremental backups and perhaps multiple tapes. The
conventional solution is to limit the extent of this scattering by
performing a full backup followed by a small number of incremental
backups.
<p>

File systems such as Plan 9 [16, 17], WAFL [5], and AFS [7] provide a
more unified approach to the backup problem by implementing a snapshot
feature. A snapshot is a consistent read-only view of the file system
at some point in the past. The snapshot retains the file system
permissions and can be accessed with standard tools (ls, cat, cp,
grep, diff) without special privileges or assistance from an
administrator. In our experience, snapshots are a relied-upon and
frequently-used resource because they are always available and easy to
access.
<p>

Snapshots avoid the tradeoff between full and incremental backups.
Each snapshot is a complete file system tree, much like a full backup.
The implementation, however, resembles an incremental backup because
the snapshots and the active file system share any blocks that remain
unmodified; a snapshot only requires additional storage for the blocks
that have changed. To achieve reasonable performance, the device that
stores the snapshots must efficiently support random access, limiting
the suitability of tape storage for this approach.
<p>

In the WAFL and AFS systems, snapshots are ephemeral; only a small
number of recent versions of the file system are retained. This
policy is reasonable since the most recent versions of files are the
most useful. For these systems, archival storage requires an
additional mechanism such as tape backup.
<p>

The philosophy of the Plan 9 file system is that random access storage
is sufficiently cheap that it is feasible to retain snapshots
permanently. The storage required to retain all daily snapshots of a
file system is surprisingly modest; later in the paper we present
statistics for two file servers that have been in use over the last 10
years.
<p>

Like Plan 9, the Elephant file system [18] retains many versions of
data. This system allows a variety of storage reclamation policies
that determine when a version of a file should be deleted. In
particular, "landmark" versions of files are retained permanently and
provide an archival record.
<p>
<h1>3. The Venti Archival Server</h1>
<p>

Venti is a block-level network storage system intended for archival
data. The interface to the system is a simple protocol that enables
client applications to read and write variable sized blocks of data.
Venti itself does not provide the services of a file or backup system,
but rather the backend archival storage for these types of
applications.
<p>

Venti identifies data blocks by a hash of their contents. By using a
collision-resistant hash function with a sufficiently large output, it
is possible to consider the hash of a data block as unique. Such a
unique hash is called the fingerprint of a block and can be used as
the address for read and write operations. This approach results in a
storage system with a number of interesting properties.
<p>

As blocks are addressed by the fingerprint of their contents, a block
cannot be modified without changing its address; the behavior is
intrinsically write-once. This property distinguishes Venti from most
other storage systems, in which the address of a block and its
contents are independent.
<p>

Moreover, writes are idempotent. Multiple writes of the same data can
be coalesced and do not require additional storage space. This
property can greatly increase the effective storage capacity of the
server since it does not rely on the behavior of client applications.
For example, an incremental backup application may not be able to
determine exactly which blocks have changed, resulting in unnecessary
duplication of data. On Venti, such duplicate blocks will be
discarded and only one copy of the data will be retained. In fact,
replacing the incremental backup with a full backup will consume the
same amount of storage. Even duplicate data from different
applications and machines can be eliminated if the clients write the
data using the same block size and alignment.
<p>

The hash function can be viewed as generating a universal name space
for data blocks. Without cooperating or coordinating, multiple
clients can share this name space and share a Venti server. Moreover,
the block level interface places few restrictions on the structures
and format that clients use to store their data. In contrast,
traditional backup and archival systems require more centralized
control. For example, backup systems include some form of job
scheduler to serialize access to tape devices and may only support a
small number of predetermined data formats so that the catalog system
can extract pertinent meta-data.
<p>

Venti provides inherent integrity checking of data. When a block is
retrieved, both the client and the server can compute the fingerprint
of the data and compare it to the requested fingerprint. This
operation allows the client to avoid errors from undetected data
corruption and enables the server to identify when error recovery is
necessary.
<p>

Using the fingerprint of a block as its identity facilitates features
such as replication, caching, and load balancing. Since the contents
of a particular block are immutable, the problem of data coherency is
greatly reduced; a cache or a mirror cannot contain a stale or out of
date version of a block.
<p>
<h2>3.1. Choice of Hash Function</h2>
<p>

The design of Venti requires a hash function that generates a unique
fingerprint for every data block that a client may want to store.
Obviously, if the size of the fingerprint is smaller than the size of
the data blocks, such a hash function cannot exist since there are
fewer possible fingerprints than blocks. If the fingerprint is large
enough and randomly distributed, this problem does not arise in
practice. For a server of a given capacity, the likelihood that two
different blocks will have the same hash value, also known as a
collision, can be determined. If the probability of a collision is
vanishingly small, we can be confident that each fingerprint is
unique.
<p>

It is desirable that Venti employ a cryptographic hash function. For
such a function, it is computationally infeasible to find two distinct
inputs that hash to the same value [10]. This property is important
because it prevents a malicious client from intentionally creating
blocks that violate the assumption that each block has a unique
fingerprint. As an additional benefit, using a cryptographic hash
function strengthens a client's integrity check, preventing a
malicious server from fulfilling a read request with fraudulent data.
If the fingerprint of the returned block matches the requested
fingerprint, the client can be confident the server returned the
original data.
<p>

Venti uses the Sha1 hash function [13] developed by the US National
Institute for Standards and Technology (NIST). Sha1 is a popular hash
algorithm for many security systems and, to date, there are no known
collisions. The output of Sha1 is a 160 bit (20 byte) hash value.
Software implementations of Sha1 are relatively efficient; for
example, a 700 MHz Pentium III can compute the Sha1 hash of 8 Kbyte data
blocks in about 130 microseconds, a rate of 60 Mbytes per second.
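<p>
As a concrete illustration (a sketch in Python using the standard hashlib library, not code from the Venti prototype), the fingerprint of a block and the integrity check a client can apply to a returned block look like this:
<pre>
import hashlib

def fingerprint(block: bytes) -> bytes:
    # The Sha1 hash of the block contents is its 20-byte fingerprint.
    return hashlib.sha1(block).digest()

def verify(requested: bytes, block: bytes) -> bytes:
    # A client recomputes the fingerprint of the returned data and
    # compares it to the fingerprint it asked for.
    if fingerprint(block) != requested:
        raise IOError("fingerprint mismatch: corrupt or fraudulent block")
    return block
</pre>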
<p>

Are the 160 bit hash values generated by Sha1 large enough to ensure
the fingerprint of every block is unique? Assuming random hash values
with a uniform distribution, a collection of n different data blocks
and a hash function that generates b bits, the probability p that
there will be one or more collisions is bounded by the number of pairs
of blocks multiplied by the probability that a given pair will
collide, i.e.
<p>

<img src="probablity.gif" ALT="probability">
<p>

Today, a large storage system may contain a petabyte (10^15 bytes) of data.
Consider an even larger system that contains an exabyte (10^18 bytes)
stored as 8 Kbyte blocks (~10^14 blocks). Using the Sha1 hash function, the
probability of a collision is less than 10^-20. Such a scenario seems
sufficiently unlikely that we ignore it and use the Sha1 hash as a
unique identifier for a block. Obviously, as storage technology
advances, it may become feasible to store much more than an exabyte,
at which point it may be necessary to move to a larger hash function.
NIST has already proposed variants of Sha1 that produce 256, 384, and
512 bit results [14]. For the immediate future, however, Sha1 is a
suitable choice for generating the fingerprint of a block.
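<p>
In standard notation the bound above is p &lt;= n(n-1)/2 * 2^-b. As a check of the quoted figure, the short Python calculation below evaluates this bound for n = 10^14 blocks and b = 160 bits.
<pre>
from fractions import Fraction

n = 10**14          # number of 8 Kbyte blocks in an exabyte-scale store
b = 160             # Sha1 output size in bits

# Union bound: number of pairs times the chance a given pair collides.
p = Fraction(n * (n - 1), 2) / Fraction(2**b)
print(float(p))     # roughly 3.4e-21, i.e. less than 10^-20
</pre>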
<p>

<h2>3.2. Choice of Storage Technology</h2>
<p>

When the Plan 9 file system was designed in 1989, optical jukeboxes
offered high capacity with respectable random access performance and
thus were an obvious candidate for archival storage. The last decade,
however, has seen the capacity of magnetic disks increase at a far
faster rate than optical technologies [20]. Today, a disk array costs
less than the equivalent capacity optical jukebox and occupies less
physical space. Disk technology is even approaching tape in cost per
bit.
<p>

Magnetic disk storage is not as stable or permanent as optical media.
Reliability can be improved with technology such as RAID, but unlike
write-once optical disks, there is little protection from erasure due
to failures of the storage server or RAID array firmware. This issue
is discussed in Section 7.
<p>

Using magnetic disks for Venti has the benefit of reducing the
disparity in performance between conventional and archival storage.
Operations that previously required data to be restored to magnetic
disk can be accomplished directly from the archive. Similarly, the
archive can contain the primary copy of often-accessed read-only data.
In effect, archival data need not be further down the storage
hierarchy; it is differentiated by the write-once policy of the
server.
<p>

<h1>4. Applications</h1>
<p>

Venti is a building block on which to construct a variety of storage
applications. Venti provides a large repository for data that can be
shared by many clients, much as tape libraries are currently the
foundation of many centralized backup systems. Applications need to
accommodate the unique properties of Venti, which are different from
traditional block level storage devices, but these properties enable a
number of interesting features.
<p>

Applications use the block level service provided by Venti to store
more complex data structures. Data is divided into blocks and written
to the server. To enable this data to be retrieved, the application
must record the fingerprints of these blocks. One approach is to pack
the fingerprints into additional blocks, called pointer blocks, that
are also written to the server, a process that can be repeated
recursively until a single fingerprint is obtained. This fingerprint
represents the root of a tree of blocks and corresponds to a
hierarchical hash of the original data.
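<p>
A minimal sketch of this recursive packing follows (Python; write_block stands in for a Venti write, and the choice of 512 fingerprints per pointer block is an arbitrary assumption). It stores a sequence of data blocks and returns the single root fingerprint.
<pre>
import hashlib

STORE = {}   # stand-in for a Venti server: fingerprint -> block contents

def write_block(block: bytes) -> bytes:
    fp = hashlib.sha1(block).digest()    # the fingerprint is the address
    STORE[fp] = block                    # duplicate writes coalesce naturally
    return fp

def write_tree(blocks, fanout=512):
    # Write the data blocks, then repeatedly pack the resulting
    # fingerprints into pointer blocks until one root fingerprint is left.
    fps = [write_block(b) for b in blocks]        # assumes at least one block
    while len(fps) > 1:
        groups = [fps[i:i + fanout] for i in range(0, len(fps), fanout)]
        fps = [write_block(b"".join(g)) for g in groups]
    return fps[0]
</pre>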
<p>

A simple data structure for storing a linear sequence of data blocks
is shown in Figure 1. The data blocks are located via a fixed depth
tree of pointer blocks which itself is addressed by a root
fingerprint. Applications can use such a structure to store a single
file or to mimic the behavior of a physical device such as a tape or a
disk drive. The write-once nature of Venti does not allow such a tree
to be modified, but new versions of the tree can be generated
efficiently by storing the new or modified data blocks and reusing the
unchanged sections of the tree as depicted in Figure 2.
<p>

<img src="SimpleTree.gif" ALT="simple tree">
<p>
Figure 1. A tree structure for storing a linear sequence of blocks
<p>

<img src="ModifiedTree.gif" ALT="modified tree">
<p>
Figure 2. Build a new version of the tree.
<p>

By mixing data and fingerprints in a block, more complex data
structures can be constructed. For example, a structure for storing a
file system may include three types of blocks: directory, pointer, and
data. A directory block combines the meta information for a file and
the fingerprint of a tree of data blocks containing the file's
contents. The depth of the tree can be determined from the size of
the file, assuming the pointer and data blocks have a fixed size.
Other structures are obviously possible. Venti's block-level
interface leaves the choice of format to client applications and
different data structures can coexist on a single server.
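<p>
With fixed block and pointer sizes the required depth follows directly from the file size; the helper below is an illustrative sketch with arbitrarily chosen sizes, not the format used by any particular client.
<pre>
import math

def tree_depth(file_size, data_block_size=8192, fingerprints_per_pointer=512):
    # Number of data blocks needed for the file, then the number of
    # pointer-block levels needed to reduce them to a single root.
    nblocks = max(1, math.ceil(file_size / data_block_size))
    depth = 0
    while nblocks > 1:
        nblocks = math.ceil(nblocks / fingerprints_per_pointer)
        depth += 1
    return depth
</pre>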
<p>

The following sections describe three applications that use Venti as
an archival data repository: a user level archive utility called vac,
a proposal for a physical level backup utility, and our preliminary
work on a new version of the Plan 9 file system.
<p>

<h2>4.1. Vac</h2>
<p>

Vac is an application for storing a collection of files and
directories as a single object, similar in functionality to the
utilities tar and zip. With vac, the contents of the selected files
are stored as a tree of blocks on a Venti server. The root
fingerprint for this tree is written to a vac archive file specified
by the user, which consists of an ASCII representation of the 20 byte
root fingerprint plus a fixed header string, and is always 45 bytes
long. A corresponding program, called unvac, enables the user to
restore files from a vac archive. Naturally, unvac requires access to
the Venti server that contains the actual data, but in most situations
this is transparent. For a user, it appears that vac compresses any
amount of data down to 45 bytes.
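<p>
As a sketch of what such an archive file might contain (the "vac:" prefix and trailing newline here are assumptions for illustration; the paper specifies only a fixed header plus the ASCII fingerprint, 45 bytes in total):
<pre>
def write_vac_file(path, root_fingerprint: bytes):
    # 4-byte header + 40 hex characters + newline = 45 bytes.
    line = "vac:" + root_fingerprint.hex() + "\n"
    with open(path, "w") as f:
        f.write(line)
</pre>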
<p>

An important attribute of vac is that it writes each file as a
separate collection of Venti blocks, thus ensuring that duplicate
copies of a file will be coalesced on the server. If multiple users
vac the same data, only one copy will be stored on the server.
Similarly, a user may repeatedly vac a directory over time and even if
the contents of the directory change, the additional storage consumed
on the server will be related to the extent of the changes rather than
the total size of the contents. Since Venti coalesces data at the
block level, even files that change may share many blocks with
previous versions and thus require little space on the server; log and
database files are good examples of this scenario.
<p>

On many Unix systems, the dump utility is used to back up file
systems. Dump has the ability to perform incremental backups of data;
a user specifies a dump level, and only files that are new or have
changed since the last dump at this level are written to the archive.
To implement incremental backups, dump examines the modified time
associated with each file, which is an efficient method of filtering
out the unchanged files.
<p>

Vac also implements an incremental option based on the file
modification times. The user specifies an existing vac file and this
archive is used to reduce the number of blocks written to the Venti
server. For each file, vac examines the modified time in both the
file system and the vac archive. If they are the same, vac copies the
fingerprint for the file from the old archive into the new archive.
Copying just the 20-byte fingerprint enables the new archive to
include the entire file without reading the data from the file system
or writing the data across the network to the Venti server. In
addition, unlike an incremental dump, the resulting archive will be
identical to an archive generated without the incremental option; it
is only a performance improvement. This means there is no need to
have multiple levels of backups, some incremental, some full, and so
restore operations are greatly simplified.
<p>

A variant of the incremental option improves the backup of files
without reference to modification times. As vac reads a file, it
computes the fingerprint for each block. Concurrently, the pointer
blocks of the old archive are examined to determine the fingerprint
for the block at the same offset in the old version of the file. If
the fingerprints are the same, the block does not need to be written
to Venti. Instead, the fingerprint can simply be copied into the
appropriate pointer block. This optimization reduces the number of
writes to the Venti server, saving both network and disk bandwidth.
Like the file level optimization above, the resulting vac file is no
different from the one produced without this optimization. It does,
however, require the data for the file to be read and is only
effective if there are a significant number of unchanged blocks.
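<p>
A minimal sketch of this variant (Python; write_block, old_fingerprint_at, and fingerprint are hypothetical helpers standing in for a Venti write, a lookup in the old archive's pointer blocks, and the Sha1 hash):
<pre>
def incremental_file(blocks, old_fingerprint_at, write_block, fingerprint):
    # Build the pointer entries for one file.  A block is written to the
    # server only when its fingerprint differs from the one recorded at
    # the same offset in the previous version of the file.
    entries = []
    for offset, block in enumerate(blocks):
        fp = fingerprint(block)
        if fp != old_fingerprint_at(offset):
            write_block(block)        # new or changed data
        entries.append(fp)            # unchanged blocks are only referenced
    return entries
</pre>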
<p>

<h2>4.2. Physical backup</h2>
<p>

Utilities such as vac, tar, and dump archive data at the file or
logical level: they walk the file hierarchy converting both data and
meta-data into their own internal format. An alternative approach is
block-level or physical backup, in which the disk blocks that make up
the file system are directly copied without interpretation. Physical
backup has a number of benefits including simplicity and potentially
much higher throughput [8]. A physical backup utility for file
systems that stores the resulting data on Venti appears attractive,
though we have not yet implemented such an application.
<p>

The simplest form of physical backup is to copy the raw contents of
one or more disk drives to Venti. The backup also includes a tree of
pointer blocks, which enables access to the data blocks. Like vac,
the end result is a single fingerprint representing the root of the
tree; that fingerprint needs to be recorded outside of Venti.
<p>

Coalescing duplicate blocks is the main advantage of making a physical
backup to Venti rather than copying the data to another storage medium
such as tape. Since file systems are inherently block based, we
expect coalescing to be effective. Not only will backups of a file
system over time share many unchanged blocks, but even file systems
for different machines that are running the same operating system may
have many blocks in common. As with vac, the user sees a full backup
of the device, while retaining the storage space advantages of an
incremental backup.
<p>

One enhancement to physical backup is to copy only blocks that are
actively in use in the file system. For most file system formats it
is relatively easy to determine if a block is in use or free without
walking the file system hierarchy. Free blocks generally contain the
remnants of temporary files that were created and removed in the time
between backups and it is advantageous not to store such blocks. This
optimization requires that the backup format be able to represent
missing blocks, which can easily be achieved on Venti by storing a
null value for the appropriate entry in the pointer tree.
<p>

The random access performance of Venti is sufficiently good that it is
possible to use a physical backup without first restoring it to disk.
With operating system support, it is feasible to directly mount a
backup file system image from Venti. Access to this file system is
read only, but it provides a natural method of restoring a subset of
files. For situations where a full restore is required, it might be
possible to do this restore in a lazy fashion, copying blocks from
Venti to the file system as needed, instead of copying the entire
contents of the file system before resuming normal operation.
<p>

The time to perform a physical backup can be reduced using a variety
of incremental techniques. Like vac, the backup utility can compute
the fingerprint of each block and compare this fingerprint with the
appropriate entry in the pointer tree of a previous backup. This
optimization reduces the number of writes to the Venti server. If the
file system provides information about which blocks have changed, as
is the case with WAFL, the backup utility can avoid even reading the
unchanged blocks. Again, a major advantage of using Venti is that the
backup utility can implement these incremental techniques while still
providing the user with a full backup. The backup utility writes the
new blocks to the Venti server and constructs a pointer tree with the
appropriate fingerprint for the unchanged blocks.
<p>

<h2>4.3. Plan 9 File system</h2>
<p>

When combined with a small amount of read/write storage, Venti can be
used as the primary location for data rather than a place to store
backups. A new version of the Plan 9 file system, which we are
developing, exemplifies this approach.
<p>

Previously, the Plan 9 file system was stored on a combination of
magnetic disks and a write-once optical jukebox. The jukebox
furnishes the permanent storage for the system, while the magnetic
disks act as a cache for the jukebox. The cache provides faster file
access and, more importantly, accumulates the changes to the file
system during the period between snapshots. When a snapshot is taken,
new or modified blocks are written from the disk cache to the jukebox.
<p>

The disk cache can be smaller than the active file system, needing
only to be big enough to contain the daily changes to the file system.
However, accesses that miss the cache are significantly slower since
changing platters in the jukebox takes several seconds. This
performance penalty makes certain operations on old snapshots
prohibitively expensive. Also, on the rare occasions when the disk
cache has been reinitialized due to corruption, the file server spends
several days filling the cache before performance returns to normal.
<p>

The new version of the Plan 9 file system uses Venti instead of an
optical jukebox as its storage device. Since the performance of Venti
is comparable to disk, this substitution equalizes access both to the
active and to the archival view of the file system. It also allows
the disk cache to be quite small; the cache accumulates changes to the
file system between snapshots, but does not speed file access.
<p>
<h1>5. Implementation</h1>
<p>

We have implemented a prototype of Venti. The implementation uses an
append-only log of data blocks and an index that maps fingerprints to
locations in this log. It also includes a number of features that
improve robustness and performance. This section gives a brief
overview of the implementation. Figure 3 shows a block diagram of the
server.
<p>

<img src="Block.gif" ALT="block diagram">
<p>
Figure 3. A block diagram of the Venti prototype.
<p>

Since Venti is intended for archival storage, one goal of our
prototype is robustness. The approach we have taken is to separate
the storage of data blocks from the index used to locate a block. In
particular, blocks are stored in an append-only log on a RAID array of
disk drives. The simplicity of the append-only log structure
eliminates many possible software errors that might cause data
corruption and facilitates a variety of additional integrity
strategies. A separate index structure allows a block to be
efficiently located in the log; however, the index can be regenerated
from the data log if required and thus does not have the same
reliability constraints as the log itself.
<p>

The structure of the data log is illustrated in Figure 4. To ease
maintenance, the log is divided into self-contained sections called
arenas. Each arena contains a large number of data blocks and is
sized to facilitate operations such as copying to removable media.
Within an arena is a section for data blocks that is filled in an
append-only manner. In Venti, data blocks are variable sized, up to a
current limit of 52 Kbytes, but since blocks are immutable they can be
densely packed into an arena without fragmentation.
<p>

<img src="LogFormat.gif" ALT="log format">
<p>
Figure 4. The format of the data log.
<p>

Each block is prefixed by a header that describes the contents of the
block. The primary purpose of the header is to provide integrity
checking during normal operation and to assist in data recovery. The
header includes a magic number, the fingerprint and size of the block,
the time when the block was first written, and the identity of the user
that wrote it. The header also includes a user-supplied type
identifier, which is explained in Section 7. Note that only one copy of a
given block is stored in the log; thus the user and wtime fields
correspond to the first time the block was stored to the server.
<p>

Before storing a block in the log, an attempt is made to compress its
contents. The inclusion of data compression increases the effective
capacity of the archive and is simple to add given the log structure.
Obviously, some blocks are incompressible. The encoding field in the
block header indicates whether the data was compressed and, if so, the
algorithm used. The esize field indicates the size of the data after
compression, enabling the location of the next block in the arena to
be determined. The downside of using compression is the computational
cost, typically resulting in a decrease in the rate that blocks can be
stored and retrieved. Our prototype uses a custom Lempel-Ziv '77 [21]
algorithm that is optimized for speed. Compression is not a
performance bottleneck for our existing server. Future
implementations may benefit from hardware solutions.
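<p>
Gathering the fields mentioned above, the block header can be modeled roughly as follows (a sketch only; the field widths, ordering, and on-disk layout of the prototype are assumptions, not its actual format):
<pre>
from dataclasses import dataclass

@dataclass
class BlockHeader:
    magic: int          # identifies the start of a block in the log
    fingerprint: bytes  # 20-byte Sha1 hash of the uncompressed data
    type: int           # user-supplied type identifier (see Section 7)
    size: int           # size of the uncompressed data
    encoding: int       # compression algorithm, or "none"
    esize: int          # size of the data as stored, after compression
    wtime: int          # time the block was first written
    user: str           # identity of the user that first wrote it
</pre>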
<p>

In addition to a log of data blocks, an arena includes a header, a
directory, and a trailer. The header identifies the arena. The
directory contains a copy of the block header and offset for every
block in the arena. By replicating the headers of all the blocks in
one relatively small part of the arena, the server can rapidly check
or rebuild the system's global block index. The directory also
facilitates error recovery if part of the arena is destroyed or
corrupted. The trailer summarizes the current state of the arena
itself, including the number of blocks and the size of the log.
Within the arena, the data log and the directory start at opposite
ends and grow towards each other. When the arena is filled, it is
marked as sealed, and a fingerprint is computed for the contents of
the entire arena. Sealed arenas are never modified.
<p>

The basic operation of Venti is to store and retrieve blocks based on
their fingerprints. A fingerprint is 160 bits long, and the number of
possible fingerprints far exceeds the number of blocks stored on a
server. The disparity between the number of fingerprints and blocks
means it is impractical to map the fingerprint directly to a location
on a storage device. Instead, we use an index to locate a block
within the log.
<p>

We implement the index using a disk-resident hash table as illustrated
in Figure 5. The index is divided into fixed-sized buckets, each of
which is stored as a single disk block. Each bucket contains the
index map for a small section of the fingerprint space. A hash
function is used to map fingerprints to index buckets in a roughly
uniform manner, and then the bucket is examined using binary search.
If provisioned with sufficient buckets, the index hash table will be
relatively empty and bucket overflows will be extremely rare. If a
bucket does overflow, the extra entries are placed in an adjacent
bucket. This structure is simple and efficient, requiring one disk
access to locate a block in almost all cases.
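<p>
The lookup path can be sketched as follows (Python; the bucket-selection rule and the entry layout are illustrative assumptions, not the prototype's on-disk format):
<pre>
import bisect

def lookup(fingerprint: bytes, nbuckets: int, read_bucket):
    # Map the fingerprint to one bucket, read that single disk block, and
    # binary-search its sorted (fingerprint, log address) entries.
    bucket = int.from_bytes(fingerprint, "big") % nbuckets
    entries = read_bucket(bucket)            # sorted list of (fp, address)
    i = bisect.bisect_left(entries, (fingerprint,))
    if i &lt; len(entries) and entries[i][0] == fingerprint:
        return entries[i][1]                 # location in the data log
    return None                              # block is not on this server
</pre>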
<p>

<img src="Index.gif" ALT="index format">
<p>
Figure 5. Format of the index.
<p>

The need to go through an index is the main performance penalty for
Venti compared to a conventional block storage device. Our prototype
uses three techniques to increase the performance: caching, striping,
and write buffering.
<p>

The current implementation has two important caches of approximately
equal size: a block cache and an index cache. A hit in the block
cache returns the data for that fingerprint, bypassing both the
index lookup and access to the data log. Hits in the index cache
eliminate only the index lookup, but the entries are much smaller and
the hit rate correspondingly higher.
<p>

Unfortunately, these caches do not speed the process of storing a new
block to Venti. The server must check that the block is not a
duplicate by examining the index. If the block is not contained on
the server, it will obviously not be in any cache. Since the
fingerprint of the block contains no internal structure, the location
of a fingerprint in the index is essentially random. Furthermore, the
archival nature of Venti means the entire index will not fit in memory
because of the large number of blocks. Combining these factors means
that the write performance of Venti will be limited to the random IO
performance of the index disk, which for current technology is a few
hundred accesses per second. By striping the index across multiple
disks, however, we get a linear speedup. This requires a sufficient
number of concurrent accesses, which we assure by buffering the writes
before accessing the index.
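<p>
The resulting write path can be summarized by the sketch below (Python; lookup, append_to_log, and update_index are placeholders for the operations described above, and the write buffering and index striping are omitted):
<pre>
import hashlib

def write(block: bytes, lookup, append_to_log, update_index) -> bytes:
    # Writes are idempotent: if the fingerprint is already indexed the
    # existing copy is reused and no new storage is consumed.
    fp = hashlib.sha1(block).digest()
    if lookup(fp) is None:
        addr = append_to_log(fp, block)   # append-only data log
        update_index(fp, addr)            # record its location
    return fp                             # the block's address
</pre>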
<p>

The prototype Venti server is implemented for the Plan 9 operating
system in about 10,000 lines of C. The server runs on a dedicated dual
550 MHz Pentium III processor system with 2 Gbyte of memory and is
accessed over a 100 Mbps Ethernet network. The data log is stored on a
500 Gbyte MaxTronic IDE Raid 5 Array and the index resides on a string
of 8 Seagate Cheetah 18XL 9 Gbyte SCSI drives.
<p>

<h1>6. Performance</h1>
<p>

Table 1 gives the preliminary performance results for read and write
operations in a variety of situations. For comparison, we include the
SCSI performance of the RAID array. Although the performance is still
several times slower than directly accessing the disk, we believe the
results are promising and will improve as the system matures.
<p>
Table 1. The performance of read and write operations in Mbytes/s for 8 Kbyte blocks.
<p>
<table align=center>
<tr>
<th></th>
<th width=150>sequential reads</th>
<th width=150>random reads</th>
<th width=150>virgin writes</th>
<th width=150>duplicate writes</th>
</tr>
<tr>
<td>uncached</td>
<td align=center>0.9</td>
<td align=center>0.4</td>
<td align=center>3.7</td>
<td align=center>5.6</td>
</tr>
<tr>
<td>index cache</td>
<td align=center>4.2</td>
<td align=center>0.7</td>
<td align=center>-</td>
<td align=center>6.2</td>
</tr>
<tr>
<td>block cache</td>
<td align=center>6.8</td>
<td align=center>-</td>
<td align=center>-</td>
<td align=center>6.5</td>
</tr>
<tr>
<td>raw raid</td>
<td align=center>14.8</td>
<td align=center>1.0</td>
<td align=center>12.4</td>
<td align=center>12.4</td>
</tr>
</table>
<p>
The uncached sequential read performance is particularly bad. The
problem is that these sequential reads require a random read of the
index. Without assistance from the client, the read operations are
not overlapped and do not benefit from the striping of the index. One
possible solution is a form of read-ahead. When reading a block from
the data log, it is feasible to also read several following blocks.
These extra blocks can be added to the caches without referencing the
index. If blocks are read in the same order they were written to the
log, the latency of uncached index lookups will be avoided. This
strategy should work well for streaming data such as multimedia files.
<p>

The basic assumption in Venti is that the growth in capacity of disks
combined with the removal of duplicate blocks and compression of their
contents enables a model in which it is not necessary to reclaim space
by deleting archival data. To demonstrate why we believe this model
is practical, we present some statistics derived from a decade's use
of the Plan 9 file system.
<p>

The computing environment in which we work includes two Plan 9 file
servers named bootes and emelie. Bootes was our primary file
repository from 1990 until 1997 at which point it was superseded by
emelie. Over the life of these two file servers there have been 522
user accounts of which between 50 and 100 were active at any given
time. The file servers have hosted numerous development projects and
also contain several large data sets including chess end games,
astronomical data, satellite imagery, and multimedia files.
<p>

Figure 6 depicts the size of the active file system as measured over
time by du, the space consumed on the jukebox, and the size of the
jukebox's data if it were to be stored on Venti. The ratio of the
size of the archival data and the active file system is also given.
As can be seen, even without using Venti, the storage required to
implement the daily snapshots in Plan 9 is relatively modest, a result
of the block level incremental approach to generating a snapshot.
When the archival data is stored to Venti the cost of retaining the
snapshots is reduced significantly. In the case of the emelie file
system, the size on Venti is only slightly larger than the active file
system; the cost of retaining the daily snapshots is almost zero.
Note that the amount of storage that Venti uses for the snapshots
would be the same even if more conventional methods were used to back
up the file system. The Plan 9 approach to snapshots is not a
necessity, since Venti will remove duplicate blocks.
<p>
<img src="bootes.gif" ALT="storage sizes for bootes">
<img src="emelie.gif" ALT="storage sizes for emelie">
<img src="bootes2.gif" ALT="ratio of sizes for bootes">
<img src="emelie2.gif" ALT="ratio of sizes for emelie">
<p>
Figure 6. Graphs of the various sizes of two Plan 9 file servers.
<p>
When stored on Venti, the size of the jukebox data is reduced by three
factors: elimination of duplicate blocks, elimination of block
fragmentation, and compression of the block contents. Table 2
presents the percent reduction for each of these factors. Note that
bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the
effect of removing fragmentation is more significant on emelie.
<p>

The 10 year history of the two Plan 9 file servers may be of interest
to other researchers. We have made available per-block information
including a hash of each block's contents, all the block pointers, and
most of the directory information. The traces do not include the
actual contents of files nor the file names. There is sufficient
information to reconstruct the structure of the file system and to
track the daily changes to this structure over time. The traces are
available at http://www.cs.bell-labs.com/~seanq/p9trace.html.
<p>

Table 2. The percentage reduction in the size of data stored on
Venti.
<p>
<table align=center>
<tr>
<th></th>
<th width=150>bootes</th>
<th width=150>emelie</th>
</tr>
<tr>
<td>Elimination of duplicates</td>
<td align=center>27.8%</td>
<td align=center>31.3%</td>
</tr>
<tr>
<td>Elimination of fragments</td>
<td align=center>10.2%</td>
<td align=center>25.4%</td>
</tr>
<tr>
<td>Data Compression</td>
<td align=center>33.8%</td>
<td align=center>54.1%</td>
</tr>
<tr>
<td>Total Reduction</td>
<td align=center>59.7%</td>
<td align=center>76.5%</td>
</tr>
</table>
<p>
<h1>7. Reliability and Recovery</h1>
<p>

In concert with the development of the Venti prototype, we have built
a collection of tools for integrity checking and error recovery.
Example uses of these tools include: verifying the structure of an
arena, checking there is an index entry for every block in the data
log and vice versa, rebuilding the index from the data log, and
copying an arena to removable media. These tools directly access the
storage devices containing the data log and index and are executed on
the server.
<p>

The directory structure at the end of each arena enhances the
efficiency of many integrity and recovery operations, since it is
typically two orders of magnitude smaller than the arena, yet contains
most of the needed information. The index checking utility, for
example, is implemented as a disk based sort of all the arena
directories, followed by a comparison between this sorted list and the
index. Our prototype currently contains approximately 150 million
blocks using 250 Gbytes of storage. An index check takes 2.2 hours,
which is significantly less than the 6 hours it takes to read all the
log data.
<p>

An additional integrity and recovery feature is the association of a
type identifier with every block. This 8 bit identifier is included
with all client read and write operations and has the effect of
partitioning the server into multiple independent domains. The idea
is that the type indicates the interpretation of the data contained in the
block. A client can use this feature, for example, to indicate that a
block is the root node for a tree of blocks. Currently, the data
format associated with a type is left entirely to the client; the
server does not interpret the type other than to use it in conjunction
with a fingerprint as the key with which to index a block.
<p>

One use of the type identifier is to assist the administrator in
locating blocks for which a user has accidentally lost the
fingerprint. Using a tool on the server, the data log can be scanned
for blocks that match specified criteria, including the block type,
the write time, and user identifier. The type makes it relatively
simple to locate forgotten root blocks. Future uses for the type
might include the ability for the server to determine the location of
fingerprints within a block, enabling the server to traverse the data
structures that have been stored.
<p>

By storing the data log on a RAID 5 disk array, our server is
protected against single drive failures. Obviously, there are many
scenarios where this is not sufficient: multiple drives may fail,
there may be a fire in the machine room, the RAID firmware may contain
bugs, or the device may be stolen.
<p>

Additional protection could be obtained by using one or more off-site
mirrors for the server. We have not yet implemented this strategy,
but the architecture of Venti makes this relatively simple. A
background process on the server copies new blocks from the data log
to the mirrors. This copying can be achieved using the Venti
protocol; the server is simply another client to the mirror.
<p>

Even mirroring may not be sufficient. The implementation of Venti may
contain bugs that can be exploited to compromise the server. An
automated attack may delete data on many servers simultaneously.
Storage devices that provide low level enforcement of a write-once
policy would provide protection against such an attack. Write-once
read-many optical jukeboxes often provide such protection, but this is
not yet common for magnetic disk based storage systems. We have thus
resorted to copying the sealed arenas onto removable media.
<p>
|
|
|
960 |
|
|
|
961 |
<h1>8. Related Work</h1>
|
|
|
962 |
<p>
|
|
|
963 |
|
|
|
964 |
The Stanford Archival Vault [2] is a prototype archival repository
|
|
|
965 |
intended for digital libraries. The archive consists of a write-once
|
|
|
966 |
log of digital objects (files) and several auxiliary indexes for
|
|
|
967 |
locating objects within the log. Objects are identified by the hash
|
|
|
968 |
of their contents using a cyclic redundancy check (CRC). Unlike
|
|
|
969 |
Venti, this system has no way to share data between objects that are
|
|
|
970 |
partially the same, or to build up complex data structures such as a
|
|
|
971 |
file system hierarchy. Rather, the archive consists of a collection
|
|
|
972 |
of separate objects with a limited ability to group objects into sets.
|
|
|
973 |
<p>
|
|
|
974 |
|
|
|
975 |
On Venti, blocks are organized into more complex data structures by
|
|
|
976 |
creating hash-trees, an idea originally proposed by Merkle [11] for an
|
|
|
977 |
efficient digital signature scheme.
|
|
|
978 |
<p>
|
|
|
979 |
|
|
|
980 |
The approach to block retrieval in the Read-Only Secure File System
|
|
|
981 |
(SFSRO) [3] is comparable to Venti. Blocks are identified by the Sha1
|
|
|
982 |
hash of their contents and this idea is applied recursively to build
|
|
|
983 |
up more complex structures. The focus of this system is security, not
|
|
|
984 |
archival storage. An administrator creates a digitally signed
|
|
|
985 |
database offline. The database contains a public read-only file
|
|
|
986 |
system that can be published on multiple servers and efficiently and
|
|
|
987 |
securely accessed by clients. SFSRO outperforms traditional methods
|
|
|
988 |
for providing data integrity between a client and a file server,
|
|
|
989 |
demonstrating an attractive property of hash-based addressing.
|
|
|
990 |
<p>
|
|
|
991 |
|
|
|
Given their similarities, it would be simple to implement SFSRO on top
of Venti. The goal of Venti is to provide a flexible location for
archival storage and SFSRO is a good example of an application that
could use this capability. In fact, using Venti would provide a
trivial solution to SFSRO's problem with stale NFS handles, since data
is never deleted from Venti and thus a stale handle will never be
encountered.
<p>

Content-Derived Names [6] are another example of naming objects based
on a secure hash of their contents. This work addresses the issue of
naming and managing the various binary software components, in
particular shared libraries, that make up an application.
<p>

The philosophy of the Elephant file system [18] is similar to that of
Venti; large, cheap disks make it feasible to retain many versions of
data. A feature of the Elephant system is the ability to specify a
variety of data retention policies, which can be applied to individual
files or directories. These policies attempt to strike a balance
between the costs and benefits of storing every version of a file. In
contrast, Venti focuses on the problem of how to store information
after deciding that it should be retained in perpetuity. A system
such as the Elephant file system could incorporate Venti as the
storage device for the permanent "landmark" versions of files, much as
the Plan 9 file system will use Venti to archive snapshots.
<p>

Self-Securing Storage [19] retains all versions of file system data in
order to provide diagnosis and recovery from security breaches. The
system is implemented as a self-contained network service that exports
an object-based disk interface, providing protection from compromise
of the client operating system. Old data is retained for a window of
time and then deleted to reclaim storage.
<p>

Venti provides many of the features of self-securing storage: the
server is self-contained and accessed through a simple low-level
protocol, malicious users cannot corrupt or delete existing data on
the server, and old versions of data are available for inspection. It
is unlikely that a system would write every file system operation to
Venti since storage is never reclaimed, but not deleting data removes
the constraint that an intrusion must be detected within a limited
window of time. A hybrid approach might retain every version for some
time and some versions for all time. Venti could provide the
long-term storage for such a hybrid.
<p>

<h1>9. Future Work</h1>
<p>

Venti could be distributed across multiple machines; the approach of
identifying data by a hash of its contents simplifies such an
extension. For example, I/O performance could be improved by
replicating the server and using a simple load balancing algorithm.
When storing or retrieving a block, clients direct the operation to a
server based on a few bits of the fingerprint. Such load balancing
could even be hidden from the client application by interposing a
proxy server that performs this operation on behalf of the client.
<p>

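A minimal sketch of the routing step follows, assuming four replicas
and using the first byte of the fingerprint to choose among them; the
server names and the helper are illustrative only, not part of the
Venti protocol.
<pre>
# Illustrative load balancing: a few bits of the fingerprint pick the
# server.  A proxy could apply the same rule for unmodified clients.
import hashlib

SERVERS = ["venti0", "venti1", "venti2", "venti3"]   # hypothetical replicas

def pick_server(score):
    # The Sha1 fingerprint is uniformly distributed, so its leading bits
    # spread blocks evenly across the servers.
    return SERVERS[score[0] % len(SERVERS)]

score = hashlib.sha1(b"some block").digest()
print(pick_server(score))
</pre>
<p>
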
Today, Venti provides little security. After authenticating to the
server, clients can read any block for which they know the
fingerprint. A fingerprint does act as a capability since the space
of fingerprints is large and the Venti protocol does not include a
means of enumerating the blocks on the server. However, this
protection is weak: a single root fingerprint enables access to an
entire file tree, and once a fingerprint is known, there is no way to
restrict access to a particular user. We are exploring ways of
providing better access control.
<p>

To date, the structures we have used for storing data on Venti break
files into a series of fixed-size blocks. Identical blocks are
consolidated on Venti, but this consolidation will not occur if the
data is shifted within the file or an application uses a different
block size. This limitation can be overcome using an adaptation of
Manber's algorithm for finding similarities in files [9]. The idea is
to break files into variable-size blocks based on the identification
of anchor or break points, increasing the occurrence of duplicate
blocks [12]. Such a strategy can be implemented in client
applications with no change to the Venti server.
<p>

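The sketch below shows one such chunker, using a Rabin-Karp style
rolling hash over a small window to pick break points. The window
size, mask, and block-size limits are arbitrary illustrations rather
than the parameters of Manber's algorithm or of any particular system.
<pre>
# Illustrative content-defined chunking: a rolling hash over a sliding
# window marks anchor points, so an insertion early in a file only
# disturbs nearby block boundaries.
BASE = 263                         # base of the rolling polynomial hash
MOD = 1 &lt;&lt; 32
WINDOW = 48                        # bytes of context that define a break point
MASK = (1 &lt;&lt; 13) - 1               # roughly one anchor every 8 Kbytes
MIN_BLOCK, MAX_BLOCK = 2048, 65536
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window

def chunks(data):
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        size = i - start + 1
        if (size >= MIN_BLOCK and (h & MASK) == MASK) or size >= MAX_BLOCK:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start &lt; len(data):
        blocks.append(data[start:])
    return blocks
</pre>
<p>
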
A more detailed analysis of the decade of daily snapshots of the Plan
9 file systems might be interesting. The trace data we have made
publicly available contains approximately the same information used
for other studies of long term file activity [4].
<p>

<h1>10. Conclusion</h1>
<p>

The approach of identifying a block by the Sha1 hash of its contents
is well suited to archival storage. The write-once model and the
ability to coalesce duplicate copies of a block make Venti a useful
building block for a number of interesting storage applications.
<p>

The large capacity of magnetic disks allows archival data to be
retained and available on-line with performance that is comparable to
conventional disks. Stored on our prototype server is over a decade
of daily snapshots of two major departmental file servers. These
snapshots are stored in a little over 200 Gbytes of disk space.
Today, 100 Gbyte drives cost less than $300 and IDE RAID controllers
are included on many motherboards. A scaled-down version of our
server could provide archival storage for a home user at an attractive
price. Tomorrow, when terabyte disks can be had for the same price,
it seems unlikely that archival data will be deleted to reclaim space.
Venti provides an attractive approach to storing that data.
<p>

<h1>11. Acknowledgments</h1>
<p>

This paper was improved by comments and suggestions from Peter Bosch,
Eric Grosse, Lorenz Huelsbergen, Rob Pike, Ross Quinlan, and Cliff
Young, and from six anonymous reviewers. The paper's shepherd was
Ethan L. Miller. We thank them all for their help.
<p>

<h1>12. References</h1>
<p>

[1] Ann Chervenak, Vivekenand Vellanki, and Zachary Kurmas.
Protecting file systems: A survey of backup techniques. In
Proceedings Joint NASA and IEEE Mass Storage Conference, March 1998.
<p>

[2] Arturo Crespo and Hector Garcia-Molina. Archival storage for
digital libraries. In Proceedings of the 3rd ACM International
Conference on Digital Libraries, 1998.
<p>

[3] Kevin Fu, Frans Kaashoek, and David Mazières. Fast and secure
distributed read-only file system. In Proceedings of the 4th
Symposium on Operating Systems Design and Implementation, 2000.
<p>

[4] Timothy J. Gibson, Ethan L. Miller, and Darrell D. E. Long.
Long-term file activity and inter-reference patterns. In Proceedings,
24th International Conference on Technology Management and Performance
Evaluation of Enterprise-Wide Information Systems, Computer
Measurement Group, December 1998.
<p>

[5] Dave Hitz, James Lau, and Michael Malcolm. File system design for
an NFS file server appliance. In Proceedings of the Winter 1994 USENIX
Conference, San Francisco, CA, January 1994.
<p>

[6] J. K. Hollingsworth and E. L. Miller. Using content-derived names
for configuration management. In Proceedings of the 1997 ACM Symposium
on Software Reusability, Boston, May 1997.
<p>

[7] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev
Satyanarayanan, Robert Sidebotham, and Michael West. Scale and
performance in a distributed file system. ACM Transactions on
Computer Systems, 6(1):51-81, February 1988.
<p>

[8] Norman C. Hutchinson, Stephen Manley, Mike Federwisch, Guy Harris,
Dave Hitz, Steven Kleiman, and Sean O'Malley. Logical vs. physical
file system backup. In Proceedings of the 3rd USENIX Symposium on
Operating Systems Design and Implementation (OSDI), 1999.
<p>

[9] Udi Manber. Finding similar files in a large file system. In
Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA,
January 1994.
<p>

[10] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone.
Handbook of Applied Cryptography. CRC Press, 1996.
<p>

[11] Ralph C. Merkle. Protocols for public-key cryptosystems. In
Proceedings of the IEEE Symposium on Security and Privacy, pp.
122-133, April 1980.
<p>

[12] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A
low-bandwidth network file system. In Proceedings of the 18th
Symposium on Operating Systems Principles, October 2001.
<p>

[13] National Institute of Standards and Technology, FIPS 180-1.
Secure Hash Standard. US Department of Commerce, April 1995.
<p>

[14] National Institute of Standards and Technology, Draft FIPS 180-2.
Secure Hash Standard. US Department of Commerce, May 2001.
<p>

[15] Evi Nemeth, Garth Snyder, Scott Seebass, and Trent R. Hein. UNIX
System Administration Handbook, 3rd Edition. Prentice Hall, 2001.
<p>

[16] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken
Thompson, Howard Trickey, and Phil Winterbottom. Plan 9 from Bell
Labs. Computing Systems, Vol. 8, 3, pp. 221-254, Summer 1995.
<p>

[17] Sean Quinlan. A cache worm file system. Software-Practice and
Experience, Vol 21, 12, pp. 1289-1299, December 1991.
<p>

[18] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson,
Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding when to
forget in the Elephant file system. In Proceedings of the 17th
Symposium on Operating Systems Principles, December 12-15, 1999.
<p>

[19] John D. Strunk, Garth R. Goodson, Michael L. Scheinholtz, Craig
A. N. Soules, and Gregory R. Ganger. Self-securing storage: protecting
data in compromised systems. In Proceedings of the 4th Symposium on
Operating Systems Design and Implementation, October 2000.
<p>

[20] D. A. Thompson and J. S. Best. The future of magnetic data
storage technology. IBM Journal of Research and Development, Vol 44,
3, pp. 311-322, May 2000.
<p>

[21] J. Ziv and A. Lempel. A universal algorithm for sequential data
compression. IEEE Trans. Inform. Theory, vol. IT-23, pp. 337-343,
May 1977.
<p>