<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>

<head>

<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<title>Venti: a new approach to archival storage</title>
</head>

<body bgcolor="white">

<h1>Venti: a new approach to archival storage</h1>

<p>

Sean Quinlan and Sean Dorward
<br>

Bell Labs, Lucent Technologies
<p>

<h1>Abstract</h1>
<p>
This paper describes a network storage system, called Venti, intended
for archival data. In this system, a unique hash of a block's
contents acts as the block identifier for read and write operations.
This approach enforces a write-once policy, preventing accidental or
malicious destruction of data. In addition, duplicate copies of a
block can be coalesced, reducing the consumption of storage and
simplifying the implementation of clients. Venti is a building block
for constructing a variety of storage applications such as logical
backup, physical backup, and snapshot file systems.
<p>

We have built a prototype of the system and present some preliminary
performance results. The system uses magnetic disks as the storage
technology, resulting in an access time for archival data that is
comparable to non-archival data. The feasibility of the write-once
model for storage is demonstrated using data from over a decade's use
of two Plan 9 file systems.
<p>

<h1>1. Introduction</h1>
<p>

Archival storage is a second class citizen. Many computer
environments provide access to a few recent versions of the
information stored in file systems and databases, though this access
can be tedious and may require the assistance of a system
administrator. Less common is the ability for a user to examine data
from last month or last year or last decade. Such a feature may not
be needed frequently, but when it is needed it is often crucial.
<p>

The growth in capacity of storage technologies exceeds the ability of
many users to generate data, making it practical to archive data in
perpetuity. Plan 9, the computing environment that the authors use,
includes a file system that stores archival data to an optical jukebox
[16, 17]. Ken Thompson observed that, for our usage patterns, the
capacity of the jukebox could be considered infinite. In the time it
took for us to fill the jukebox, the improvement in technology would
allow us to upgrade to a new jukebox with twice the capacity.
<p>

Abundant storage suggests that an archival system impose a write-once
policy. Such a policy prohibits either a user or administrator from
deleting or modifying data once it is stored. This approach greatly
reduces the opportunities for accidental or malicious data loss and
simplifies the system's implementation.
<p>

Moreover, our experience with Plan 9 is that a write-once policy
changes the way one views storage. Obviously, some data is temporary,
derivative, or so large that it is either undesirable or impractical
to retain forever and should not be archived. However, once it is
decided that the data is worth keeping, the resources needed to store
the data have been consumed and cannot be reclaimed. This eliminates
the task of periodically "cleaning up" and deciding whether the data
is still worth keeping. More thought is required before storing the
data to a write-once archive, but as the cost of storage continues to
fall, this becomes an easy decision.
<p>

This paper describes the design and implementation of an archival
server, called Venti. The goal of Venti is to provide a write-once
archival repository that can be shared by multiple client machines and
applications. In addition, by using magnetic disks as the primary
storage technology, the performance of the system approaches that of
non-archival storage.
<p>
<h1>2. Background</h1>
<p>

A prevalent form of archival storage is the regular backup of data to
magnetic tape [15]. A typical scenario is to provide backup as a
central service for a number of client machines. Client software
interfaces with a database or file system and determines what data to
back up. The data is copied from the client to the tape device, often
over a network, and a record of what was copied is stored in a catalog
database.
<p>

Restoring data from a tape backup system can be tedious and error
prone. The backup system violates the access permission of the file
system, requiring a system administrator or privileged software to
perform the task. Since they are tedious, restore operations are
infrequent and problems with the process may go undetected. Potential
sources of error abound: tapes are mislabeled or reused or lost,
drives wander out of alignment and cannot read their old tapes,
technology becomes obsolete.
<p>

For tape backup systems, a tradeoff exists between the performance of
backup and restore operations [1]. A full backup simplifies the
process of restoring data since all the data is copied to a continuous
region on the tape media. For large file systems and databases,
incremental backups are more efficient to generate, but such backups
are not self-contained; the data for a restore operation is scattered
across multiple incremental backups and perhaps multiple tapes. The
conventional solution is to limit the extent of this scattering by
performing a full backup followed by a small number of incremental
backups.
<p>

File systems such as Plan 9 [16, 17], WAFL [5], and AFS [7] provide a
more unified approach to the backup problem by implementing a snapshot
feature. A snapshot is a consistent read-only view of the file system
at some point in the past. The snapshot retains the file system
permissions and can be accessed with standard tools (ls, cat, cp,
grep, diff) without special privileges or assistance from an
administrator. In our experience, snapshots are a relied-upon and
frequently-used resource because they are always available and easy to
access.
<p>

Snapshots avoid the tradeoff between full and incremental backups.
Each snapshot is a complete file system tree, much like a full backup.
The implementation, however, resembles an incremental backup because
the snapshots and the active file system share any blocks that remain
unmodified; a snapshot only requires additional storage for the blocks
that have changed. To achieve reasonable performance, the device that
stores the snapshots must efficiently support random access, limiting
the suitability of tape storage for this approach.
<p>

In the WAFL and AFS systems, snapshots are ephemeral; only a small
number of recent versions of the file system are retained. This
policy is reasonable since the most recent versions of files are the
most useful. For these systems, archival storage requires an
additional mechanism such as tape backup.
<p>

The philosophy of the Plan 9 file system is that random access storage
is sufficiently cheap that it is feasible to retain snapshots
permanently. The storage required to retain all daily snapshots of a
file system is surprisingly modest; later in the paper we present
statistics for two file servers that have been in use over the last 10
years.
<p>

Like Plan 9, the Elephant file system [18] retains many versions of
data. This system allows a variety of storage reclamation policies
that determine when a version of a file should be deleted. In
particular, "landmark" versions of files are retained permanently and
provide an archival record.
<p>
<h1>3. The Venti Archival Server</h1>
<p>

Venti is a block-level network storage system intended for archival
data. The interface to the system is a simple protocol that enables
client applications to read and write variable sized blocks of data.
Venti itself does not provide the services of a file or backup system,
but rather the backend archival storage for these types of
applications.
<p>

Venti identifies data blocks by a hash of their contents. By using a
collision-resistant hash function with a sufficiently large output, it
is possible to consider the hash of a data block as unique. Such a
unique hash is called the fingerprint of a block and can be used as
the address for read and write operations. This approach results in a
storage system with a number of interesting properties.
<p>

As blocks are addressed by the fingerprint of their contents, a block
cannot be modified without changing its address; the behavior is
intrinsically write-once. This property distinguishes Venti from most
other storage systems, in which the address of a block and its
contents are independent.
<p>

Moreover, writes are idempotent. Multiple writes of the same data can
be coalesced and do not require additional storage space. This
property can greatly increase the effective storage capacity of the
server since it does not rely on the behavior of client applications.
For example, an incremental backup application may not be able to
determine exactly which blocks have changed, resulting in unnecessary
duplication of data. On Venti, such duplicate blocks will be
discarded and only one copy of the data will be retained. In fact,
replacing the incremental backup with a full backup will consume the
same amount of storage. Even duplicate data from different
applications and machines can be eliminated if the clients write the
data using the same block size and alignment.
<p>

The hash function can be viewed as generating a universal name space
for data blocks. Without cooperating or coordinating, multiple
clients can share this name space and share a Venti server. Moreover,
the block level interface places few restrictions on the structures
and format that clients use to store their data. In contrast,
traditional backup and archival systems require more centralized
control. For example, backup systems include some form of job
scheduler to serialize access to tape devices and may only support a
small number of predetermined data formats so that the catalog system
can extract pertinent meta-data.
<p>

Venti provides inherent integrity checking of data. When a block is
retrieved, both the client and the server can compute the fingerprint
of the data and compare it to the requested fingerprint. This
operation allows the client to avoid errors from undetected data
corruption and enables the server to identify when error recovery is
necessary.
<p>

Using the fingerprint of a block as its identity facilitates features
such as replication, caching, and load balancing. Since the contents
of a particular block are immutable, the problem of data coherency is
greatly reduced; a cache or a mirror cannot contain a stale or out of
date version of a block.
<p>
<h2>3.1. Choice of Hash Function</h2>
<p>

The design of Venti requires a hash function that generates a unique
fingerprint for every data block that a client may want to store.
Obviously, if the size of the fingerprint is smaller than the size of
the data blocks, such a hash function cannot exist since there are
fewer possible fingerprints than blocks. If the fingerprint is large
enough and randomly distributed, this problem does not arise in
practice. For a server of a given capacity, the likelihood that two
different blocks will have the same hash value, also known as a
collision, can be determined. If the probability of a collision is
vanishingly small, we can be confident that each fingerprint is
unique.
<p>

It is desirable that Venti employ a cryptographic hash function. For
such a function, it is computationally infeasible to find two distinct
inputs that hash to the same value [10]. This property is important
because it prevents a malicious client from intentionally creating
blocks that violate the assumption that each block has a unique
fingerprint. As an additional benefit, using a cryptographic hash
function strengthens a client's integrity check, preventing a
malicious server from fulfilling a read request with fraudulent data.
If the fingerprint of the returned block matches the requested
fingerprint, the client can be confident the server returned the
original data.
<p>

Venti uses the Sha1 hash function [13] developed by the US National
Institute for Standards and Technology (NIST). Sha1 is a popular hash
algorithm for many security systems and, to date, there are no known
collisions. The output of Sha1 is a 160 bit (20 byte) hash value.
Software implementations of Sha1 are relatively efficient; for
example, a 700 MHz Pentium III can compute the Sha1 hash of 8 Kbyte data
blocks in about 130 microseconds, a rate of 60 Mbytes per second.
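<p>
As a concrete illustration (a sketch in Python using the standard hashlib library, not code from the Venti prototype), the fingerprint of a block and the integrity check a client can apply to a returned block look like this:
<pre>
import hashlib

def fingerprint(block: bytes) -> bytes:
    # The Sha1 hash of the block contents is its 20-byte fingerprint.
    return hashlib.sha1(block).digest()

def verify(requested: bytes, block: bytes) -> bytes:
    # A client recomputes the fingerprint of the returned data and
    # compares it to the fingerprint it asked for.
    if fingerprint(block) != requested:
        raise IOError("fingerprint mismatch: corrupt or fraudulent block")
    return block
</pre>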
<p>

Are the 160 bit hash values generated by Sha1 large enough to ensure
the fingerprint of every block is unique? Assuming random hash values
with a uniform distribution, a collection of n different data blocks
and a hash function that generates b bits, the probability p that
there will be one or more collisions is bounded by the number of pairs
of blocks multiplied by the probability that a given pair will
collide, i.e.
<p>

<img src="probablity.gif" ALT="probability">
<p>

Today, a large storage system may contain a petabyte (10^15 bytes) of data.
Consider an even larger system that contains an exabyte (10^18 bytes)
stored as 8 Kbyte blocks (~10^14 blocks). Using the Sha1 hash function, the
probability of a collision is less than 10^-20. Such a scenario seems
sufficiently unlikely that we ignore it and use the Sha1 hash as a
unique identifier for a block. Obviously, as storage technology
advances, it may become feasible to store much more than an exabyte,
at which point it may be necessary to move to a larger hash function.
NIST has already proposed variants of Sha1 that produce 256, 384, and
512 bit results [14]. For the immediate future, however, Sha1 is a
suitable choice for generating the fingerprint of a block.
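<p>
In standard notation the bound above is p &lt;= n(n-1)/2 * 2^-b. As a check of the quoted figure, the short Python calculation below evaluates this bound for n = 10^14 blocks and b = 160 bits.
<pre>
from fractions import Fraction

n = 10**14          # number of 8 Kbyte blocks in an exabyte-scale store
b = 160             # Sha1 output size in bits

# Union bound: number of pairs times the chance a given pair collides.
p = Fraction(n * (n - 1), 2) / Fraction(2**b)
print(float(p))     # roughly 3.4e-21, i.e. less than 10^-20
</pre>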
<p>

<h2>3.2. Choice of Storage Technology</h2>
<p>

When the Plan 9 file system was designed in 1989, optical jukeboxes
offered high capacity with respectable random access performance and
thus were an obvious candidate for archival storage. The last decade,
however, has seen the capacity of magnetic disks increase at a far
faster rate than optical technologies [20]. Today, a disk array costs
less than the equivalent capacity optical jukebox and occupies less
physical space. Disk technology is even approaching tape in cost per
bit.
<p>

Magnetic disk storage is not as stable or permanent as optical media.
Reliability can be improved with technology such as RAID, but unlike
write-once optical disks, there is little protection from erasure due
to failures of the storage server or RAID array firmware. This issue
is discussed in Section 7.
<p>

Using magnetic disks for Venti has the benefit of reducing the
disparity in performance between conventional and archival storage.
Operations that previously required data to be restored to magnetic
disk can be accomplished directly from the archive. Similarly, the
archive can contain the primary copy of often-accessed read-only data.
In effect, archival data need not be further down the storage
hierarchy; it is differentiated by the write-once policy of the
server.
<p>

<h1>4. Applications</h1>
<p>

Venti is a building block on which to construct a variety of storage
applications. Venti provides a large repository for data that can be
shared by many clients, much as tape libraries are currently the
foundation of many centralized backup systems. Applications need to
accommodate the unique properties of Venti, which are different from
traditional block level storage devices, but these properties enable a
number of interesting features.
<p>

Applications use the block level service provided by Venti to store
more complex data structures. Data is divided into blocks and written
to the server. To enable this data to be retrieved, the application
must record the fingerprints of these blocks. One approach is to pack
the fingerprints into additional blocks, called pointer blocks, that
are also written to the server, a process that can be repeated
recursively until a single fingerprint is obtained. This fingerprint
represents the root of a tree of blocks and corresponds to a
hierarchical hash of the original data.
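<p>
A minimal sketch of this recursive packing follows (Python; write_block stands in for a Venti write, and the choice of 512 fingerprints per pointer block is an arbitrary assumption). It stores a sequence of data blocks and returns the single root fingerprint.
<pre>
import hashlib

STORE = {}   # stand-in for a Venti server: fingerprint -> block contents

def write_block(block: bytes) -> bytes:
    fp = hashlib.sha1(block).digest()    # the fingerprint is the address
    STORE[fp] = block                    # duplicate writes coalesce naturally
    return fp

def write_tree(blocks, fanout=512):
    # Write the data blocks, then repeatedly pack the resulting
    # fingerprints into pointer blocks until one root fingerprint is left.
    fps = [write_block(b) for b in blocks]        # assumes at least one block
    while len(fps) > 1:
        groups = [fps[i:i + fanout] for i in range(0, len(fps), fanout)]
        fps = [write_block(b"".join(g)) for g in groups]
    return fps[0]
</pre>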
<p>

A simple data structure for storing a linear sequence of data blocks
is shown in Figure 1. The data blocks are located via a fixed depth
tree of pointer blocks which itself is addressed by a root
fingerprint. Applications can use such a structure to store a single
file or to mimic the behavior of a physical device such as a tape or a
disk drive. The write-once nature of Venti does not allow such a tree
to be modified, but new versions of the tree can be generated
efficiently by storing the new or modified data blocks and reusing the
unchanged sections of the tree as depicted in Figure 2.
<p>

<img src="SimpleTree.gif" ALT="simple tree">
<p>
Figure 1. A tree structure for storing a linear sequence of blocks
<p>

<img src="ModifiedTree.gif" ALT="modified tree">
<p>
Figure 2. Build a new version of the tree.
<p>

By mixing data and fingerprints in a block, more complex data
structures can be constructed. For example, a structure for storing a
file system may include three types of blocks: directory, pointer, and
data. A directory block combines the meta information for a file and
the fingerprint of a tree of data blocks containing the file's
contents. The depth of the tree can be determined from the size of
the file, assuming the pointer and data blocks have a fixed size.
Other structures are obviously possible. Venti's block-level
interface leaves the choice of format to client applications and
different data structures can coexist on a single server.
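<p>
With fixed block and pointer sizes the required depth follows directly from the file size; the helper below is an illustrative sketch with arbitrarily chosen sizes, not the format used by any particular client.
<pre>
import math

def tree_depth(file_size, data_block_size=8192, fingerprints_per_pointer=512):
    # Number of data blocks needed for the file, then the number of
    # pointer-block levels needed to reduce them to a single root.
    nblocks = max(1, math.ceil(file_size / data_block_size))
    depth = 0
    while nblocks > 1:
        nblocks = math.ceil(nblocks / fingerprints_per_pointer)
        depth += 1
    return depth
</pre>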
<p>

The following sections describe three applications that use Venti as
an archival data repository: a user level archive utility called vac,
a proposal for a physical level backup utility, and our preliminary
work on a new version of the Plan 9 file system.
<p>

<h2>4.1. Vac</h2>
<p>

Vac is an application for storing a collection of files and
directories as a single object, similar in functionality to the
utilities tar and zip. With vac, the contents of the selected files
are stored as a tree of blocks on a Venti server. The root
fingerprint for this tree is written to a vac archive file specified
by the user, which consists of an ASCII representation of the 20 byte
root fingerprint plus a fixed header string, and is always 45 bytes
long. A corresponding program, called unvac, enables the user to
restore files from a vac archive. Naturally, unvac requires access to
the Venti server that contains the actual data, but in most situations
this is transparent. For a user, it appears that vac compresses any
amount of data down to 45 bytes.
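<p>
As a sketch of what such an archive file might contain (the "vac:" prefix and trailing newline here are assumptions for illustration; the paper specifies only a fixed header plus the ASCII fingerprint, 45 bytes in total):
<pre>
def write_vac_file(path, root_fingerprint: bytes):
    # 4-byte header + 40 hex characters + newline = 45 bytes.
    line = "vac:" + root_fingerprint.hex() + "\n"
    with open(path, "w") as f:
        f.write(line)
</pre>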
<p>

An important attribute of vac is that it writes each file as a
separate collection of Venti blocks, thus ensuring that duplicate
copies of a file will be coalesced on the server. If multiple users
vac the same data, only one copy will be stored on the server.
Similarly, a user may repeatedly vac a directory over time and even if
the contents of the directory change, the additional storage consumed
on the server will be related to the extent of the changes rather than
the total size of the contents. Since Venti coalesces data at the
block level, even files that change may share many blocks with
previous versions and thus require little space on the server; log and
database files are good examples of this scenario.
<p>

On many Unix systems, the dump utility is used to back up file
systems. Dump has the ability to perform incremental backups of data;
a user specifies a dump level, and only files that are new or have
changed since the last dump at this level are written to the archive.
To implement incremental backups, dump examines the modified time
associated with each file, which is an efficient method of filtering
out the unchanged files.
<p>

Vac also implements an incremental option based on the file
modification times. The user specifies an existing vac file and this
archive is used to reduce the number of blocks written to the Venti
server. For each file, vac examines the modified time in both the
file system and the vac archive. If they are the same, vac copies the
fingerprint for the file from the old archive into the new archive.
Copying just the 20-byte fingerprint enables the new archive to
include the entire file without reading the data from the file system
or writing the data across the network to the Venti server. In
addition, unlike an incremental dump, the resulting archive will be
identical to an archive generated without the incremental option; it
is only a performance improvement. This means there is no need to
have multiple levels of backups, some incremental, some full, and so
restore operations are greatly simplified.
<p>

A variant of the incremental option improves the backup of files
without reference to modification times. As vac reads a file, it
computes the fingerprint for each block. Concurrently, the pointer
blocks of the old archive are examined to determine the fingerprint
for the block at the same offset in the old version of the file. If
the fingerprints are the same, the block does not need to be written
to Venti. Instead, the fingerprint can simply be copied into the
appropriate pointer block. This optimization reduces the number of
writes to the Venti server, saving both network and disk bandwidth.
Like the file level optimization above, the resulting vac file is no
different from the one produced without this optimization. It does,
however, require the data for the file to be read and is only
effective if there are a significant number of unchanged blocks.
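<p>
A minimal sketch of this variant (Python; write_block, old_fingerprint_at, and fingerprint are hypothetical helpers standing in for a Venti write, a lookup in the old archive's pointer blocks, and the Sha1 hash):
<pre>
def incremental_file(blocks, old_fingerprint_at, write_block, fingerprint):
    # Build the pointer entries for one file.  A block is written to the
    # server only when its fingerprint differs from the one recorded at
    # the same offset in the previous version of the file.
    entries = []
    for offset, block in enumerate(blocks):
        fp = fingerprint(block)
        if fp != old_fingerprint_at(offset):
            write_block(block)        # new or changed data
        entries.append(fp)            # unchanged blocks are only referenced
    return entries
</pre>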
<p>

<h2>4.2. Physical backup</h2>
<p>

Utilities such as vac, tar, and dump archive data at the file or
logical level: they walk the file hierarchy converting both data and
meta-data into their own internal format. An alternative approach is
block-level or physical backup, in which the disk blocks that make up
the file system are directly copied without interpretation. Physical
backup has a number of benefits including simplicity and potentially
much higher throughput [8]. A physical backup utility for file
systems that stores the resulting data on Venti appears attractive,
though we have not yet implemented such an application.
<p>

The simplest form of physical backup is to copy the raw contents of
one or more disk drives to Venti. The backup also includes a tree of
pointer blocks, which enables access to the data blocks. Like vac,
the end result is a single fingerprint representing the root of the
tree; that fingerprint needs to be recorded outside of Venti.
<p>

Coalescing duplicate blocks is the main advantage of making a physical
backup to Venti rather than copying the data to another storage medium
such as tape. Since file systems are inherently block based, we
expect coalescing to be effective. Not only will backups of a file
system over time share many unchanged blocks, but even file systems
for different machines that are running the same operating system may
have many blocks in common. As with vac, the user sees a full backup
of the device, while retaining the storage space advantages of an
incremental backup.
<p>

One enhancement to physical backup is to copy only blocks that are
actively in use in the file system. For most file system formats it
is relatively easy to determine if a block is in use or free without
walking the file system hierarchy. Free blocks generally contain the
remnants of temporary files that were created and removed in the time
between backups and it is advantageous not to store such blocks. This
optimization requires that the backup format be able to represent
missing blocks, which can easily be achieved on Venti by storing a
null value for the appropriate entry in the pointer tree.
<p>

The random access performance of Venti is sufficiently good that it is
possible to use a physical backup without first restoring it to disk.
With operating system support, it is feasible to directly mount a
backup file system image from Venti. Access to this file system is
read only, but it provides a natural method of restoring a subset of
files. For situations where a full restore is required, it might be
possible to do this restore in a lazy fashion, copying blocks from
Venti to the file system as needed, instead of copying the entire
contents of the file system before resuming normal operation.
<p>

The time to perform a physical backup can be reduced using a variety
of incremental techniques. Like vac, the backup utility can compute
the fingerprint of each block and compare this fingerprint with the
appropriate entry in the pointer tree of a previous backup. This
optimization reduces the number of writes to the Venti server. If the
file system provides information about which blocks have changed, as
is the case with WAFL, the backup utility can avoid even reading the
unchanged blocks. Again, a major advantage of using Venti is that the
backup utility can implement these incremental techniques while still
providing the user with a full backup. The backup utility writes the
new blocks to the Venti server and constructs a pointer tree with the
appropriate fingerprint for the unchanged blocks.
<p>

<h2>4.3. Plan 9 File system</h2>
<p>

When combined with a small amount of read/write storage, Venti can be
used as the primary location for data rather than a place to store
backups. A new version of the Plan 9 file system, which we are
developing, exemplifies this approach.
<p>

Previously, the Plan 9 file system was stored on a combination of
magnetic disks and a write-once optical jukebox. The jukebox
furnishes the permanent storage for the system, while the magnetic
disks act as a cache for the jukebox. The cache provides faster file
access and, more importantly, accumulates the changes to the file
system during the period between snapshots. When a snapshot is taken,
new or modified blocks are written from the disk cache to the jukebox.
<p>

The disk cache can be smaller than the active file system, needing
only to be big enough to contain the daily changes to the file system.
However, accesses that miss the cache are significantly slower since
changing platters in the jukebox takes several seconds. This
performance penalty makes certain operations on old snapshots
prohibitively expensive. Also, on the rare occasions when the disk
cache has been reinitialized due to corruption, the file server spends
several days filling the cache before performance returns to normal.
<p>

The new version of the Plan 9 file system uses Venti instead of an
optical jukebox as its storage device. Since the performance of Venti
is comparable to disk, this substitution equalizes access both to the
active and to the archival view of the file system. It also allows
the disk cache to be quite small; the cache accumulates changes to the
file system between snapshots, but does not speed file access.
<p>
<h1>5. Implementation</h1>
<p>

We have implemented a prototype of Venti. The implementation uses an
append-only log of data blocks and an index that maps fingerprints to
locations in this log. It also includes a number of features that
improve robustness and performance. This section gives a brief
overview of the implementation. Figure 3 shows a block diagram of the
server.
<p>

<img src="Block.gif" ALT="block diagram">
<p>
Figure 3. A block diagram of the Venti prototype.
<p>

Since Venti is intended for archival storage, one goal of our
prototype is robustness. The approach we have taken is to separate
the storage of data blocks from the index used to locate a block. In
particular, blocks are stored in an append-only log on a RAID array of
disk drives. The simplicity of the append-only log structure
eliminates many possible software errors that might cause data
corruption and facilitates a variety of additional integrity
strategies. A separate index structure allows a block to be
efficiently located in the log; however, the index can be regenerated
from the data log if required and thus does not have the same
reliability constraints as the log itself.
<p>

The structure of the data log is illustrated in Figure 4. To ease
maintenance, the log is divided into self-contained sections called
arenas. Each arena contains a large number of data blocks and is
sized to facilitate operations such as copying to removable media.
Within an arena is a section for data blocks that is filled in an
append-only manner. In Venti, data blocks are variable sized, up to a
current limit of 52 Kbytes, but since blocks are immutable they can be
densely packed into an arena without fragmentation.
<p>

<img src="LogFormat.gif" ALT="log format">
<p>
Figure 4. The format of the data log.
<p>

Each block is prefixed by a header that describes the contents of the
block. The primary purpose of the header is to provide integrity
checking during normal operation and to assist in data recovery. The
header includes a magic number, the fingerprint and size of the block,
the time when the block was first written, and the identity of the user
that wrote it. The header also includes a user-supplied type
identifier, which is explained in Section 7. Note that only one copy of a
given block is stored in the log; thus the user and wtime fields
correspond to the first time the block was stored to the server.
<p>

Before storing a block in the log, an attempt is made to compress its
contents. The inclusion of data compression increases the effective
capacity of the archive and is simple to add given the log structure.
Obviously, some blocks are incompressible. The encoding field in the
block header indicates whether the data was compressed and, if so, the
algorithm used. The esize field indicates the size of the data after
compression, enabling the location of the next block in the arena to
be determined. The downside of using compression is the computational
cost, typically resulting in a decrease in the rate that blocks can be
stored and retrieved. Our prototype uses a custom Lempel-Ziv '77 [21]
algorithm that is optimized for speed. Compression is not a
performance bottleneck for our existing server. Future
implementations may benefit from hardware solutions.
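<p>
Gathering the fields mentioned above, the block header can be modeled roughly as follows (a sketch only; the field widths, ordering, and on-disk layout of the prototype are assumptions, not its actual format):
<pre>
from dataclasses import dataclass

@dataclass
class BlockHeader:
    magic: int          # identifies the start of a block in the log
    fingerprint: bytes  # 20-byte Sha1 hash of the uncompressed data
    type: int           # user-supplied type identifier (see Section 7)
    size: int           # size of the uncompressed data
    encoding: int       # compression algorithm, or "none"
    esize: int          # size of the data as stored, after compression
    wtime: int          # time the block was first written
    user: str           # identity of the user that first wrote it
</pre>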
<p>

In addition to a log of data blocks, an arena includes a header, a
directory, and a trailer. The header identifies the arena. The
directory contains a copy of the block header and offset for every
block in the arena. By replicating the headers of all the blocks in
one relatively small part of the arena, the server can rapidly check
or rebuild the system's global block index. The directory also
facilitates error recovery if part of the arena is destroyed or
corrupted. The trailer summarizes the current state of the arena
itself, including the number of blocks and the size of the log.
Within the arena, the data log and the directory start at opposite
ends and grow towards each other. When the arena is filled, it is
marked as sealed, and a fingerprint is computed for the contents of
the entire arena. Sealed arenas are never modified.
<p>

The basic operation of Venti is to store and retrieve blocks based on
their fingerprints. A fingerprint is 160 bits long, and the number of
possible fingerprints far exceeds the number of blocks stored on a
server. The disparity between the number of fingerprints and blocks
means it is impractical to map the fingerprint directly to a location
on a storage device. Instead, we use an index to locate a block
within the log.
<p>

We implement the index using a disk-resident hash table as illustrated
in Figure 5. The index is divided into fixed-sized buckets, each of
which is stored as a single disk block. Each bucket contains the
index map for a small section of the fingerprint space. A hash
function is used to map fingerprints to index buckets in a roughly
uniform manner, and then the bucket is examined using binary search.
If provisioned with sufficient buckets, the index hash table will be
relatively empty and bucket overflows will be extremely rare. If a
bucket does overflow, the extra entries are placed in an adjacent
bucket. This structure is simple and efficient, requiring one disk
access to locate a block in almost all cases.
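<p>
The lookup path can be sketched as follows (Python; the bucket-selection rule and the entry layout are illustrative assumptions, not the prototype's on-disk format):
<pre>
import bisect

def lookup(fingerprint: bytes, nbuckets: int, read_bucket):
    # Map the fingerprint to one bucket, read that single disk block, and
    # binary-search its sorted (fingerprint, log address) entries.
    bucket = int.from_bytes(fingerprint, "big") % nbuckets
    entries = read_bucket(bucket)            # sorted list of (fp, address)
    i = bisect.bisect_left(entries, (fingerprint,))
    if i &lt; len(entries) and entries[i][0] == fingerprint:
        return entries[i][1]                 # location in the data log
    return None                              # block is not on this server
</pre>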
<p>

<img src="Index.gif" ALT="index format">
<p>
Figure 5. Format of the index.
<p>

The need to go through an index is the main performance penalty for
Venti compared to a conventional block storage device. Our prototype
uses three techniques to increase the performance: caching, striping,
and write buffering.
<p>

The current implementation has two important caches of approximately
equal size: a block cache and an index cache. A hit in the block
cache returns the data for that fingerprint, bypassing both the
index lookup and access to the data log. Hits in the index cache
eliminate only the index lookup, but the entries are much smaller and
the hit rate correspondingly higher.
<p>

Unfortunately, these caches do not speed the process of storing a new
block to Venti. The server must check that the block is not a
duplicate by examining the index. If the block is not contained on
the server, it will obviously not be in any cache. Since the
fingerprint of the block contains no internal structure, the location
of a fingerprint in the index is essentially random. Furthermore, the
archival nature of Venti means the entire index will not fit in memory
because of the large number of blocks. Combining these factors means
that the write performance of Venti will be limited to the random IO
performance of the index disk, which for current technology is a few
hundred accesses per second. By striping the index across multiple
disks, however, we get a linear speedup. This requires a sufficient
number of concurrent accesses, which we assure by buffering the writes
before accessing the index.
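<p>
The resulting write path can be summarized by the sketch below (Python; lookup, append_to_log, and update_index are placeholders for the operations described above, and the write buffering and index striping are omitted):
<pre>
import hashlib

def write(block: bytes, lookup, append_to_log, update_index) -> bytes:
    # Writes are idempotent: if the fingerprint is already indexed the
    # existing copy is reused and no new storage is consumed.
    fp = hashlib.sha1(block).digest()
    if lookup(fp) is None:
        addr = append_to_log(fp, block)   # append-only data log
        update_index(fp, addr)            # record its location
    return fp                             # the block's address
</pre>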
<p>

The prototype Venti server is implemented for the Plan 9 operating
system in about 10,000 lines of C. The server runs on a dedicated dual
550 MHz Pentium III processor system with 2 Gbyte of memory and is
accessed over a 100 Mbps Ethernet network. The data log is stored on a
500 Gbyte MaxTronic IDE Raid 5 Array and the index resides on a string
of 8 Seagate Cheetah 18XL 9 Gbyte SCSI drives.
<p>

<h1>6. Performance</h1>
<p>

Table 1 gives the preliminary performance results for read and write
operations in a variety of situations. For comparison, we include the
SCSI performance of the RAID array. Although the performance is still
several times slower than directly accessing the disk, we believe the
results are promising and will improve as the system matures.
<p>
Table 1. The performance of read and write operations in Mbytes/s for 8 Kbyte blocks.
<p>
<table align=center>
<tr>
<th></th>
<th width=150>sequential reads</th>
<th width=150>random reads</th>
<th width=150>virgin writes</th>
<th width=150>duplicate writes</th>
</tr>
<tr>
<td>uncached</td>
<td align=center>0.9</td>
<td align=center>0.4</td>
<td align=center>3.7</td>
<td align=center>5.6</td>
</tr>
<tr>
<td>index cache</td>
<td align=center>4.2</td>
<td align=center>0.7</td>
<td align=center>-</td>
<td align=center>6.2</td>
</tr>
<tr>
<td>block cache</td>
<td align=center>6.8</td>
<td align=center>-</td>
<td align=center>-</td>
<td align=center>6.5</td>
</tr>
<tr>
<td>raw raid</td>
<td align=center>14.8</td>
<td align=center>1.0</td>
<td align=center>12.4</td>
<td align=center>12.4</td>
</tr>
</table>
<p>
The uncached sequential read performance is particularly bad. The
problem is that these sequential reads require a random read of the
index. Without assistance from the client, the read operations are
not overlapped and do not benefit from the striping of the index. One
possible solution is a form of read-ahead. When reading a block from
the data log, it is feasible to also read several following blocks.
These extra blocks can be added to the caches without referencing the
index. If blocks are read in the same order they were written to the
log, the latency of uncached index lookups will be avoided. This
strategy should work well for streaming data such as multimedia files.
<p>

The basic assumption in Venti is that the growth in capacity of disks
combined with the removal of duplicate blocks and compression of their
contents enables a model in which it is not necessary to reclaim space
by deleting archival data. To demonstrate why we believe this model
is practical, we present some statistics derived from a decade's use
of the Plan 9 file system.
<p>

The computing environment in which we work includes two Plan 9 file
servers named bootes and emelie. Bootes was our primary file
repository from 1990 until 1997 at which point it was superseded by
emelie. Over the life of these two file servers there have been 522
user accounts of which between 50 and 100 were active at any given
time. The file servers have hosted numerous development projects and
also contain several large data sets including chess end games,
astronomical data, satellite imagery, and multimedia files.
<p>

Figure 6 depicts the size of the active file system as measured over
time by du, the space consumed on the jukebox, and the size of the
jukebox's data if it were to be stored on Venti. The ratio of the
size of the archival data and the active file system is also given.
As can be seen, even without using Venti, the storage required to
implement the daily snapshots in Plan 9 is relatively modest, a result
of the block level incremental approach to generating a snapshot.
When the archival data is stored to Venti the cost of retaining the
snapshots is reduced significantly. In the case of the emelie file
system, the size on Venti is only slightly larger than the active file
system; the cost of retaining the daily snapshots is almost zero.
Note that the amount of storage that Venti uses for the snapshots
would be the same even if more conventional methods were used to back
up the file system. The Plan 9 approach to snapshots is not a
necessity, since Venti will remove duplicate blocks.
<p>
<img src="bootes.gif" ALT="storage sizes for bootes">
<img src="emelie.gif" ALT="storage sizes for emelie">
<img src="bootes2.gif" ALT="ratio of sizes for bootes">
<img src="emelie2.gif" ALT="ratio of sizes for emelie">
<p>
Figure 6. Graphs of the various sizes of two Plan 9 file servers.
<p>
When stored on Venti, the size of the jukebox data is reduced by three
factors: elimination of duplicate blocks, elimination of block
fragmentation, and compression of the block contents. Table 2
presents the percent reduction for each of these factors. Note that
bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the
effect of removing fragmentation is more significant on emelie.
<p>

The 10 year history of the two Plan 9 file servers may be of interest
to other researchers. We have made available per-block information
including a hash of each block's contents, all the block pointers, and
most of the directory information. The traces do not include the
actual contents of files nor the file names. There is sufficient
information to reconstruct the structure of the file system and to
track the daily changes to this structure over time. The traces are
available at http://www.cs.bell-labs.com/~seanq/p9trace.html.
<p>

Table 2. The percentage reduction in the size of data stored on
Venti.
<p>
<table align=center>
<tr>
<th></th>
<th width=150>bootes</th>
<th width=150>emelie</th>
</tr>
<tr>
<td>Elimination of duplicates</td>
<td align=center>27.8%</td>
<td align=center>31.3%</td>
</tr>
<tr>
<td>Elimination of fragments</td>
<td align=center>10.2%</td>
<td align=center>25.4%</td>
</tr>
<tr>
<td>Data Compression</td>
<td align=center>33.8%</td>
<td align=center>54.1%</td>
</tr>
<tr>
<td>Total Reduction</td>
<td align=center>59.7%</td>
<td align=center>76.5%</td>
</tr>
</table>
<p>
<h1>7. Reliability and Recovery</h1>
<p>

In concert with the development of the Venti prototype, we have built
a collection of tools for integrity checking and error recovery.
Example uses of these tools include: verifying the structure of an
arena, checking there is an index entry for every block in the data
log and vice versa, rebuilding the index from the data log, and
copying an arena to removable media. These tools directly access the
storage devices containing the data log and index and are executed on
the server.
<p>

The directory structure at the end of each arena enhances the
efficiency of many integrity and recovery operations, since it is
typically two orders of magnitude smaller than the arena, yet contains
most of the needed information. The index checking utility, for
example, is implemented as a disk based sort of all the arena
directories, followed by a comparison between this sorted list and the
index. Our prototype currently contains approximately 150 million
blocks using 250 Gbytes of storage. An index check takes 2.2 hours,
which is significantly less than the 6 hours it takes to read all the
log data.
<p>

An additional integrity and recovery feature is the association of a
type identifier with every block. This 8 bit identifier is included
with all client read and write operations and has the effect of
partitioning the server into multiple independent domains. The idea
is that the type indicates the interpretation of the data contained in the
block. A client can use this feature, for example, to indicate that a
block is the root node for a tree of blocks. Currently, the data
format associated with a type is left entirely to the client; the
server does not interpret the type other than to use it in conjunction
with a fingerprint as the key with which to index a block.
<p>

One use of the type identifier is to assist the administrator in
locating blocks for which a user has accidentally lost the
fingerprint. Using a tool on the server, the data log can be scanned
for blocks that match specified criteria, including the block type,
the write time, and user identifier. The type makes it relatively
simple to locate forgotten root blocks. Future uses for the type
might include the ability for the server to determine the location of
fingerprints within a block, enabling the server to traverse the data
structures that have been stored.
<p>

By storing the data log on a RAID 5 disk array, our server is
protected against single drive failures. Obviously, there are many
scenarios where this is not sufficient: multiple drives may fail,
there may be a fire in the machine room, the RAID firmware may contain
bugs, or the device may be stolen.
<p>

Additional protection could be obtained by using one or more off-site
mirrors for the server. We have not yet implemented this strategy,
but the architecture of Venti makes this relatively simple. A
background process on the server copies new blocks from the data log
to the mirrors. This copying can be achieved using the Venti
protocol; the server is simply another client to the mirror.
<p>

Even mirroring may not be sufficient. The implementation of Venti may
contain bugs that can be exploited to compromise the server. An
automated attack may delete data on many servers simultaneously.
Storage devices that provide low level enforcement of a write-once
policy would provide protection against such an attack. Write-once
read-many optical jukeboxes often provide such protection, but this is
not yet common for magnetic disk based storage systems. We have thus
resorted to copying the sealed arenas onto removable media.
<p>
|
|
|
960 |
|
|
|
961 |
<h1>8. Related Work</h1>
|
|
|
962 |
<p>
|
|
|
963 |
|
|
|
964 |
The Stanford Archival Vault [2] is a prototype archival repository
|
|
|
965 |
intended for digital libraries. The archive consists of a write-once
|
|
|
966 |
log of digital objects (files) and several auxiliary indexes for
|
|
|
967 |
locating objects within the log. Objects are identified by the hash
|
|
|
968 |
of their contents using a cyclic redundancy check (CRC). Unlike
|
|
|
969 |
Venti, this system has no way to share data between objects that are
|
|
|
970 |
partially the same, or to build up complex data structures such as a
|
|
|
971 |
file system hierarchy. Rather, the archive consists of a collection
|
|
|
972 |
of separate objects with a limited ability to group objects into sets.
|
|
|
973 |
<p>
|
|
|
974 |
|
|
|
975 |
On Venti, blocks are organized into more complex data structures by
|
|
|
976 |
creating hash-trees, an idea originally proposed by Merkle [11] for an
|
|
|
977 |
efficient digital signature scheme.
|
|
|
978 |
<p>
|
|
|
979 |
|
|
|
980 |
The approach to block retrieval in the Read-Only Secure File System
|
|
|
981 |
(SFSRO) [3] is comparable to Venti. Blocks are identified by the Sha1
|
|
|
982 |
hash of their contents and this idea is applied recursively to build
|
|
|
983 |
up more complex structures. The focus of this system is security, not
|
|
|
984 |
archival storage. An administrator creates a digitally signed
|
|
|
985 |
database offline. The database contains a public read-only file
|
|
|
986 |
system that can be published on multiple servers and efficiently and
|
|
|
987 |
securely accessed by clients. SFSRO outperforms traditional methods
|
|
|
988 |
for providing data integrity between a client and a file server,
|
|
|
989 |
demonstrating an attractive property of hash-based addressing.
|
|
|
990 |
<p>
|
|
|
991 |
|
|
|
Given their similarities, it would be simple to implement SFSRO on top
of Venti. The goal of Venti is to provide a flexible location for
archival storage and SFSRO is a good example of an application that
could use this capability. In fact, using Venti would provide a
trivial solution to SFSRO's problem with stale NFS handles, since data
is never deleted from Venti and thus a stale handle will never be
encountered.
<p>

Content-Derived Names [6] are another example of naming objects based
on a secure hash of their contents. This work addresses the issue of
naming and managing the various binary software components, in
particular shared libraries, that make up an application.
<p>

The philosophy of the Elephant file system [18] is similar to that of
Venti; large, cheap disks make it feasible to retain many versions of
data. A feature of the Elephant system is the ability to specify a
variety of data retention policies, which can be applied to individual
files or directories. These policies attempt to strike a balance
between the costs and benefits of storing every version of a file. In
contrast, Venti focuses on the problem of how to store information
after deciding that it should be retained in perpetuity. A system
such as the Elephant file system could incorporate Venti as the
storage device for the permanent "landmark" versions of files, much as
the Plan 9 file system will use Venti to archive snapshots.
<p>

Self-Securing Storage [19] retains all versions of file system data in
order to provide diagnosis and recovery from security breaches. The
system is implemented as a self-contained network service that exports
an object-based disk interface, providing protection from compromise
of the client operating system. Old data is retained for a window of
time and then deleted to reclaim storage.
<p>

Venti provides many of the features of self-securing storage: the
server is self-contained and accessed through a simple low-level
protocol, malicious users cannot corrupt or delete existing data on
the server, and old versions of data are available for inspection. It
is unlikely that a system would write every file system operation to
Venti since storage is never reclaimed, but not deleting data removes
the constraint that an intrusion must be detected within a limited
window of time. A hybrid approach might retain every version for some
time and some versions for all time. Venti could provide the
long-term storage for such a hybrid.
<p>

<h1>9. Future Work</h1>
<p>

Venti could be distributed across multiple machines; the approach of
identifying data by a hash of its contents simplifies such an
extension. For example, I/O performance could be improved by
replicating the server and using a simple load balancing algorithm.
When storing or retrieving a block, clients direct the operation to a
server based on a few bits of the fingerprint. Such load balancing
could even be hidden from the client application by interposing a
proxy server that performs this operation on behalf of the client.
<p>

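A minimal sketch of the routing step follows, assuming four replicas
and using the first byte of the fingerprint to choose among them; the
server names and the helper are illustrative only, not part of the
Venti protocol.
<pre>
# Illustrative load balancing: a few bits of the fingerprint pick the
# server.  A proxy could apply the same rule for unmodified clients.
import hashlib

SERVERS = ["venti0", "venti1", "venti2", "venti3"]   # hypothetical replicas

def pick_server(score):
    # The Sha1 fingerprint is uniformly distributed, so its leading bits
    # spread blocks evenly across the servers.
    return SERVERS[score[0] % len(SERVERS)]

score = hashlib.sha1(b"some block").digest()
print(pick_server(score))
</pre>
<p>
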
Today, Venti provides little security. After authenticating to the
server, clients can read any block for which they know the
fingerprint. A fingerprint does act as a capability since the space
of fingerprints is large and the Venti protocol does not include a
means of enumerating the blocks on the server. However, this
protection is weak: a single root fingerprint enables access to an
entire file tree, and once a fingerprint is known, there is no way to
restrict access to a particular user. We are exploring ways of
providing better access control.
<p>

To date, the structures we have used for storing data on Venti break
files into a series of fixed-size blocks. Identical blocks are
consolidated on Venti, but this consolidation will not occur if the
data is shifted within the file or an application uses a different
block size. This limitation can be overcome using an adaptation of
Manber's algorithm for finding similarities in files [9]. The idea is
to break files into variable-size blocks based on the identification
of anchor or break points, increasing the occurrence of duplicate
blocks [12]. Such a strategy can be implemented in client
applications with no change to the Venti server.
<p>

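The sketch below shows one such chunker, using a Rabin-Karp style
rolling hash over a small window to pick break points. The window
size, mask, and block-size limits are arbitrary illustrations rather
than the parameters of Manber's algorithm or of any particular system.
<pre>
# Illustrative content-defined chunking: a rolling hash over a sliding
# window marks anchor points, so an insertion early in a file only
# disturbs nearby block boundaries.
BASE = 263                         # base of the rolling polynomial hash
MOD = 1 &lt;&lt; 32
WINDOW = 48                        # bytes of context that define a break point
MASK = (1 &lt;&lt; 13) - 1               # roughly one anchor every 8 Kbytes
MIN_BLOCK, MAX_BLOCK = 2048, 65536
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window

def chunks(data):
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        size = i - start + 1
        if (size >= MIN_BLOCK and (h & MASK) == MASK) or size >= MAX_BLOCK:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start &lt; len(data):
        blocks.append(data[start:])
    return blocks
</pre>
<p>
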
A more detailed analysis of the decade of daily snapshots of the Plan
9 file systems might be interesting. The trace data we have made
publicly available contains approximately the same information used
for other studies of long term file activity [4].
<p>

<h1>10. Conclusion</h1>
<p>

The approach of identifying a block by the Sha1 hash of its contents
is well suited to archival storage. The write-once model and the
ability to coalesce duplicate copies of a block make Venti a useful
building block for a number of interesting storage applications.
<p>

The large capacity of magnetic disks allows archival data to be
retained and available on-line with performance that is comparable to
conventional disks. Stored on our prototype server is over a decade
of daily snapshots of two major departmental file servers. These
snapshots are stored in a little over 200 Gbytes of disk space.
Today, 100 Gbyte drives cost less than $300 and IDE RAID controllers
are included on many motherboards. A scaled-down version of our
server could provide archival storage for a home user at an attractive
price. Tomorrow, when terabyte disks can be had for the same price,
it seems unlikely that archival data will be deleted to reclaim space.
Venti provides an attractive approach to storing that data.
<p>

<h1>11. Acknowledgments</h1>
<p>

This paper was improved by comments and suggestions from Peter Bosch,
Eric Grosse, Lorenz Huelsbergen, Rob Pike, Ross Quinlan, and Cliff
Young, and from six anonymous reviewers. The paper's shepherd was
Ethan L. Miller. We thank them all for their help.
<p>

<h1>12. References</h1>
<p>

[1] Ann Chervenak, Vivekenand Vellanki, and Zachary Kurmas.
Protecting file systems: A survey of backup techniques. In
Proceedings Joint NASA and IEEE Mass Storage Conference, March 1998.
<p>

[2] Arturo Crespo and Hector Garcia-Molina. Archival storage for
digital libraries. In Proceedings of the 3rd ACM International
Conference on Digital Libraries, 1998.
<p>

[3] Kevin Fu, Frans Kaashoek, and David Mazières. Fast and secure
distributed read-only file system. In Proceedings of the 4th
Symposium on Operating Systems Design and Implementation, 2000.
<p>

[4] Timothy J. Gibson, Ethan L. Miller, and Darrell D. E. Long.
Long-term file activity and inter-reference patterns. In Proceedings,
24th International Conference on Technology Management and Performance
Evaluation of Enterprise-Wide Information Systems, Computer
Measurement Group, December 1998.
<p>

[5] Dave Hitz, James Lau, and Michael Malcolm. File system design for
an NFS file server appliance. In Proceedings of the Winter 1994 USENIX
Conference, San Francisco, CA, January 1994.
<p>

[6] J. K. Hollingsworth and E. L. Miller. Using content-derived names
for configuration management. In Proceedings of the 1997 ACM Symposium
on Software Reusability, Boston, May 1997.
<p>

[7] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev
Satyanarayanan, Robert Sidebotham, and Michael West. Scale and
performance in a distributed file system. ACM Transactions on
Computer Systems, 6(1):51-81, February 1988.
<p>

[8] Norman C. Hutchinson, Stephen Manley, Mike Federwisch, Guy Harris,
Dave Hitz, Steven Kleiman, and Sean O'Malley. Logical vs. physical
file system backup. In Proceedings of the 3rd USENIX Symposium on
Operating Systems Design and Implementation (OSDI), 1999.
<p>

[9] Udi Manber. Finding similar files in a large file system. In
Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA,
January 1994.
<p>

[10] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone.
Handbook of Applied Cryptography. CRC Press, 1996.
<p>

[11] Ralph C. Merkle. Protocols for public-key cryptosystems. In
Proceedings of the IEEE Symposium on Security and Privacy, pp.
122-133, April 1980.
<p>

[12] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A
low-bandwidth network file system. In Proceedings of the 18th
Symposium on Operating Systems Principles, October 2001.
<p>

[13] National Institute of Standards and Technology, FIPS 180-1.
Secure Hash Standard. US Department of Commerce, April 1995.
<p>

[14] National Institute of Standards and Technology, Draft FIPS 180-2.
Secure Hash Standard. US Department of Commerce, May 2001.
<p>

[15] Evi Nemeth, Garth Snyder, Scott Seebass, and Trent R. Hein. UNIX
System Administration Handbook, 3rd Edition. Prentice Hall, 2001.
<p>

[16] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken
Thompson, Howard Trickey, and Phil Winterbottom. Plan 9 from Bell
Labs. Computing Systems, Vol. 8, 3, pp. 221-254, Summer 1995.
<p>

[17] Sean Quinlan. A cache worm file system. Software-Practice and
Experience, Vol 21, 12, pp. 1289-1299, December 1991.
<p>

[18] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson,
Alistair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding when to
forget in the Elephant file system. In Proceedings of the 17th
Symposium on Operating Systems Principles, December 12-15, 1999.
<p>

[19] John D. Strunk, Garth R. Goodson, Michael L. Scheinholtz, Craig
A. N. Soules, and Gregory R. Ganger. Self-securing storage: protecting
data in compromised systems. In Proceedings of the 4th Symposium on
Operating Systems Design and Implementation, October 2000.
<p>

[20] D. A. Thompson and J. S. Best. The future of magnetic data
storage technology. IBM Journal of Research and Development, Vol 44,
3, pp. 311-322, May 2000.
<p>

[21] J. Ziv and A. Lempel. A universal algorithm for sequential data
compression. IEEE Trans. Inform. Theory, vol. IT-23, pp. 337-343,
May 1977.
<p>