File replication with the intermezzo file system

The InterMezzo File System
Peter J. Braam, [email protected]
Carnegie Mellon University & Stelias Computing
Michael Callahan, [email protected]
The Roda Group
Phil Schwan, [email protected]
Stelias Computing
To keep our project modest, we are firsttargeting the project of online directory Is it possible to implement a distributed file replication on a system area network, allowing for network and system failures. This exposes protocol features of systems like Coda, while many of the subsystems needed, but leaves getting an implementation that is an order of magnitude simpler? We claim the answer is describes the broader design issues, parts of “yes”, and InterMezzo is well on its way which will be implemented and likely modified towards being a prototype of such a system.
in the process of bringing InterMezzo forward.
The key design decisions were to exploit localfile systems as server storage and as a client The code for InterMezzo will soon be available cache and make the kernel file system driver a wrapper around a local file system. We use arich, high level language and work with Acknowledgements: The authors thank the
asynchronous completion instead of threads.
other members of the Coda project for fruitful Finally, we rely on existing infrastructure, such discussions. This research was supported by as TCP. This is in contrast with systems like the Air Force Materiel Command (AFMC) under Coda and AFS, which implement their entire DARPA contract number F19628-96-C-0061.
infrastructure from the ground up, in C. This IBM and Novell provided additional support.
The views and conclusions contained in here are those of the authors and should not be interpreted as necessarily representing theofficial policies or endorsements, eitherexpress or implied, of AFMC, DARPA, IBM, Introduction
Novell, Carnegie Mellon University, SteliasComputing Inc., The Roda Group LLC or the Designing and implementing distributed file systems poses a major intellectual challenge.
Many excellent ideas have come out of research projects, and some systems offer Designing Distributed File
The basic question we wanted to answer is the There are several components to distributed feasibility of building an interesting distributed file systems. First, network file systems have file system with an order of magnitude less clients, which can fetch data from servers and code than a system like Coda [coda] has, yet send modifications to servers. This requires retaining some of the advanced features. We network request processing including bulk claim the answer is “yes”. In this paper we transfer. The fact that the client gains file system access to the files requires a kernel implementation decisions for the InterMezzo level file system driver. Without adding bells and whistles here we already have a list of Delicate protocols and versioning information Another key contributor to difficulties with file is needed to give good semantics of sharing.
clients that cached versions of the files or correct implementation of a relatively simple directories are out of date. This requires the client to both make requests to servers, on issues have to be addressed, resulting in a behalf of client processes accessing and/or long arduous development path. UCLA’s Ficus modifying file system objects, as well as to [ficus] project made an important contribution answer requests from the server to maintain in this field, namely to use stackable file systems [stackable], also available for Linux [zadok]. These are file system layers that
leverage and extend the functions of other file systems. We use this approach in InterMezzo pervasive in a design. One must minimize the and avoid the difficulties associated with a full number of network interactions, and also make these interactions as efficient as possible.
Security, management and backup add InterMezzo is at present a prototype, and the another dimension to a file system project.
first target for the system is to do reliable directory replication between two nodes, on a Given this picture, it should not come as a secure system area network. Its protocols are surprise that not all the advanced file system suitable for extensions, and a fully featured projects have delivered systems that gained distributed file system can be constructed from widespread use. Fortunately many features and ideas have been explored and thoroughlyevaluated during the last 15 years. A central Overview of InterMezzo
question that has not been answeredappropriately is how to implement file systems Faced with a multitude of design choices, we reliably, with a reasonable amount of code.
drew up the following list of requirements.
1. The server file storage must reside in a distributed systems. Ericsson was facing the daunting task of implementing control systems 2. InterMezzo’s client kernel level file system for telephone exchanges. The Erlang language should exploit existing file systems, and [erlang] was developed as a result of extensive studies how to construct reliable distributed 3. File system objects should have meta-data systems. The Teapot [teapot] project at the suitable for disconnected operation.
4. Scalability and recovery of the distributed specific languages enabling construction and verification of distributed shared memory software. This toolkit was subsequently used 5. The system should perform kernel level by the Berkeley XFS project [xfs]. The ACE toolkit [ace], developed at George Washington frameworks for building distributed systems.
protocols such as rsync for synchronization Interestingly all of these projects have reached conclusions bearing considerable similarity: a server file systems should differ in policy, notifications is exploited, which gives a better overview of the distributed state than the classical “multi-threaded” server models.
have not yet decided if we should identifyInterMezzo file objects (i.e. files & directories) server/device/inode triples. Our system is software is not aware of the presence of file model to another with a relatively small effort.
The file set location database is implemented Security and management are important. AFS as a sparse tree of directories, handled as a and Coda set good examples for management special object in the file system. An update of server space, by working with file sets, record is associated with each modification of a.k.a. volumes. File sets are groups of files the file set tree, and version numbers are that form a sub-tree of the InterMezzo name associated with each update. This allows the space, and which are typically much bigger than a single directory and much smaller than downloading the latest version (which is also an entire disk partition. Coda and AFS, as well as DCE/DFS and Microsoft dfs, have a single name space. This means that all client systems InterMezzo makes a distinction between clients see one large file tree containing all the file and servers, but mostly as a matter of policy, are used, the server maintains the on-diskstructures as authoritative records of file state, Often it is desirable to work with a small and also coordinates state coherence across collection of file sets separately, so we left the InterMezzo clients. Each server will also room to manipulate the InterMezzo exported be an InterMezzo client, but the file system file sets more explicitly than in those systems.
layout on the server is slightly different from that on other clients. A file set mount point on File sets, mount points, servers
a server is not necessarily a directory in the and clients
InterMezzo file system as it is on a client. If the server stores the file set, a symbolic link is Each file tree made available by a cluster of placed at the mount point, which targets a InterMezzo servers is built up of file sets or directory on the server holding the data. This volumes, which are similar to Coda and AFS target directory is the file set holding location volumes. Each file set has root directory and can contain InterMezzo mount points of otherfile sets. An InterMezzo mount point is a Journaling updates and filtering
concept similar to but distinct from a Unix mount point. A client can have any file set asthe root of an InterMezzo file system.
An InterMezzo cache is simply a local media file system, which is mounted with an extra filter layer wrapped around this local file associated with it, describing the server system. The filter layer is called “Presto” for holding the file set. The file set location Linux and “Vivace” for Windows platforms.
database describes the mount points of file • Filter access, to validate currency of the sharing a single file set location database • Journal updates made to the file system A client will be able to choose any file set as manager on the system. Note that Lento acts InterMezzo file system. If the root file set is both as the file server and as the client cache not the default, then all ancestors of the manager. Figure 1 graphically describes the mount point of that file set are invisible, this operation of the InterMezzo filter driver.
accommodates exporting sub-trees of theentire cluster file tree. File sets contain files The code in Presto is quite straightforward.
and directories, and generally user level informed of the file system type that it is if (!HAVE_DATA(inode) && !ISLENTO) associated with dentries, inodes and files of the wrapped file system in the “bottom_ops” structures. The “bottom_ops” are typically those of the ext2 file system. Presto must filter all access and journal all updates, butmust make an exception for the process Updates to objects are made as follows. First a Permit is acquired to execute concurrency control. On clients, the updates are made in modifications of the cache should be immune modification log for reintegration to servers.
The server holding the file set repeats the As an example let us describe the open call for modifications made to the volume and will directory inodes. When an object is accessed and it is not present in the cache, the object is registered for replication. We call the process fetched. Files are presently fetched in their of forwarding modification logs to the clients entirety. Directory fetching is implemented by creating the name, and creating sparse files reintegration and update notification are and empty subdirectories for each of their entries. While this introduces significantlatency in the access to directories, The modification logs details modifications compensation comes from the fact that the made to directories and file attributes, as well subsequent acquisition of attributes of the as all close operations on files. Such files, which have been closed after a modification, are then “back-fetched” by the server. If a In pseudo code, the kernel open method has server detects that it does not have a directory created by the client, it recursively back-fetches the contents of that directory from the Figure 1: Basic operation of the InterMezzo system.
In a trusted environment, an optimization can take place, which simply changes the holder ofthe file set to the system modifying the file set.
This system then is responsible for updating al if (!HAVE_PERMIT(inode) && !ISLENTO) Protocols
[dfs-protocols] have many attractive features, and our intention was to preserve many aspects Each cached object (i.e. directory or file) has attributes HAS_DATA, HAS_ATTR to indicate if its data/attributes are valid in the cache.
Before accessing an object, the client will verify A client can place a replication request for a file that it is current, and re-fetch it if necessary.
set with the server holding the file set – the The protocol has a FetchDir and FetchFile call
server will then propagate all updates it receives for fetching and a Validate call for assessing
currency. The presence of a HAS_DATA orHAS_ATTR flag is called a call-back and enables If a file set is modified on the server, an almost the client to reuse the object without contacting identical process takes place. The updates are journaled but now only propagated to clients registered for replication. Other clients will The server will notify clients holding callbacks on invalidate their caches when they learn of the objects before storing modified objects.
updates, and re-fetch the data. In order to InterMezzo’s design could also allow breaking accomplish this, the holding location of a file set these callbacks before the modification is is itself mounted as an InterMezzo file system to initiated. In either case, a server-initiated request BreakCallback is issued. The
BreakCallback request is handled as a multi RPC,
as in Coda. This means that the BreakCallback operation of clients and servers, and Lento request is sent out to all clients and only then running on clients and servers must be capable replies are gathered -- this avoids multiple timeouts. When data is to be modified, a client • File Service: fetching (on clients) and back- will get a permit to do so. There is a GetPermit
fetching (on servers) of files and directories • Reintegration service: receiving modification acquisition is indicated with a HAS_PERMIT flag logs (on servers) and update notifications Every object will be identified by an identifier We see that both servers and clients should be capable of making modifications and serving identifier and version-stamp are identical. Inaddition to object version stamps, volumes have A key distinction is that a server implements stamps too, and allow for rapid revalidation of security while a client can trust an authenticated entire volumes, in the common case where no server and implement changes without checking permissions on an object basis. Secondly a clientmay make changes to its cache solely for freeing InterMezzo can validate cached objects with up space. Such modifications are, of course, not Validate requests. As in Coda, these can be
issued at the volume and file system object For prototyping we used the POE (Perl Object Environment) toolkit, see [poe]. POEimplements a session framework as described The propagation of updates is done through the above and has Wheels that consist of Drivers to Reintegrate request. This request uses the
do the I/O and Filters to pre-process the data modification log, in a way that is similar to Coda’s reintegration. The client modification log(CML) is shipped to the server. The server We added two wheels to POE. The PacketWheel proceeds to incorporate the changes to the is there to send and receive network requests, directory tree and then fetches the files from the which the filter unpacks and then gives to the client, for which close operations were in the connection. The connection determines the destination session for the request, which canbe an existing session or the request dispatcher, Coda avoids many race conditions by including in case a new or specially chosen session must the version with the RPC for the file affected by handle the request. Our UpcallWheel unpacks the update. For example, the creation of a new requests originating in the kernel, which reach directory would include the version of the parent Lento through a character device /dev/presto.
directory present on the client. This allows the server to make sure that the server version being modified is that of the client. If the versions are unequal, the reintegration of this modification causes a conflict, which needs special handling. InterMezzo will do this as well.
The processing of net and kernel requests isgraphically indicated in figures 2 and 3.
So at present InterMezzo’s protocol is very As an example we will give the pseudo code forthe upcall session servicing an “FetchFile” Lento – cache manager and file
request from the kernel. The notation below describes a session as a hash of event handlers.
Lento is responsible for handling file service requests from the network or the kernel.
Traditional servers and cache managers are implemented using a thread pool model. The requests come in and are placed on a queue.
requests, and block while doing I/O, during which other worker threads can continue.
We chose to do start with a single threaded asynchronous I/O. In Lento, requests come in and a session is instantiated to handle them. A session is not a thread, but a data structure that contains state and event handlers. I/O is done completion of I/O, through dispatch of events.
The kernel activates all event handlers and is also responsible for garbage collecting sessions when no events can reach them. Sessions have Each of the event handlers may engage in ablocking I/O operation, for example to fetch the attributes. In doing so it indicates what event the upcall. A queue associated with the session will complete the asynchronous operation.
allows further requests to fetch the data for the Ultimately, when the data is all on the client, the same object to be handled by the session that is request completes to the kernel by replying to got_upcall
g o t_ e rro r
g o t_ e rro r
g o t_ w h e e l
Lento also needs a variety of daemon sessions, example, we have the ReqDispatcher, which which are not garbage collected but perform sets up new request handling sessions. For the management of client initiated connections, we use Connector daemons, which create a TCP packet: requests, messages, replies, data, end- socket connect to a server, and manage the Lento also uses various tables: tables exist for servers, volumes, and file system objects, each Completion Tokens (ACT’s, see [ace]) which with their own attributes. When Lento starts a encapsulate the source and destination sessions for request handling. There are 6 types of Some of the set-up processes are quite involved.
stamp. If the version stamps of the objects are For example, setting up the file system database equal the objects should be equal, and version (Fsdb) may involve recovery. (Currently this changes originating from a single client should consists simply of wiping the cache.) The server semantics supplement the Fsdb database withcallback and permit tracking, as these are Consider the situation of a crash while fetching data into the cache. The situation to avoid is to believe that the cache is up to date, while in fact Recovery and cache validation in
the data was not fully fetched. To achieve this, more depth
the version stamp should not be written topersistent storage until it is certain that the On Linux, InterMezzo can use the ext2 file fetched data has made it to the disk. On Linux system to hold server and cached data. We this can be achieved by simply waiting 30 envisage having a fast and sophisticated file seconds before flushing the Fsdb database system database (Fsdb) for extra meta-data buffers. Upon reconnection to a server, al objects having an out of date version stamp managed by Lento, and would hold the version should be flushed from the cache. If recovery is information for cached objects, as well as inode- followed by disconnected operation, one could to-filename or inode-to file identifier translation allow the possibly incompletely fetched objects information. If a client crashes some of the to be used, for maximum availability. Such data will be held in buffers, other data will be on objects can be identified as those having a ctime the disk and it is necessary to think through less than 30 seconds since the last Fsdb flush.
precisely what is needed for clean recovery.
If such objects are modified and reintegrate,they be regarded as “suspect” and be handled The simplest recovery process is to simply wipe the cache, which for large caches is clearly veryundesirable. The basic attribute for establishing If a client is modifying existing data when currency of a cached object is the version crashing, then the reverse risk exists: the client, when recovering, may see an old version stamp in the database, and possibly also an old mtime sophisticated filtering of file system requests and in the inode. If the object has just been fetched, combine them with up-calls. Simple examples the version stamp may even be older than that of such filter drivers are found in the FileMon on the server. Yet the data on the client disk utility [filemon]. For Windows 9x an additional may be newer than that on the server. There difficulty arises from the non-reentrancy of the are several approaches here, the simplest being operating system. Lento will have to be a 32-bit “laissez faire” and accepting the fact that a few DOS application, just like Coda’s cache manager seconds of data modifications might be lost. This Venus. Fortunately, the Coda project provides is not very desirable, since everything should be the tools to make a port possible, and somewhat done not to lose the latest version of data.
amazingly the Coda DOS cache manager worksvery well.
Alternatively, the inode under modification could be updated with a new mtime and synced to Porting InterMezzo to other Unix systems should disk. Provided modifications do not happen be possible, but might require knowledge only within the granularity of mtime units this allows the recovery mechanism to identify this file assuspect. The collection of suspect files is Conclusion
defined as those having an mtime later than thelast sync time of the Fsdb. Such suspect files With 2,500 lines of C kernel code and 6,000 lines of Perl code (don’t worry, no one liners here) we have managed to implement the basicfunctionality of the InterMezzo file system.
Yet another solution is to write the Fsdbdatabase record to disk, before starting the modification, indicating that the inode has now continue towards a fully functional robust file changed. The latter involves a context switch to Lento, but since disk traffic is involved anyway,this is probably not a significant overhead, see References
This detailed and slow initiation of modification procedure would not affect newly createdobjects.
[bottlenecks]Removing Bottlenecks in Distributed Filesystems: Coda and Intermezzo as examples, controlled situations, for example to provide Peter J. Braam & Philip Nelson, Linux Expo 99.
replication for WWW data. In such caseshandling conflicting updates should not involve user interaction and we will introduce policies to handle conflicts automatically, for example bygiving preference to the server copy and notifying an administrator of a conflict.
InterMezzo on Windows 9x,
Windows 2000 and other Unix

Distributed File Systems for Clusters: a Protocol Perspective, Peter J. Braam, Usenix Technical We believe that the InterMezzo system can fairly Conference, Extreme Linux Track, 1999.
easily be ported to Windows 9x and Windows NT/2000, building on the experience we gained when doing this for Coda [coda-win].
[stackable]John Shelby Heidemann. Stackable Design ofFile Systems. Technical Report CSD-950032, University of California, Los Angeles, September,1995.
Satish Chandra, Brad Richards, and James R.
Larus. Teapot: Language Support for Writing Memory Coherence Protocols. In Proceedings ofthe SIGPLAN '96 Conference on ProgrammingLanguage Design and Implementation (PLDI),


Microsoft word - document

AETNA LIFE INSURANCE COMPANY, the Boeing Company Long Term Disability Plan and the Boeing Company Life Insurance Plan, Defendants. Background: Trustee of plan beneficiary's estate brought action against employee long-term disability benefits plan, employee life insurance plan, and insurer serving as plan administrator, alleging breach of Employee Retirement Income Security Act (ERISA) and se

Microsoft word - march break registration form 2014 _francofun!_

FrancoFUN March Break Session Registration Form Daily Drop-In = $55 per day Directors: Caillie Cipriano (B. Ed.) and Sandra Smith (B. Ed.) I would like to register my child for the following days of the daily Camper Information: March ____________________________________________, 2014 Last Name: ___________________________________________________ ** If you would like to add addi

Copyright © 2011-2018 Health Abstracts