The InterMezzo File System

Peter J. Braam, [email protected], Carnegie Mellon University & Stelias Computing
Michael Callahan, [email protected], The Roda Group
Phil Schwan, [email protected], Stelias Computing

Abstract
Is it possible to implement a distributed file system with the protocol features of systems like Coda, while getting an implementation that is an order of magnitude simpler? We claim the answer is "yes", and InterMezzo is well on its way towards being a prototype of such a system. The key design decisions were to exploit local file systems as server storage and as a client cache, and to make the kernel file system driver a wrapper around a local file system. We use a rich, high level language and work with asynchronous completion instead of threads. Finally, we rely on existing infrastructure, such as TCP. This is in contrast with systems like Coda and AFS, which implement their entire infrastructure from the ground up, in C.

To keep our project modest, we are first targeting the problem of online directory replication on a system area network, allowing for network and system failures. This exposes many of the subsystems needed, but leaves others for later. This paper describes the broader design issues, parts of which will be implemented and likely modified in the process of bringing InterMezzo forward. The code for InterMezzo will soon be available.

Acknowledgements: The authors thank the other members of the Coda project for fruitful discussions. This research was supported by the Air Force Materiel Command (AFMC) under DARPA contract number F19628-96-C-0061. IBM and Novell provided additional support. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of AFMC, DARPA, IBM, Novell, Carnegie Mellon University, Stelias Computing Inc., The Roda Group LLC or the U.S. Government.

Introduction
Designing and implementing distributed file systems poses a major intellectual challenge. Many excellent ideas have come out of research projects, and some systems offer advanced features in production quality releases.

The basic question we wanted to answer is the feasibility of building an interesting distributed file system with an order of magnitude less code than a system like Coda [coda] has, yet retaining some of the advanced features. We claim the answer is "yes". In this paper we describe the design and implementation decisions for the InterMezzo file system.

InterMezzo is at present a prototype, and the first target for the system is to do reliable directory replication between two nodes, on a secure system area network. Its protocols are suitable for extensions, and a fully featured distributed file system can be constructed from them.

Designing Distributed File Systems

There are several components to distributed file systems. First, network file systems have clients, which can fetch data from servers and send modifications to servers. This requires network request processing including bulk transfer. The fact that the client gains file system access to the files requires a kernel level file system driver. Without adding bells and whistles here, we already have a list of substantial subsystems.

Another key contributor to difficulties with file systems is cache coherence: servers must notify clients that cached versions of the files or directories are out of date. This requires the client to both make requests to servers, on behalf of client processes accessing and/or modifying file system objects, as well as to answer requests from the server to maintain the coherence of its cache.

Delicate protocols and versioning information are needed to give good semantics of sharing. Even for the correct implementation of a relatively simple design, many issues have to be addressed, resulting in a long and arduous development path. UCLA's Ficus [ficus] project made an important contribution in this field, namely to use stackable file systems [stackable], also available for Linux [zadok]. These are file system layers that leverage and extend the functions of other file systems. We use this approach in InterMezzo and avoid the difficulties associated with a full file system implementation.

Performance considerations are pervasive in a design. One must minimize the number of network interactions, and also make these interactions as efficient as possible. Security, management and backup add another dimension to a file system project.

Given this picture, it should not come as a surprise that not all the advanced file system projects have delivered systems that gained widespread use. Fortunately, many features and ideas have been explored and thoroughly evaluated during the last 15 years. A central question that has not been answered appropriately is how to implement file systems reliably, with a reasonable amount of code.

Projects outside the file system arena have studied how to build reliable distributed systems. Ericsson was facing the daunting task of implementing control systems for telephone exchanges. The Erlang language [erlang] was developed as a result of extensive studies of how to construct reliable distributed systems. The Teapot [teapot] project at the University of Wisconsin built domain-specific languages enabling construction and verification of distributed shared memory software. This toolkit was subsequently used by the Berkeley xFS project [xfs]. The ACE toolkit [ace], developed at Washington University, provides frameworks for building distributed systems. Interestingly, all of these projects have reached conclusions bearing considerable similarity: a model based on asynchronous completion notifications is exploited, which gives a better overview of the distributed state than the classical "multi-threaded" server models.

Overview of InterMezzo

Faced with a multitude of design choices, we drew up the following list of requirements:

1. The server file storage must reside in a native local file system.
2. InterMezzo's client kernel level file system should exploit existing file systems, and wrap them with a filter layer.
3. File system objects should have meta-data suitable for disconnected operation.
4. Scalability and recovery of the distributed system should derive from the scalability and recovery of the local file systems.
5. The system should perform kernel level journaling of updates, and exploit protocols such as rsync for synchronization.

Client and server file systems should differ in policy, not in mechanism.
We have not yet decided if we should identify InterMezzo file objects (i.e. files & directories) by path names or by server/device/inode triples. Our system is structured so that we can move from one model to another with a relatively small effort.

Security and management are important. AFS and Coda set good examples for management of server space, by working with file sets, a.k.a. volumes. File sets are groups of files that form a sub-tree of the InterMezzo name space, and which are typically much bigger than a single directory and much smaller than an entire disk partition. Coda and AFS, as well as DCE/DFS and Microsoft dfs, have a single name space. This means that all client systems see one large file tree containing all the file sets. Often it is desirable to work with a small collection of file sets separately, so we left room to manipulate the InterMezzo exported file sets more explicitly than in those systems.
File sets, mount points, servers and clients

Each file tree made available by a cluster of InterMezzo servers is built up of file sets or volumes, which are similar to Coda and AFS volumes. Each file set has a root directory and can contain InterMezzo mount points of other file sets. An InterMezzo mount point is a concept similar to but distinct from a Unix mount point. A client can have any file set as the root of an InterMezzo file system.

Each file set has a record associated with it, describing the server holding the file set. The file set location database describes the mount points of file sets; a cluster is the collection of systems sharing a single file set location database. This database is implemented as a sparse tree of directories, handled as a special object in the file system. An update record is associated with each modification of the file set tree, and version numbers are associated with each update. This allows the members of the cluster to bring their copy up to date by downloading the latest version (which is also handled as a file system object).

A client will be able to choose any file set as the root of its InterMezzo file system. If the root file set is not the default, then all ancestors of the mount point of that file set are invisible; this accommodates exporting sub-trees of the entire cluster file tree. File sets contain files and directories, and generally user level software is not aware of the presence of file sets.

InterMezzo makes a distinction between clients and servers, but mostly as a matter of policy. When the protocols are used, the server maintains the on-disk structures as authoritative records of file state, and also coordinates state coherence across the InterMezzo clients. Each server will also be an InterMezzo client, but the file system layout on the server is slightly different from that on other clients. A file set mount point on a server is not necessarily a directory in the InterMezzo file system as it is on a client. If the server stores the file set, a symbolic link is placed at the mount point, which targets a directory on the server holding the data. This target directory is the file set holding location.
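To make the location database concrete, the sketch below shows what a location record and an update record might look like. This is our own illustration: the field names, host names and paths are hypothetical, not InterMezzo's actual format.

    # Hypothetical file set location record: it ties a mount point in
    # the InterMezzo name space to the server holding the file set and
    # to the holding location on that server.
    my %fileset = (
        mountpoint => '/izo/projects',        # mount point in the name space
        server     => 'server1.example.com',  # server holding the file set
        holding    => '/vol/fsets/projects',  # holding location on the server
        version    => 17,                     # version of the location data
    );

    # Each modification of the file set tree appends an update record
    # carrying the next version number; systems with an older version
    # can catch up by fetching the newer records (or the whole tree).
    my %update = (
        version => 18,
        op      => 'add_fileset',
        args    => { mountpoint => '/izo/home', server => 'server2.example.com' },
    );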
Journaling updates and filtering

An InterMezzo cache is simply a local media file system, which is mounted with an extra filter layer wrapped around this local file system. The filter layer is called "Presto" for Linux and "Vivace" for Windows platforms. The filter has two tasks:

• Filter access, to validate currency of the cached objects
• Journal updates made to the file system

Fetching data and reintegrating updates are left to Lento, the cache manager on the system. Note that Lento acts both as the file server and as the client cache manager. Figure 1 graphically describes the operation of the InterMezzo filter driver.

The code in Presto is quite straightforward.
Presto is informed of the file system type that it is wrapping, and keeps the methods associated with dentries, inodes and files of the wrapped file system in the "bottom_ops" structures. The "bottom_ops" are typically those of the ext2 file system. Presto must filter all access and journal all updates, but must make an exception for the process running Lento, whose modifications of the cache should be immune from filtering and journaling.

Updates to objects are made as follows. First a Permit is acquired to execute concurrency control:

    if (!HAVE_PERMIT(inode) && !ISLENTO)
        /* upcall to Lento to acquire the permit */

On clients, the updates are made in the local file system and recorded in a modification log for reintegration to servers. The server holding the file set repeats the modifications made to the volume and will forward them to other clients registered for replication. We call the process of forwarding modification logs to the clients update notification; together, reintegration and update notification are what keep replicas synchronized.

The modification log details modifications made to directories and file attributes, as well as all close operations on files. Such files, which have been closed after a modification, are then "back-fetched" by the server. If a server detects that it does not have a directory created by the client, it recursively back-fetches the contents of that directory from the client.

As an example let us describe the open call for directory inodes. When an object is accessed and it is not present in the cache, the object is fetched. Files are presently fetched in their entirety. Directory fetching is implemented by fetching the directory contents, creating the names, and creating sparse files and empty subdirectories for each of their entries. While this introduces significant latency in the access to directories, compensation comes from the fact that the subsequent acquisition of attributes of the entries requires no further network traffic. In pseudo code, the kernel open method has the following structure:

    open(inode, file):
        if (!HAVE_DATA(inode) && !ISLENTO)
            /* upcall to Lento to fetch the object */
        bottom_ops->open(inode, file)

Figure 1: Basic operation of the InterMezzo system.

In a trusted environment, an optimization can take place, which simply changes the holder of the file set to the system modifying the file set. This system then is responsible for updating all replicas.
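As an illustration of what the journaling described above might produce, here is a sketch of modification log records in Perl. The record layout, field names and the send_to_server helper are hypothetical; InterMezzo's actual log format is not shown in this paper.

    # Hypothetical client modification log (CML) records: directory
    # operations, attribute changes, and closes after modification.
    my @cml = (
        { op => 'mkdir',   path => '/izo/src/new', mode => 0755, version => 42 },
        { op => 'setattr', path => '/izo/src/a.c', mode => 0644, version => 43 },
        { op => 'close',   path => '/izo/src/a.c', version => 44 },  # data will be back-fetched
    );

    # Reintegration ships the log to the server, which replays the
    # records and back-fetches files named in the close records.
    send_to_server($_) for @cml;    # hypothetical transport helper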
Protocols

The protocols of systems like AFS and Coda [dfs-protocols] have many attractive features, and our intention was to preserve many aspects of them.

Each cached object (i.e. directory or file) has attributes HAS_DATA, HAS_ATTR to indicate if its data/attributes are valid in the cache. Before accessing an object, the client will verify that it is current, and re-fetch it if necessary. The protocol has a FetchDir and FetchFile call for fetching and a Validate call for assessing currency. The presence of a HAS_DATA or HAS_ATTR flag is called a call-back and enables the client to reuse the object without contacting the server.

The server will notify clients holding callbacks on objects before storing modified objects. InterMezzo's design could also allow breaking these callbacks before the modification is initiated. In either case, a server-initiated request BreakCallback is issued. The BreakCallback request is handled as a multi RPC, as in Coda. This means that the BreakCallback request is sent out to all clients and only then replies are gathered; this avoids multiple timeouts. When data is to be modified, a client will get a permit to do so. There is a GetPermit request, and successful acquisition is indicated with a HAS_PERMIT flag.

Every object will be identified by an identifier and a version stamp; two objects are deemed equal when the identifier and version-stamp are identical. In addition to object version stamps, volumes have stamps too, and allow for rapid revalidation of entire volumes, in the common case where no changes have taken place. InterMezzo can validate cached objects with Validate requests. As in Coda, these can be issued at the volume and file system object level.

The propagation of updates is done through the Reintegrate request. This request uses the modification log, in a way that is similar to Coda's reintegration. The client modification log (CML) is shipped to the server. The server proceeds to incorporate the changes to the directory tree and then fetches the files from the client, for which close operations were in the log.

Coda avoids many race conditions by including the version with the RPC for the file affected by the update. For example, the creation of a new directory would include the version of the parent directory present on the client. This allows the server to make sure that the server version being modified is that of the client. If the versions are unequal, the reintegration of this modification causes a conflict, which needs special handling. InterMezzo will do this as well.

A client can place a replication request for a file set with the server holding the file set; the server will then propagate all updates it receives to that client. If a file set is modified on the server, an almost identical process takes place. The updates are journaled but now only propagated to clients registered for replication. Other clients will invalidate their caches when they learn of the updates, and re-fetch the data. In order to accomplish this, the holding location of a file set is itself mounted as an InterMezzo file system, so that updates made on the server are journaled too.

The protocols allow for symmetric operation of clients and servers, and Lento running on clients and servers must be capable of providing:

• File Service: fetching (on clients) and back-fetching (on servers) of files and directories
• Reintegration service: receiving modification logs (on servers) and update notifications (on clients)

We see that both servers and clients should be capable of making modifications and serving data. A key distinction is that a server implements security while a client can trust an authenticated server and implement changes without checking permissions on an object basis. Secondly, a client may make changes to its cache solely for freeing up space. Such modifications are, of course, not reintegrated. So at present InterMezzo's protocol is very similar to Coda's.
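The client-side validation this implies can be summarized in a short sketch. The flag names follow the paper; the code itself, including the validate helper, is our illustration rather than Lento's actual implementation.

    # Sketch of client-side currency checking (illustrative only).
    use constant { HAS_ATTR => 1, HAS_DATA => 2, HAS_PERMIT => 4 };

    sub object_current {
        my ($obj) = @_;
        # a held call-back lets us reuse the object without network traffic
        return 1 if $obj->{flags} & HAS_DATA;

        # otherwise compare version stamps with the server
        my $server_stamp = validate($obj->{id});   # hypothetical Validate RPC
        if ($server_stamp == $obj->{version}) {
            $obj->{flags} |= HAS_DATA;             # object is current again
            return 1;
        }
        return 0;                                  # caller must re-fetch
    }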
Lento – cache manager and file server

Lento is responsible for handling file service requests from the network or the kernel. Traditional servers and cache managers are implemented using a thread pool model. The requests come in and are placed on a queue. Worker threads pick up the requests, and block while doing I/O, during which other worker threads can continue.

We chose to start with a single threaded model built on asynchronous I/O. In Lento, requests come in and a session is instantiated to handle them. A session is not a thread, but a data structure that contains state and event handlers. I/O is done asynchronously, and sessions learn about the completion of I/O through dispatch of events. The kernel of the event framework activates all event handlers and is also responsible for garbage collecting sessions when no events can reach them. Sessions have identifiers, so that events can be directed to them.
For prototyping we used the POE (Perl Object Environment) toolkit, see [poe]. POE implements a session framework as described above and has Wheels that consist of Drivers to do the I/O and Filters to pre-process the data.

We added two wheels to POE. The PacketWheel is there to send and receive network requests, which the filter unpacks and then gives to the connection. The connection determines the destination session for the request, which can be an existing session or the request dispatcher, in case a new or specially chosen session must handle the request. Our UpcallWheel unpacks requests originating in the kernel, which reach Lento through a character device /dev/presto. The processing of net and kernel requests is graphically indicated in figures 2 and 3.

Figures 2 and 3: processing of network and kernel requests, showing event flows such as got_upcall, get_connection, got_connection, got_wheel and got_error.

As an example we will give the pseudo code for the upcall session servicing a "FetchFile" request from the kernel. The notation below describes a session as a hash of event handlers. Each of the event handlers may engage in a blocking I/O operation, for example to fetch the attributes. In doing so it indicates what event will complete the asynchronous operation. A queue associated with the session allows further requests to fetch the data for the same object to be handled by the session that is already fetching it. Ultimately, when the data is all on the client, the request completes to the kernel by replying to the upcall.
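The session listing itself is not reproduced in this version of the paper; the sketch below reconstructs the idea using today's POE API, which differs in detail from the 1999 interface. The event names follow the figures; the helpers get_connection, fetch_file and reply_to_kernel are hypothetical stand-ins for Lento's internals.

    use POE;

    # A session really is little more than a hash of event handlers.
    POE::Session->create(
        inline_states => {
            _start => sub {
                my ($kernel, $heap, $upcall) = @_[KERNEL, HEAP, ARG0];
                $heap->{upcall} = $upcall;    # the FetchFile upcall
                $kernel->yield('got_upcall');
            },
            got_upcall => sub {
                my $heap = $_[HEAP];
                # ask for a connection; 'got_connection' fires on completion
                get_connection($heap->{upcall}{server}, 'got_connection');
            },
            got_connection => sub {
                my $heap = $_[HEAP];
                # send the FetchFile request; 'got_data' fires when done
                fetch_file($heap->{upcall}{path}, 'got_data');
            },
            got_data  => sub { reply_to_kernel($_[HEAP]{upcall}, 0)  },
            got_error => sub { reply_to_kernel($_[HEAP]{upcall}, -1) },
        },
    );
    POE::Kernel->run();

Each handler initiates an operation and names the event that will complete it, which is exactly the asynchronous completion style discussed earlier.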
Lento also needs a variety of daemon sessions, which are not garbage collected but perform long running services. For example, we have the ReqDispatcher, which sets up new request handling sessions. For the management of client initiated connections, we use Connector daemons, which create a TCP socket, connect to a server, and manage the resulting connection.

Lento also uses various tables: tables exist for servers, volumes, and file system objects, each with their own attributes. When Lento starts a request handling session, these tables are consulted and updated.

Requests are routed with Asynchronous Completion Tokens (ACTs, see [ace]), which encapsulate the source and destination sessions for request handling. There are six types of packet, including requests, messages, replies, data and end-of-data packets.
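A sketch of how an ACT might tie a packet to its sessions follows; this is our illustration of the idea from [ace], not Lento's actual data layout, and the field names are hypothetical.

    # Hypothetical packet carrying an asynchronous completion token
    # (ACT); the ACT records which session sent the request and which
    # session (or the ReqDispatcher) should handle it on arrival.
    my %packet = (
        type => 'request',                    # one of the packet types
        act  => {
            src_session => 42,                # session awaiting completion
            dst_session => 'ReqDispatcher',   # chooses/creates the handler
        },
        body => { opcode => 'FetchFile', path => '/izo/doc/paper.txt' },
    );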
Some of the set-up processes are quite involved. For example, setting up the file system database (Fsdb) may involve recovery. (Currently this consists simply of wiping the cache.) The server semantics supplement the Fsdb database with callback and permit tracking, as these are kept in memory only.

Recovery and cache validation in more depth

On Linux, InterMezzo can use the ext2 file system to hold server and cached data. We envisage having a fast and sophisticated file system database (Fsdb) for extra meta-data. It would be managed by Lento, and would hold the version information for cached objects, as well as inode-to-filename or inode-to-file-identifier translation information. If a client crashes, some of the data will be held in buffers, other data will be on the disk, and it is necessary to think through precisely what is needed for clean recovery.

The simplest recovery process is to simply wipe the cache, which for large caches is clearly very undesirable. The basic attribute for establishing currency of a cached object is the version stamp. If the version stamps of two objects are equal, the objects should be equal, and version changes originating from a single client should be strictly ordered.

Consider the situation of a crash while fetching data into the cache. The situation to avoid is to believe that the cache is up to date, while in fact the data was not fully fetched. To achieve this, the version stamp should not be written to persistent storage until it is certain that the fetched data has made it to the disk. On Linux this can be achieved by simply waiting 30 seconds before flushing the Fsdb database buffers. Upon reconnection to a server, all objects having an out of date version stamp should be flushed from the cache. If recovery is followed by disconnected operation, one could allow the possibly incompletely fetched objects to be used, for maximum availability. Such objects can be identified as those having a ctime less than 30 seconds since the last Fsdb flush. If such objects are modified and reintegrated, they should be regarded as "suspect" and be handled accordingly.
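In Perl, the suspect test could look like the sketch below. This is our illustration: the fsdb_last_flush helper is hypothetical, and only the idea of comparing inode times against the last Fsdb flush comes from the text.

    # Flag possibly incompletely fetched or recorded objects after a
    # crash: anything whose inode change time postdates the last flush
    # of the Fsdb buffers may lack a correct version stamp on disk.
    my $last_flush = fsdb_last_flush();    # hypothetical helper

    sub is_suspect {
        my ($path) = @_;
        my $ctime = (stat $path)[10];      # inode change time
        return $ctime > $last_flush;
    }

    # Suspect objects may still be used (and even modified) while
    # disconnected, but need special handling at reintegration.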
If a client is modifying existing data when crashing, then the reverse risk exists: the client, when recovering, may see an old version stamp in the database, and possibly also an old mtime in the inode. If the object has just been fetched, the version stamp may even be older than that on the server. Yet the data on the client disk may be newer than that on the server. There are several approaches here, the simplest being "laissez faire" and accepting the fact that a few seconds of data modifications might be lost. This is not very desirable, since everything should be done not to lose the latest version of data.

Alternatively, the inode under modification could be updated with a new mtime and synced to disk. Provided modifications do not happen within the granularity of mtime units, this allows the recovery mechanism to identify this file as suspect. The collection of suspect files is defined as those having an mtime later than the last sync time of the Fsdb. Such suspect files can then be handled specially at reintegration.

Yet another solution is to write the Fsdb database record to disk, before starting the modification, indicating that the inode has now changed. The latter involves a context switch to Lento, but since disk traffic is involved anyway, this is probably not a significant overhead; see [bottlenecks]. This detailed and slow initiation of modification procedure would not affect newly created objects.

InterMezzo will first be used in controlled situations, for example to provide replication for WWW data. In such cases handling conflicting updates should not involve user interaction, and we will introduce policies to handle conflicts automatically, for example by giving preference to the server copy and notifying an administrator of a conflict.

InterMezzo on Windows 9x, Windows 2000 and other Unix systems

We believe that the InterMezzo system can fairly easily be ported to Windows 9x and Windows NT/2000, building on the experience we gained when doing this for Coda [coda-win]. Windows NT and Windows 2000 allow sophisticated filtering of file system requests, and allow one to combine them with up-calls. Simple examples of such filter drivers are found in the FileMon utility [filemon]. For Windows 9x an additional difficulty arises from the non-reentrancy of the operating system. Lento will have to be a 32-bit DOS application, just like Coda's cache manager Venus. Fortunately, the Coda project provides the tools to make a port possible, and somewhat amazingly the Coda DOS cache manager works very well.

Porting InterMezzo to other Unix systems should be possible, but might require knowledge only available in the kernel sources of those systems.

Conclusion

With 2,500 lines of C kernel code and 6,000 lines of Perl code (don't worry, no one-liners here) we have managed to implement the basic functionality of the InterMezzo file system. We will continue towards a fully functional, robust file system.

References

[bottlenecks] Peter J. Braam and Philip Nelson. Removing Bottlenecks in Distributed Filesystems: Coda and InterMezzo as Examples. Linux Expo 99.

[dfs-protocols] Peter J. Braam. Distributed File Systems for Clusters: a Protocol Perspective. USENIX Technical Conference, Extreme Linux Track, 1999.

[stackable] John Shelby Heidemann. Stackable Design of File Systems. Technical Report CSD-950032, University of California, Los Angeles, September 1995.

[teapot] Satish Chandra, Brad Richards, and James R. Larus. Teapot: Language Support for Writing Memory Coherence Protocols. In Proceedings of the SIGPLAN '96 Conference on Programming Language Design and Implementation (PLDI), 1996.