mirror of
				https://github.com/KevinMidboe/linguist.git
				synced 2025-10-29 17:50:22 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			1175 lines
		
	
	
		
			50 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			1175 lines
		
	
	
		
			50 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
.\"	$NetBSD: fsinterface.ms,v 1.4 2003/08/07 10:30:42 agc Exp $
 | 
						|
.\"
 | 
						|
.\" Copyright (c) 1986 The Regents of the University of California.
 | 
						|
.\" All rights reserved.
 | 
						|
.\"
 | 
						|
.\" Redistribution and use in source and binary forms, with or without
 | 
						|
.\" modification, are permitted provided that the following conditions
 | 
						|
.\" are met:
 | 
						|
.\" 1. Redistributions of source code must retain the above copyright
 | 
						|
.\"    notice, this list of conditions and the following disclaimer.
 | 
						|
.\" 2. Redistributions in binary form must reproduce the above copyright
 | 
						|
.\"    notice, this list of conditions and the following disclaimer in the
 | 
						|
.\"    documentation and/or other materials provided with the distribution.
 | 
						|
.\" 3. Neither the name of the University nor the names of its contributors
 | 
						|
.\"    may be used to endorse or promote products derived from this software
 | 
						|
.\"    without specific prior written permission.
 | 
						|
.\"
 | 
						|
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 | 
						|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 | 
						|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 | 
						|
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 | 
						|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 | 
						|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 | 
						|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 | 
						|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 | 
						|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 | 
						|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 | 
						|
.\" SUCH DAMAGE.
 | 
						|
.\"
 | 
						|
.\"	@(#)fsinterface.ms	1.4 (Berkeley) 4/16/91
 | 
						|
.\"
 | 
						|
.if \nv .rm CM
 | 
						|
.de UX
 | 
						|
.ie \\n(UX \s-1UNIX\s0\\$1
 | 
						|
.el \{\
 | 
						|
\s-1UNIX\s0\\$1\(dg
 | 
						|
.FS
 | 
						|
\(dg \s-1UNIX\s0 is a registered trademark of AT&T.
 | 
						|
.FE
 | 
						|
.nr UX 1
 | 
						|
.\}
 | 
						|
..
 | 
						|
.TL
 | 
						|
Toward a Compatible Filesystem Interface
 | 
						|
.AU
 | 
						|
Michael J. Karels
 | 
						|
Marshall Kirk McKusick
 | 
						|
.AI
 | 
						|
Computer Systems Research Group
 | 
						|
Computer Science Division
 | 
						|
Department of Electrical Engineering and Computer Science
 | 
						|
University of California, Berkeley
 | 
						|
Berkeley, California  94720
 | 
						|
.AB
 | 
						|
.LP
 | 
						|
As network or remote filesystems have been implemented for
 | 
						|
.UX ,
 | 
						|
several stylized interfaces between the filesystem implementation
 | 
						|
and the rest of the kernel have been developed.
 | 
						|
.FS
 | 
						|
This is an update of a paper originally presented
 | 
						|
at the September 1986 conference of the European
 | 
						|
.UX
 | 
						|
Users' Group.
 | 
						|
Last modified April 16, 1991.
 | 
						|
.FE
 | 
						|
Notable among these are Sun Microsystems' Virtual Filesystem interface (VFS)
 | 
						|
using vnodes, Digital Equipment's Generic File System (GFS) architecture,
 | 
						|
and AT&T's File System Switch (FSS).
 | 
						|
Each design attempts to isolate filesystem-dependent details
 | 
						|
below a generic interface and to provide a framework within which
 | 
						|
new filesystems may be incorporated.
 | 
						|
However, each of these interfaces is different from
 | 
						|
and incompatible with the others.
 | 
						|
Each of them addresses somewhat different design goals.
 | 
						|
Each was based on a different starting version of
 | 
						|
.UX ,
 | 
						|
targetted a different set of filesystems with varying characteristics,
 | 
						|
and uses a different set of primitive operations provided by the filesystem.
 | 
						|
The current study compares the various filesystem interfaces.
 | 
						|
Criteria for comparison include generality, completeness, robustness,
 | 
						|
efficiency and esthetics.
 | 
						|
Several of the underlying design issues are examined in detail.
 | 
						|
As a result of this comparison, a proposal for a new filesystem interface
 | 
						|
is advanced that includes the best features of the existing implementations.
 | 
						|
The proposal adopts the calling convention for name lookup introduced
 | 
						|
in 4.3BSD, but is otherwise closely related to Sun's VFS.
 | 
						|
A prototype implementation is now being developed at Berkeley.
 | 
						|
This proposal and the rationale underlying its development
 | 
						|
have been presented to major software vendors
 | 
						|
as an early step toward convergence on a compatible filesystem interface.
 | 
						|
.AE
 | 
						|
.SH
 | 
						|
Introduction
 | 
						|
.PP
 | 
						|
As network communications and workstation environments
 | 
						|
became common elements in
 | 
						|
.UX
 | 
						|
systems, several vendors of
 | 
						|
.UX
 | 
						|
systems have designed and built network file systems
 | 
						|
that allow client process on one
 | 
						|
.UX
 | 
						|
machine to access files on a server machine.
 | 
						|
Examples include Sun's Network File System, NFS [Sandberg85],
 | 
						|
AT&T's recently-announced Remote File Sharing, RFS [Rifkin86],
 | 
						|
the LOCUS distributed filesystem [Walker85],
 | 
						|
and Masscomp's extended filesystem [Cole85].
 | 
						|
Other remote filesystems have been implemented in research or university groups
 | 
						|
for internal use, notably the network filesystem in the Eighth Edition
 | 
						|
.UX
 | 
						|
system [Weinberger84] and two different filesystems used at Carnegie-Mellon
 | 
						|
University [Satyanarayanan85].
 | 
						|
Numerous other remote file access methods have been devised for use
 | 
						|
within individual
 | 
						|
.UX
 | 
						|
processes,
 | 
						|
many of them by modifications to the C I/O library
 | 
						|
similar to those in the Newcastle Connection [Brownbridge82].
 | 
						|
.PP
 | 
						|
Multiple network filesystems may frequently
 | 
						|
be found in use within a single organization.
 | 
						|
These circumstances make it highly desirable to be able to transport filesystem
 | 
						|
implementations from one system to another.
 | 
						|
Such portability is considerably enhanced by the use of a stylized interface
 | 
						|
with carefully-defined entry points to separate the filesystem from the rest
 | 
						|
of the operating system.
 | 
						|
This interface should be similar to the interface between device drivers
 | 
						|
and the kernel.
 | 
						|
Although varying somewhat among the common versions of
 | 
						|
.UX ,
 | 
						|
the device driver interfaces are sufficiently similar that device drivers
 | 
						|
may be moved from one system to another without major problems.
 | 
						|
A clean, well-defined interface to the filesystem also allows a single
 | 
						|
system to support multiple local filesystem types.
 | 
						|
.PP
 | 
						|
For reasons such as these, several filesystem interfaces have been used
 | 
						|
when integrating new filesystems into the system.
 | 
						|
The best-known of these are Sun Microsystems' Virtual File System interface,
 | 
						|
VFS [Kleiman86], and AT&T's File System Switch, FSS.
 | 
						|
Another interface, known as the Generic File System, GFS,
 | 
						|
has been implemented for the ULTRIX\(dd
 | 
						|
.FS
 | 
						|
\(dd ULTRIX is a trademark of Digital Equipment Corp.
 | 
						|
.FE
 | 
						|
system by Digital [Rodriguez86].
 | 
						|
There are numerous differences among these designs.
 | 
						|
The differences may be understood from the varying philosophies
 | 
						|
and design goals of the groups involved, from the systems under which
 | 
						|
the implementations were done, and from the filesystems originally targetted
 | 
						|
by the designs.
 | 
						|
These differences are summarized in the following sections
 | 
						|
within the limitations of the published specifications.
 | 
						|
.SH
 | 
						|
Design goals
 | 
						|
.PP
 | 
						|
There are several design goals which, in varying degrees,
 | 
						|
have driven the various designs.
 | 
						|
Each attempts to divide the filesystem into a filesystem-type-independent
 | 
						|
layer and individual filesystem implementations.
 | 
						|
The division between these layers occurs at somewhat different places
 | 
						|
in these systems, reflecting different views of the diversity and types
 | 
						|
of the filesystems that may be accommodated.
 | 
						|
Compatibility with existing local filesystems has varying importance;
 | 
						|
at the user-process level, each attempts to be completely transparent
 | 
						|
except for a few filesystem-related system management programs.
 | 
						|
The AT&T interface also makes a major effort to retain familiar internal
 | 
						|
system interfaces, and even to retain object-file-level binary compatibility
 | 
						|
with operating system modules such as device drivers.
 | 
						|
Both Sun and DEC were willing to change internal data structures and interfaces
 | 
						|
so that other operating system modules might require recompilation
 | 
						|
or source-code modification.
 | 
						|
.PP
 | 
						|
AT&T's interface both allows and requires filesystems to support the full
 | 
						|
and exact semantics of their previous filesystem,
 | 
						|
including interruptions of system calls on slow operations.
 | 
						|
System calls that deal with remote files are encapsulated
 | 
						|
with their environment and sent to a server where execution continues.
 | 
						|
The system call may be aborted by either client or server, returning
 | 
						|
control to the client.
 | 
						|
Most system calls that descend into the file-system dependent layer
 | 
						|
of a filesystem other than the standard local filesystem do not return
 | 
						|
to the higher-level kernel calling routines.
 | 
						|
Instead, the filesystem-dependent code completes the requested
 | 
						|
operation and then executes a non-local goto (\fIlongjmp\fP) to exit the
 | 
						|
system call.
 | 
						|
These efforts to avoid modification of main-line kernel code
 | 
						|
indicate a far greater emphasis on internal compatibility than on modularity,
 | 
						|
clean design, or efficiency.
 | 
						|
.PP
 | 
						|
In contrast, the Sun VFS interface makes major modifications to the internal
 | 
						|
interfaces in the kernel, with a very clear separation
 | 
						|
of filesystem-independent and -dependent data structures and operations.
 | 
						|
The semantics of the filesystem are largely retained for local operations,
 | 
						|
although this is achieved at some expense where it does not fit the internal
 | 
						|
structuring well.
 | 
						|
The filesystem implementations are not required to support the same
 | 
						|
semantics as local
 | 
						|
.UX
 | 
						|
filesystems.
 | 
						|
Several historical features of
 | 
						|
.UX
 | 
						|
filesystem behavior are difficult to achieve using the VFS interface,
 | 
						|
including the atomicity of file and link creation and the use of open files
 | 
						|
whose names have been removed.
 | 
						|
.PP
 | 
						|
A major design objective of Sun's network filesystem,
 | 
						|
statelessness,
 | 
						|
permeates the VFS interface.
 | 
						|
No locking may be done in the filesystem-independent layer,
 | 
						|
and locking in the filesystem-dependent layer may occur only during
 | 
						|
a single call into that layer.
 | 
						|
.PP
 | 
						|
A final design goal of most implementors is performance.
 | 
						|
For remote filesystems,
 | 
						|
this goal tends to be in conflict with the goals of complete semantic
 | 
						|
consistency, compatibility and modularity.
 | 
						|
Sun has chosen performance over modularity in some areas,
 | 
						|
but has emphasized clean separation of the layers within the filesystem
 | 
						|
at the expense of performance.
 | 
						|
Although the performance of RFS is yet to be seen,
 | 
						|
AT&T seems to have considered compatibility far more important than modularity
 | 
						|
or performance.
 | 
						|
.SH
 | 
						|
Differences among filesystem interfaces
 | 
						|
.PP
 | 
						|
The existing filesystem interfaces may be characterized
 | 
						|
in several ways.
 | 
						|
Each system is centered around a few data structures or objects,
 | 
						|
along with a set of primitives for performing operations upon these objects.
 | 
						|
In the original
 | 
						|
.UX
 | 
						|
filesystem [Ritchie74],
 | 
						|
the basic object used by the filesystem is the inode, or index node.
 | 
						|
The inode contains all of the information about a file except its name:
 | 
						|
its type, identification, ownership, permissions, timestamps and location.
 | 
						|
Inodes are identified by the filesystem device number and the index within
 | 
						|
the filesystem.
 | 
						|
The major entry points to the filesystem are \fInamei\fP,
 | 
						|
which translates a filesystem pathname into the underlying inode,
 | 
						|
and \fIiget\fP, which locates an inode by number and installs it in the in-core
 | 
						|
inode table.
 | 
						|
\fINamei\fP performs name translation by iterative lookup
 | 
						|
of each component name in its directory to find its inumber,
 | 
						|
then using \fIiget\fP to return the actual inode.
 | 
						|
If the last component has been reached, this inode is returned;
 | 
						|
otherwise, the inode describes the next directory to be searched.
 | 
						|
The inode returned may be used in various ways by the caller;
 | 
						|
it may be examined, the file may be read or written,
 | 
						|
types and access may be checked, and fields may be modified.
 | 
						|
Modified inodes are automatically written back the filesystem
 | 
						|
on disk when the last reference is released with \fIiput\fP.
 | 
						|
Although the details are considerably different,
 | 
						|
the same general scheme is used in the faster filesystem in 4.2BSD
 | 
						|
.UX
 | 
						|
[Mckusick85].
 | 
						|
.PP
 | 
						|
Both the AT&T interface and, to a lesser extent, the DEC interface
 | 
						|
attempt to preserve the inode-oriented interface.
 | 
						|
Each modify the inode to allow different varieties of the structure
 | 
						|
for different filesystem types by separating the filesystem-dependent
 | 
						|
parts of the inode into a separate structure or one arm of a union.
 | 
						|
Both interfaces allow operations
 | 
						|
equivalent to the \fInamei\fP and \fIiget\fP operations
 | 
						|
of the old filesystem to be performed in the filesystem-independent
 | 
						|
layer, with entry points to the individual filesystem implementations to support
 | 
						|
the type-specific parts of these operations.  Implicit in this interface
 | 
						|
is that files may be conveniently be named by and located using a single
 | 
						|
index within a filesystem.
 | 
						|
The GFS provides specific entry points to the filesystems
 | 
						|
to change most file properties rather than allowing arbitrary changes
 | 
						|
to be made to the generic part of the inode.
 | 
						|
.PP
 | 
						|
In contrast, the Sun VFS interface replaces the inode as the primary object
 | 
						|
with the vnode.
 | 
						|
The vnode contains no filesystem-dependent fields except the pointer
 | 
						|
to the set of operations implemented by the filesystem.
 | 
						|
Properties of a vnode that might be transient, such as the ownership,
 | 
						|
permissions, size and timestamps, are maintained by the lower layer.
 | 
						|
These properties may be presented in a generic format upon request;
 | 
						|
callers are expected not to hold this information for any length of time,
 | 
						|
as they may not be up-to-date later on.
 | 
						|
The vnode operations do not include a corollary for \fIiget\fP;
 | 
						|
the only external interface for obtaining vnodes for specific files
 | 
						|
is the name lookup operation.
 | 
						|
(Separate procedures are provided outside of this interface
 | 
						|
that obtain a ``file handle'' for a vnode which may be given
 | 
						|
to a client by a server, such that the vnode may be retrieved
 | 
						|
upon later presentation of the file handle.)
 | 
						|
.SH
 | 
						|
Name translation issues
 | 
						|
.PP
 | 
						|
Each of the systems described include a mechanism for performing
 | 
						|
pathname-to-internal-representation translation.
 | 
						|
The style of the name translation function is very different in all
 | 
						|
three systems.
 | 
						|
As described above, the AT&T and DEC systems retain the \fInamei\fP function.
 | 
						|
The two are quite different, however, as the ULTRIX interface uses
 | 
						|
the \fInamei\fP calling convention introduced in 4.3BSD.
 | 
						|
The parameters and context for the name lookup operation
 | 
						|
are collected in a \fInameidata\fP structure which is passed to \fInamei\fP
 | 
						|
for operation.
 | 
						|
Intent to create or delete the named file is declared in advance,
 | 
						|
so that the final directory scan in \fInamei\fP may retain information
 | 
						|
such as the offset in the directory at which the modification will be made.
 | 
						|
Filesystems that use such mechanisms to avoid redundant work
 | 
						|
must therefore lock the directory to be modified so that it may not
 | 
						|
be modified by another process before completion.
 | 
						|
In the System V filesystem, as in previous versions of
 | 
						|
.UX ,
 | 
						|
this information is stored in the per-process \fIuser\fP structure
 | 
						|
by \fInamei\fP for use by a low-level routine called after performing
 | 
						|
the actual creation or deletion of the file itself.
 | 
						|
In 4.3BSD and in the GFS interface, these side effects of \fInamei\fP
 | 
						|
are stored in the \fInameidata\fP structure given as argument to \fInamei\fP,
 | 
						|
which is also presented to the routine implementing file creation or deletion.
 | 
						|
.PP
 | 
						|
The ULTRIX \fInamei\fP routine is responsible for the generic
 | 
						|
parts of the name translation process, such as copying the name into
 | 
						|
an internal buffer, validating it, interpolating
 | 
						|
the contents of symbolic links, and indirecting at mount points.
 | 
						|
As in 4.3BSD, the name is copied into the buffer in a single call,
 | 
						|
according to the location of the name.
 | 
						|
After determining the type of the filesystem at the start of translation
 | 
						|
(the current directory or root directory), it calls the filesystem's
 | 
						|
\fInamei\fP entry with the same structure it received from its caller.
 | 
						|
The filesystem-specific routine translates the name, component by component,
 | 
						|
as long as no mount points are reached.
 | 
						|
It may return after any number of components have been processed.
 | 
						|
\fINamei\fP performs any processing at mount points, then calls
 | 
						|
the correct translation routine for the next filesystem.
 | 
						|
Network filesystems may pass the remaining pathname to a server for translation,
 | 
						|
or they may look up the pathname components one at a time.
 | 
						|
The former strategy would be more efficient,
 | 
						|
but the latter scheme allows mount points within a remote filesystem
 | 
						|
without server knowledge of all client mounts.
 | 
						|
.PP
 | 
						|
The AT&T \fInamei\fP interface is presumably the same as that in previous
 | 
						|
.UX
 | 
						|
systems, accepting the name of a routine to fetch pathname characters
 | 
						|
and an operation (one of: lookup, lookup for creation, or lookup for deletion).
 | 
						|
It translates, component by component, as before.
 | 
						|
If it detects that a mount point crosses to a remote filesystem,
 | 
						|
it passes the remainder of the pathname to the remote server.
 | 
						|
A pathname-oriented request other than open may be completed
 | 
						|
within the \fInamei\fP call,
 | 
						|
avoiding return to the (unmodified) system call handler
 | 
						|
that called \fInamei\fP.
 | 
						|
.PP
 | 
						|
In contrast to the first two systems, Sun's VFS interface has replaced
 | 
						|
\fInamei\fP with \fIlookupname\fP.
 | 
						|
This routine simply calls a new pathname-handling module to allocate
 | 
						|
a pathname buffer and copy in the pathname (copying a character per call),
 | 
						|
then calls \fIlookuppn\fP.
 | 
						|
\fILookuppn\fP performs the iteration over the directories leading
 | 
						|
to the destination file; it copies each pathname component to a local buffer,
 | 
						|
then calls the filesystem \fIlookup\fP entry to locate the vnode
 | 
						|
for that file in the current directory.
 | 
						|
Per-filesystem \fIlookup\fP routines may translate only one component
 | 
						|
per call.
 | 
						|
For creation and deletion of new files, the lookup operation is unmodified;
 | 
						|
the lookup of the final component only serves to check for the existence
 | 
						|
of the file.
 | 
						|
The subsequent creation or deletion call, if any, must repeat the final
 | 
						|
name translation and associated directory scan.
 | 
						|
For new file creation in particular, this is rather inefficient,
 | 
						|
as file creation requires two complete scans of the directory.
 | 
						|
.PP
 | 
						|
Several of the important performance improvements in 4.3BSD
 | 
						|
were related to the name translation process [McKusick85][Leffler84].
 | 
						|
The following changes were made:
 | 
						|
.IP 1. 4
 | 
						|
A system-wide cache of recent translations is maintained.
 | 
						|
The cache is separate from the inode cache, so that multiple names
 | 
						|
for a file may be present in the cache.
 | 
						|
The cache does not hold ``hard'' references to the inodes,
 | 
						|
so that the normal reference pattern is not disturbed.
 | 
						|
.IP 2.
 | 
						|
A per-process cache is kept of the directory and offset
 | 
						|
at which the last successful name lookup was done.
 | 
						|
This allows sequential lookups of all the entries in a directory to be done
 | 
						|
in linear time.
 | 
						|
.IP 3.
 | 
						|
The entire pathname is copied into a kernel buffer in a single operation,
 | 
						|
rather than using two subroutine calls per character.
 | 
						|
.IP 4.
 | 
						|
A pool of pathname buffers are held by \fInamei\fP, avoiding allocation
 | 
						|
overhead.
 | 
						|
.LP
 | 
						|
All of these performance improvements from 4.3BSD are well worth using
 | 
						|
within a more generalized filesystem framework.
 | 
						|
The generalization of the structure may otherwise make an already-expensive
 | 
						|
function even more costly.
 | 
						|
Most of these improvements are present in the GFS system, as it derives
 | 
						|
from the beta-test version of 4.3BSD.
 | 
						|
The Sun system uses a name-translation cache generally like that in 4.3BSD.
 | 
						|
The name cache is a filesystem-independent facility provided for the use
 | 
						|
of the filesystem-specific lookup routines.
 | 
						|
The Sun cache, like that first used at Berkeley but unlike that in 4.3,
 | 
						|
holds a ``hard'' reference to the vnode (increments the reference count).
 | 
						|
The ``soft'' reference scheme in 4.3BSD cannot be used with the current
 | 
						|
NFS implementation, as NFS allocates vnodes dynamically and frees them
 | 
						|
when the reference count returns to zero rather than caching them.
 | 
						|
As a result, fewer names may be held in the cache
 | 
						|
than (local filesystem) vnodes, and the cache distorts the normal reference
 | 
						|
patterns otherwise seen by the LRU cache.
 | 
						|
As the name cache references overflow the local filesystem inode table,
 | 
						|
the name cache must be purged to make room in the inode table.
 | 
						|
Also, to determine whether a vnode is in use (for example,
 | 
						|
before mounting upon it), the cache must be flushed to free any
 | 
						|
cache reference.
 | 
						|
These problems should be corrected
 | 
						|
by the use of the soft cache reference scheme.
 | 
						|
.PP
 | 
						|
A final observation on the efficiency of name translation in the current
 | 
						|
Sun VFS architecture is that the number of subroutine calls used
 | 
						|
by a multi-component name lookup is dramatically larger
 | 
						|
than in the other systems.
 | 
						|
The name lookup scheme in GFS suffers from this problem much less,
 | 
						|
at no expense in violation of layering.
 | 
						|
.PP
 | 
						|
A final problem to be considered is synchronization and consistency.
 | 
						|
As the filesystem operations are more stylized and broken into separate
 | 
						|
entry points for parts of operations, it is more difficult to guarantee
 | 
						|
consistency throughout an operation and/or to synchronize with other
 | 
						|
processes using the same filesystem objects.
 | 
						|
The Sun interface suffers most severely from this,
 | 
						|
as it forbids the filesystems from locking objects across calls
 | 
						|
to the filesystem.
 | 
						|
It is possible that a file may be created between the time that a lookup
 | 
						|
is performed and a subsequent creation is requested.
 | 
						|
Perhaps more strangely, after a lookup fails to find the target
 | 
						|
of a creation attempt, the actual creation might find that the target
 | 
						|
now exists and is a symbolic link.
 | 
						|
The call will either fail unexpectedly, as the target is of the wrong type,
 | 
						|
or the generic creation routine will have to note the error
 | 
						|
and restart the operation from the lookup.
 | 
						|
This problem will always exist in a stateless filesystem,
 | 
						|
but the VFS interface forces all filesystems to share the problem.
 | 
						|
This restriction against locking between calls also
 | 
						|
forces duplication of work during file creation and deletion.
 | 
						|
This is considered unacceptable.
 | 
						|
.SH
 | 
						|
Support facilities and other interactions
 | 
						|
.PP
 | 
						|
Several support facilities are used by the current
 | 
						|
.UX
 | 
						|
filesystem and require generalization for use by other filesystem types.
 | 
						|
For filesystem implementations to be portable,
 | 
						|
it is desirable that these modified support facilities
 | 
						|
should also have a uniform interface and 
 | 
						|
behave in a consistent manner in target systems.
 | 
						|
A prominent example is the filesystem buffer cache.
 | 
						|
The buffer cache in a standard (System V or 4.3BSD)
 | 
						|
.UX
 | 
						|
system contains physical disk blocks with no reference to the files containing
 | 
						|
them.
 | 
						|
This works well for the local filesystem, but has obvious problems
 | 
						|
for remote filesystems.
 | 
						|
Sun has modified the buffer cache routines to describe buffers by vnode
 | 
						|
rather than by device.
 | 
						|
For remote files, the vnode used is that of the file, and the block
 | 
						|
numbers are virtual data blocks.
 | 
						|
For local filesystems, a vnode for the block device is used for cache reference,
 | 
						|
and the block numbers are filesystem physical blocks.
 | 
						|
Use of per-file cache description does not easily accommodate
 | 
						|
caching of indirect blocks, inode blocks, superblocks or cylinder group blocks.
 | 
						|
However, the vnode describing the block device for the cache
 | 
						|
is one created internally,
 | 
						|
rather than the vnode for the device looked up when mounting,
 | 
						|
and it is located by searching a private list of vnodes
 | 
						|
rather than by holding it in the mount structure.
 | 
						|
Although the Sun modification makes it possible to use the buffer
 | 
						|
cache for data blocks of remote files, a better generalization
 | 
						|
of the buffer cache is needed.
 | 
						|
.PP
 | 
						|
The RFS filesystem used by AT&T does not currently cache data blocks
 | 
						|
on client systems, thus the buffer cache is probably unmodified.
 | 
						|
The form of the buffer cache in ULTRIX is unknown to us.
 | 
						|
.PP
 | 
						|
Another subsystem that has a large interaction with the filesystem
 | 
						|
is the virtual memory system.
 | 
						|
The virtual memory system must read data from the filesystem
 | 
						|
to satisfy fill-on-demand page faults.
 | 
						|
For efficiency, this read call is arranged to place the data directly
 | 
						|
into the physical pages assigned to the process (a ``raw'' read) to avoid
 | 
						|
copying the data.
 | 
						|
Although the read operation normally bypasses the filesystem buffer cache,
 | 
						|
consistency must be maintained by checking the buffer cache and copying
 | 
						|
or flushing modified data not yet stored on disk.
 | 
						|
The 4.2BSD virtual memory system, like that of Sun and ULTRIX,
 | 
						|
maintains its own cache of reusable text pages.
 | 
						|
This creates additional complications.
 | 
						|
As the virtual memory systems are redesigned, these problems should be
 | 
						|
resolved by reading through the buffer cache, then mapping the cached
 | 
						|
data into the user address space.
 | 
						|
If the buffer cache or the process pages are changed while the other reference
 | 
						|
remains, the data would have to be copied (``copy-on-write'').
 | 
						|
.PP
 | 
						|
In the meantime, the current virtual memory systems must be used
 | 
						|
with the new filesystem framework.
 | 
						|
Both the Sun and AT&T filesystem interfaces
 | 
						|
provide entry points to the filesystem for optimization of the virtual
 | 
						|
memory system by performing logical-to-physical block number translation
 | 
						|
when setting up a fill-on-demand image for a process.
 | 
						|
The VFS provides a vnode operation analogous to the \fIbmap\fP function of the
 | 
						|
.UX
 | 
						|
filesystem.
 | 
						|
Given a vnode and logical block number, it returns a vnode and block number
 | 
						|
which may be read to obtain the data.
 | 
						|
If the filesystem is local, it returns the private vnode for the block device
 | 
						|
and the physical block number.
 | 
						|
As the \fIbmap\fP operations are all performed at one time, during process
 | 
						|
startup, any indirect blocks for the file will remain in the cache
 | 
						|
after they are once read.
 | 
						|
In addition, the interface provides a \fIstrategy\fP entry that may be used
 | 
						|
for ``raw'' reads from a filesystem device,
 | 
						|
used to read data blocks into an address space without copying.
 | 
						|
This entry uses a buffer header (\fIbuf\fP structure)
 | 
						|
to describe the I/O operation
 | 
						|
instead of a \fIuio\fP structure.
 | 
						|
The buffer-style interface is the same as that used by disk drivers internally.
 | 
						|
This difference allows the current \fIuio\fP primitives to be avoided,
 | 
						|
as they copy all data to/from the current user process address space.
 | 
						|
Instead, for local filesystems these operations could be done internally
 | 
						|
with the standard raw disk read routines,
 | 
						|
which use a \fIuio\fP interface.
 | 
						|
When loading from a remote filesystems,
 | 
						|
the data will be received in a network buffer.
 | 
						|
If network buffers are suitably aligned,
 | 
						|
the data may be mapped into the process address space by a page swap
 | 
						|
without copying.
 | 
						|
In either case, it should be possible to use the standard filesystem
 | 
						|
read entry from the virtual memory system.
 | 
						|
.PP
 | 
						|
Other issues that must be considered in devising a portable
 | 
						|
filesystem implementation include kernel memory allocation,
 | 
						|
the implicit use of user-structure global context,
 | 
						|
which may create problems with reentrancy,
 | 
						|
the style of the system call interface,
 | 
						|
and the conventions for synchronization
 | 
						|
(sleep/wakeup, handling of interrupted system calls, semaphores).
 | 
						|
.SH
 | 
						|
The Berkeley Proposal
 | 
						|
.PP
 | 
						|
The Sun VFS interface has been most widely used of the three described here.
 | 
						|
It is also the most general of the three, in that filesystem-specific
 | 
						|
data and operations are best separated from the generic layer.
 | 
						|
Although it has several disadvantages which were described above,
 | 
						|
most of them may be corrected with minor changes to the interface
 | 
						|
(and, in a few areas, philosophical changes).
 | 
						|
The DEC GFS has other advantages, in particular the use of the 4.3BSD
 | 
						|
\fInamei\fP interface and optimizations.
 | 
						|
It allows single or multiple components of a pathname
 | 
						|
to be translated in a single call to the specific filesystem
 | 
						|
and thus accommodates filesystems with either preference.
 | 
						|
The FSS is least well understood, as there is little public information
 | 
						|
about the interface.
 | 
						|
However, the design goals are the least consistent with those of the Berkeley
 | 
						|
research groups.
 | 
						|
Accordingly, a new filesystem interface has been devised to avoid
 | 
						|
some of the problems in the other systems.
 | 
						|
The proposed interface derives directly from Sun's VFS,
 | 
						|
but, like GFS, uses a 4.3BSD-style name lookup interface.
 | 
						|
Additional context information has been moved from the \fIuser\fP structure
 | 
						|
to the \fInameidata\fP structure so that name translation may be independent
 | 
						|
of the global context of a user process.
 | 
						|
This is especially desired in any system where kernel-mode servers
 | 
						|
operate as light-weight or interrupt-level processes,
 | 
						|
or where a server may store or cache context for several clients.
 | 
						|
This calling interface has the additional advantage
 | 
						|
that the call parameters need not all be pushed onto the stack for each call
 | 
						|
through the filesystem interface,
 | 
						|
and they may be accessed using short offsets from a base pointer
 | 
						|
(unlike global variables in the \fIuser\fP structure).
 | 
						|
.PP
 | 
						|
The proposed filesystem interface is described very tersely here.
 | 
						|
For the most part, data structures and procedures are analogous
 | 
						|
to those used by VFS, and only the changes will be be treated here.
 | 
						|
See [Kleiman86] for complete descriptions of the vfs and vnode operations
 | 
						|
in Sun's interface.
 | 
						|
.PP
 | 
						|
The central data structure for name translation is the \fInameidata\fP
 | 
						|
structure.
 | 
						|
The same structure is used to pass parameters to \fInamei\fP,
 | 
						|
to pass these same parameters to filesystem-specific lookup routines,
 | 
						|
to communicate completion status from the lookup routines back to \fInamei\fP,
 | 
						|
and to return completion status to the calling routine.
 | 
						|
For creation or deletion requests, the parameters to the filesystem operation
 | 
						|
to complete the request are also passed in this same structure.
 | 
						|
The form of the \fInameidata\fP structure is:
 | 
						|
.br
 | 
						|
.ne 2i
 | 
						|
.ID
 | 
						|
.nf
 | 
						|
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Encapsulation of namei parameters.
 | 
						|
 * One of these is located in the u. area to
 | 
						|
 * minimize space allocated on the kernel stack
 | 
						|
 * and to retain per-process context.
 | 
						|
 */
 | 
						|
struct nameidata {
 | 
						|
		/* arguments to namei and related context: */
 | 
						|
	caddr_t	ni_dirp;		/* pathname pointer */
 | 
						|
	enum	uio_seg ni_seg;		/* location of pathname */
 | 
						|
	short	ni_nameiop;		/* see below */
 | 
						|
	struct	vnode *ni_cdir;		/* current directory */
 | 
						|
	struct	vnode *ni_rdir;		/* root directory, if not normal root */
 | 
						|
	struct	ucred *ni_cred;		/* credentials */
 | 
						|
 | 
						|
		/* shared between namei, lookup routines and commit routines: */
 | 
						|
	caddr_t	ni_pnbuf;		/* pathname buffer */
 | 
						|
	char	*ni_ptr;		/* current location in pathname */
 | 
						|
	int	ni_pathlen;		/* remaining chars in path */
 | 
						|
	short	ni_more;		/* more left to translate in pathname */
 | 
						|
	short	ni_loopcnt;		/* count of symlinks encountered */
 | 
						|
 | 
						|
		/* results: */
 | 
						|
	struct	vnode *ni_vp;		/* vnode of result */
 | 
						|
	struct	vnode *ni_dvp;		/* vnode of intermediate directory */
 | 
						|
 | 
						|
/* BEGIN UFS SPECIFIC */
 | 
						|
	struct diroffcache {		/* last successful directory search */
 | 
						|
		struct	vnode *nc_prevdir;	/* terminal directory */
 | 
						|
		long	nc_id;			/* directory's unique id */
 | 
						|
		off_t	nc_prevoffset;		/* where last entry found */
 | 
						|
	} ni_nc;
 | 
						|
/* END UFS SPECIFIC */
 | 
						|
};
 | 
						|
.DE
 | 
						|
.DS
 | 
						|
.ta \w'#define\0\0'u +\w'WANTPARENT\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * namei operations and modifiers
 | 
						|
 */
 | 
						|
#define	LOOKUP	0	/* perform name lookup only */
 | 
						|
#define	CREATE	1	/* setup for file creation */
 | 
						|
#define	DELETE	2	/* setup for file deletion */
 | 
						|
#define	WANTPARENT	0x10	/* return parent directory vnode also */
 | 
						|
#define	NOCACHE	0x20	/* name must not be left in cache */
 | 
						|
#define	FOLLOW	0x40	/* follow symbolic links */
 | 
						|
#define	NOFOLLOW	0x0	/* don't follow symbolic links (pseudo) */
 | 
						|
.DE
 | 
						|
As in current systems other than Sun's VFS, \fInamei\fP is called
 | 
						|
with an operation request, one of LOOKUP, CREATE or DELETE.
 | 
						|
For a LOOKUP, the operation is exactly like the lookup in VFS.
 | 
						|
CREATE and DELETE allow the filesystem to ensure consistency
 | 
						|
by locking the parent inode (private to the filesystem),
 | 
						|
and (for the local filesystem) to avoid duplicate directory scans
 | 
						|
by storing the new directory entry and its offset in the directory
 | 
						|
in the \fIndirinfo\fP structure.
 | 
						|
This is intended to be opaque to the filesystem-independent levels.
 | 
						|
Not all lookups for creation or deletion are actually followed
 | 
						|
by the intended operation; permission may be denied, the filesystem
 | 
						|
may be read-only, etc.
 | 
						|
Therefore, an entry point to the filesystem is provided
 | 
						|
to abort a creation or deletion operation
 | 
						|
and allow release of any locked internal data.
 | 
						|
After a \fInamei\fP with a CREATE or DELETE flag, the pathname pointer
 | 
						|
is set to point to the last filename component.
 | 
						|
Filesystems that choose to implement creation or deletion entirely
 | 
						|
within the subsequent call to a create or delete entry
 | 
						|
are thus free to do so.
 | 
						|
.PP
 | 
						|
The \fInameidata\fP is used to store context used during name translation.
 | 
						|
The current and root directories for the translation are stored here.
 | 
						|
For the local filesystem, the per-process directory offset cache
 | 
						|
is also kept here.
 | 
						|
A file server could leave the directory offset cache empty,
 | 
						|
could use a single cache for all clients,
 | 
						|
or could hold caches for several recent clients.
 | 
						|
.PP
 | 
						|
Several other data structures are used in the filesystem operations.
 | 
						|
One is the \fIucred\fP structure which describes a client's credentials
 | 
						|
to the filesystem.
 | 
						|
This is modified slightly from the Sun structure;
 | 
						|
the ``accounting'' group ID has been merged into the groups array.
 | 
						|
The actual number of groups in the array is given explicitly
 | 
						|
to avoid use of a reserved group ID as a terminator.
 | 
						|
Also, typedefs introduced in 4.3BSD for user and group ID's have been used.
 | 
						|
The \fIucred\fP structure is thus:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Credentials.
 | 
						|
 */
 | 
						|
struct ucred {
 | 
						|
	u_short	cr_ref;			/* reference count */
 | 
						|
	uid_t	cr_uid;			/* effective user id */
 | 
						|
	short	cr_ngroups;		/* number of groups */
 | 
						|
	gid_t	cr_groups[NGROUPS];	/* groups */
 | 
						|
	/*
 | 
						|
	 * The following either should not be here,
 | 
						|
	 * or should be treated as opaque.
 | 
						|
	 */
 | 
						|
	uid_t   cr_ruid;		/* real user id */
 | 
						|
	gid_t   cr_svgid;		/* saved set-group id */
 | 
						|
};
 | 
						|
.DE
 | 
						|
.PP
 | 
						|
A final structure used by the filesystem interface is the \fIuio\fP
 | 
						|
structure mentioned earlier.
 | 
						|
This structure describes the source or destination of an I/O
 | 
						|
operation, with provision for scatter/gather I/O.
 | 
						|
It is used in the read and write entries to the filesystem.
 | 
						|
The \fIuio\fP structure presented here is modified from the one
 | 
						|
used in 4.2BSD to specify the location of each vector of the operation
 | 
						|
(user or kernel space)
 | 
						|
and to allow an alternate function to be used to implement the data movement.
 | 
						|
The alternate function might perform page remapping rather than a copy,
 | 
						|
for example.
 | 
						|
.DS
 | 
						|
.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Description of an I/O operation which potentially
 | 
						|
 * involves scatter-gather, with individual sections
 | 
						|
 * described by iovec, below.  uio_resid is initially
 | 
						|
 * set to the total size of the operation, and is
 | 
						|
 * decremented as the operation proceeds.  uio_offset
 | 
						|
 * is incremented by the amount of each operation.
 | 
						|
 * uio_iov is incremented and uio_iovcnt is decremented
 | 
						|
 * after each vector is processed.
 | 
						|
 */
 | 
						|
struct uio {
 | 
						|
	struct	iovec *uio_iov;
 | 
						|
	int	uio_iovcnt;
 | 
						|
	off_t	uio_offset;
 | 
						|
	int	uio_resid;
 | 
						|
	enum	uio_rw uio_rw;
 | 
						|
};
 | 
						|
 | 
						|
enum	uio_rw { UIO_READ, UIO_WRITE };
 | 
						|
.DE
 | 
						|
.DS
 | 
						|
.ta .5i +\w'caddr_t\0\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Description of a contiguous section of an I/O operation.
 | 
						|
 * If iov_op is non-null, it is called to implement the copy
 | 
						|
 * operation, possibly by remapping, with the call
 | 
						|
 *	(*iov_op)(from, to, count);
 | 
						|
 * where from and to are caddr_t and count is int.
 | 
						|
 * Otherwise, the copy is done in the normal way,
 | 
						|
 * treating base as a user or kernel virtual address
 | 
						|
 * according to iov_segflg.
 | 
						|
 */
 | 
						|
struct iovec {
 | 
						|
	caddr_t	iov_base;
 | 
						|
	int	iov_len;
 | 
						|
	enum	uio_seg iov_segflg;
 | 
						|
	int	(*iov_op)();
 | 
						|
};
 | 
						|
.DE
 | 
						|
.DS
 | 
						|
.ta .5i +\w'UIO_USERISPACE\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Segment flag values.
 | 
						|
 */
 | 
						|
enum	uio_seg {
 | 
						|
	UIO_USERSPACE,		/* from user data space */
 | 
						|
	UIO_SYSSPACE,		/* from system space */
 | 
						|
	UIO_USERISPACE		/* from user I space */
 | 
						|
};
 | 
						|
.DE
 | 
						|
.SH
 | 
						|
File and filesystem operations
 | 
						|
.PP
 | 
						|
With the introduction of the data structures used by the filesystem
 | 
						|
operations, the complete list of filesystem entry points may be listed.
 | 
						|
As noted, they derive mostly from the Sun VFS interface.
 | 
						|
Lines marked with \fB+\fP are additions to the Sun definitions;
 | 
						|
lines marked with \fB!\fP are modified from VFS.
 | 
						|
.PP
 | 
						|
The structure describing the externally-visible features of a mounted
 | 
						|
filesystem, \fIvfs\fP, is:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Structure per mounted file system.
 | 
						|
 * Each mounted file system has an array of
 | 
						|
 * operations and an instance record.
 | 
						|
 * The file systems are put on a doubly linked list.
 | 
						|
 */
 | 
						|
struct vfs {
 | 
						|
	struct vfs	*vfs_next;		/* next vfs in vfs list */
 | 
						|
\fB+\fP	struct vfs	*vfs_prev;		/* prev vfs in vfs list */
 | 
						|
	struct vfsops	*vfs_op;		/* operations on vfs */
 | 
						|
	struct vnode	*vfs_vnodecovered;	/* vnode we mounted on */
 | 
						|
	int	vfs_flag;		/* flags */
 | 
						|
\fB!\fP	int	vfs_fsize;		/* fundamental block size */
 | 
						|
\fB+\fP	int	vfs_bsize;		/* optimal transfer size */
 | 
						|
\fB!\fP	uid_t	vfs_exroot;		/* exported fs uid 0 mapping */
 | 
						|
	short	vfs_exflags;		/* exported fs flags */
 | 
						|
	caddr_t	vfs_data;		/* private data */
 | 
						|
};
 | 
						|
.DE
 | 
						|
.DS
 | 
						|
.ta \w'\fB+\fP 'u +\w'#define\0\0'u +\w'VFS_EXPORTED\0\0'u +\w'0x40\0\0\0\0\0'u
 | 
						|
	/*
 | 
						|
	 * vfs flags.
 | 
						|
	 * VFS_MLOCK lock the vfs so that name lookup cannot proceed past the vfs.
 | 
						|
	 * This keeps the subtree stable during mounts and unmounts.
 | 
						|
	 */
 | 
						|
	#define	VFS_RDONLY	0x01		/* read only vfs */
 | 
						|
\fB+\fP	#define	VFS_NOEXEC	0x02		/* can't exec from filesystem */
 | 
						|
	#define	VFS_MLOCK	0x04		/* lock vfs so that subtree is stable */
 | 
						|
	#define	VFS_MWAIT	0x08		/* someone is waiting for lock */
 | 
						|
	#define	VFS_NOSUID	0x10		/* don't honor setuid bits on vfs */
 | 
						|
	#define	VFS_EXPORTED	0x20		/* file system is exported (NFS) */
 | 
						|
 | 
						|
	/*
 | 
						|
	 * exported vfs flags.
 | 
						|
	 */
 | 
						|
	#define	EX_RDONLY	0x01		/* exported read only */
 | 
						|
.DE
 | 
						|
.LP
 | 
						|
The operations supported by the filesystem-specific layer
 | 
						|
on an individual filesystem are:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Operations supported on virtual file system.
 | 
						|
 */
 | 
						|
struct vfsops {
 | 
						|
\fB!\fP	int	(*vfs_mount)(		/* vfs, path, data, datalen */ );
 | 
						|
\fB!\fP	int	(*vfs_unmount)(		/* vfs, forcibly */ );
 | 
						|
\fB+\fP	int	(*vfs_mountroot)();
 | 
						|
	int	(*vfs_root)(		/* vfs, vpp */ );
 | 
						|
\fB!\fP	int	(*vfs_statfs)(		/* vfs, vp, sbp */ );
 | 
						|
\fB!\fP	int	(*vfs_sync)(		/* vfs, waitfor */ );
 | 
						|
\fB+\fP	int	(*vfs_fhtovp)(		/* vfs, fhp, vpp */ );
 | 
						|
\fB+\fP	int	(*vfs_vptofh)(		/* vp, fhp */ );
 | 
						|
};
 | 
						|
.DE
 | 
						|
.LP
 | 
						|
The \fIvfs_statfs\fP entry returns a structure of the form:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * file system statistics
 | 
						|
 */
 | 
						|
struct statfs {
 | 
						|
\fB!\fP	short	f_type;			/* type of filesystem */
 | 
						|
\fB+\fP	short	f_flags;		/* copy of vfs (mount) flags */
 | 
						|
\fB!\fP	long	f_fsize;		/* fundamental file system block size */
 | 
						|
\fB+\fP	long	f_bsize;		/* optimal transfer block size */
 | 
						|
	long	f_blocks;		/* total data blocks in file system */
 | 
						|
	long	f_bfree;		/* free blocks in fs */
 | 
						|
	long	f_bavail;		/* free blocks avail to non-superuser */
 | 
						|
	long	f_files;		/* total file nodes in file system */
 | 
						|
	long	f_ffree;		/* free file nodes in fs */
 | 
						|
	fsid_t	f_fsid;			/* file system id */
 | 
						|
\fB+\fP	char	*f_mntonname;		/* directory on which mounted */
 | 
						|
\fB+\fP	char	*f_mntfromname;		/* mounted filesystem */
 | 
						|
	long	f_spare[7];		/* spare for later */
 | 
						|
};
 | 
						|
 | 
						|
typedef long fsid_t[2];			/* file system id type */
 | 
						|
.DE
 | 
						|
.LP
 | 
						|
The modifications to Sun's interface at this level are minor.
 | 
						|
Additional arguments are present for the \fIvfs_mount\fP and \fIvfs_umount\fP
 | 
						|
entries.
 | 
						|
\fIvfs_statfs\fP accepts a vnode as well as filesystem identifier,
 | 
						|
as the information may not be uniform throughout a filesystem.
 | 
						|
For example,
 | 
						|
if a client may mount a file tree that spans multiple physical
 | 
						|
filesystems on a server, different sections may have different amounts
 | 
						|
of free space.
 | 
						|
(NFS does not allow remotely-mounted file trees to span physical filesystems
 | 
						|
on the server.)
 | 
						|
The final additions are the entries that support file handles.
 | 
						|
\fIvfs_vptofh\fP is provided for the use of file servers,
 | 
						|
which need to obtain an opaque
 | 
						|
file handle to represent the current vnode for transmission to clients.
 | 
						|
This file handle may later be used to relocate the vnode using \fIvfs_fhtovp\fP
 | 
						|
without requiring the vnode to remain in memory.
 | 
						|
.PP
 | 
						|
Finally, the external form of a filesystem object, the \fIvnode\fP, is:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
 | 
						|
/*
 | 
						|
 * vnode types. VNON means no type.
 | 
						|
 */
 | 
						|
enum vtype 	{ VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK };
 | 
						|
 | 
						|
struct vnode {
 | 
						|
	u_short	v_flag;			/* vnode flags (see below) */
 | 
						|
	u_short	v_count;		/* reference count */
 | 
						|
	u_short	v_shlockc;		/* count of shared locks */
 | 
						|
	u_short	v_exlockc;		/* count of exclusive locks */
 | 
						|
	struct vfs	*v_vfsmountedhere;	/* ptr to vfs mounted here */
 | 
						|
	struct vfs	*v_vfsp;		/* ptr to vfs we are in */
 | 
						|
	struct vnodeops	*v_op;			/* vnode operations */
 | 
						|
\fB+\fP	struct text	*v_text;		/* text/mapped region */
 | 
						|
	enum vtype	v_type;			/* vnode type */
 | 
						|
	caddr_t	v_data;			/* private data for fs */
 | 
						|
};
 | 
						|
.DE
 | 
						|
.DS
 | 
						|
.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * vnode flags.
 | 
						|
 */
 | 
						|
#define	VROOT	0x01	/* root of its file system */
 | 
						|
#define	VTEXT	0x02	/* vnode is a pure text prototype */
 | 
						|
#define	VEXLOCK	0x10	/* exclusive lock */
 | 
						|
#define	VSHLOCK	0x20	/* shared lock */
 | 
						|
#define	VLWAIT	0x40	/* proc is waiting on shared or excl. lock */
 | 
						|
.DE
 | 
						|
.LP
 | 
						|
The operations supported by the filesystems on individual \fIvnode\fP\^s
 | 
						|
are:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'int\0\0\0\0\0'u  +\w'(*vn_getattr)(\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * Operations on vnodes.
 | 
						|
 */
 | 
						|
struct vnodeops {
 | 
						|
\fB!\fP	int	(*vn_lookup)(		/* ndp */ );
 | 
						|
\fB!\fP	int	(*vn_create)(		/* ndp, vap, fflags */ );
 | 
						|
\fB+\fP	int	(*vn_mknod)(		/* ndp, vap, fflags */ );
 | 
						|
\fB!\fP	int	(*vn_open)(		/* vp, fflags, cred */ );
 | 
						|
	int	(*vn_close)(		/* vp, fflags, cred */ );
 | 
						|
	int	(*vn_access)(		/* vp, fflags, cred */ );
 | 
						|
	int	(*vn_getattr)(		/* vp, vap, cred */ );
 | 
						|
	int	(*vn_setattr)(		/* vp, vap, cred */ );
 | 
						|
 | 
						|
\fB+\fP	int	(*vn_read)(		/* vp, uiop, offp, ioflag, cred */ );
 | 
						|
\fB+\fP	int	(*vn_write)(		/* vp, uiop, offp, ioflag, cred */ );
 | 
						|
\fB!\fP	int	(*vn_ioctl)(		/* vp, com, data, fflag, cred */ );
 | 
						|
	int	(*vn_select)(		/* vp, which, cred */ );
 | 
						|
\fB+\fP	int	(*vn_mmap)(		/* vp, ..., cred */ );
 | 
						|
	int	(*vn_fsync)(		/* vp, cred */ );
 | 
						|
\fB+\fP	int	(*vn_seek)(		/* vp, offp, off, whence */ );
 | 
						|
 | 
						|
\fB!\fP	int	(*vn_remove)(		/* ndp */ );
 | 
						|
\fB!\fP	int	(*vn_link)(		/* vp, ndp */ );
 | 
						|
\fB!\fP	int	(*vn_rename)(		/* src ndp, target ndp */ );
 | 
						|
\fB!\fP	int	(*vn_mkdir)(		/* ndp, vap */ );
 | 
						|
\fB!\fP	int	(*vn_rmdir)(		/* ndp */ );
 | 
						|
\fB!\fP	int	(*vn_symlink)(		/* ndp, vap, nm */ );
 | 
						|
	int	(*vn_readdir)(		/* vp, uiop, offp, ioflag, cred */ );
 | 
						|
	int	(*vn_readlink)(		/* vp, uiop, ioflag, cred */ );
 | 
						|
 | 
						|
\fB+\fP	int	(*vn_abortop)(		/* ndp */ );
 | 
						|
\fB+\fP	int	(*vn_lock)(		/* vp */ );
 | 
						|
\fB+\fP	int	(*vn_unlock)(		/* vp */ );
 | 
						|
\fB!\fP	int	(*vn_inactive)(		/* vp */ );
 | 
						|
};
 | 
						|
.DE
 | 
						|
.DS
 | 
						|
.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0'u
 | 
						|
/*
 | 
						|
 * flags for ioflag
 | 
						|
 */
 | 
						|
#define	IO_UNIT	0x01		/* do io as atomic unit for VOP_RDWR */
 | 
						|
#define	IO_APPEND	0x02		/* append write for VOP_RDWR */
 | 
						|
#define	IO_SYNC	0x04		/* sync io for VOP_RDWR */
 | 
						|
.DE
 | 
						|
.LP
 | 
						|
The argument types listed in the comments following each operation are:
 | 
						|
.sp
 | 
						|
.IP ndp 10
 | 
						|
A pointer to a \fInameidata\fP structure.
 | 
						|
.IP vap
 | 
						|
A pointer to a \fIvattr\fP structure (vnode attributes; see below).
 | 
						|
.IP fflags
 | 
						|
File open flags, possibly including O_APPEND, O_CREAT, O_TRUNC and O_EXCL.
 | 
						|
.IP vp
 | 
						|
A pointer to a \fIvnode\fP previously obtained with \fIvn_lookup\fP.
 | 
						|
.IP cred
 | 
						|
A pointer to a \fIucred\fP credentials structure.
 | 
						|
.IP uiop
 | 
						|
A pointer to a \fIuio\fP structure.
 | 
						|
.IP ioflag
 | 
						|
Any of the IO flags defined above.
 | 
						|
.IP com
 | 
						|
An \fIioctl\fP command, with type \fIunsigned long\fP.
 | 
						|
.IP data
 | 
						|
A pointer to a character buffer used to pass data to or from an \fIioctl\fP.
 | 
						|
.IP which
 | 
						|
One of FREAD, FWRITE or 0 (select for exceptional conditions).
 | 
						|
.IP off
 | 
						|
A file offset of type \fIoff_t\fP.
 | 
						|
.IP offp
 | 
						|
A pointer to file offset of type \fIoff_t\fP.
 | 
						|
.IP whence
 | 
						|
One of L_SET, L_INCR, or L_XTND.
 | 
						|
.IP fhp
 | 
						|
A pointer to a file handle buffer.
 | 
						|
.sp
 | 
						|
.PP
 | 
						|
Several changes have been made to Sun's set of vnode operations.
 | 
						|
Most obviously, the \fIvn_lookup\fP receives a \fInameidata\fP structure
 | 
						|
containing its arguments and context as described.
 | 
						|
The same structure is also passed to one of the creation or deletion
 | 
						|
entries if the lookup operation is for CREATE or DELETE to complete
 | 
						|
an operation, or to the \fIvn_abortop\fP entry if no operation
 | 
						|
is undertaken.
 | 
						|
For filesystems that perform no locking between lookup for creation
 | 
						|
or deletion and the call to implement that action,
 | 
						|
the final pathname component may be left untranslated by the lookup
 | 
						|
routine.
 | 
						|
In any case, the pathname pointer points at the final name component,
 | 
						|
and the \fInameidata\fP contains a reference to the vnode of the parent
 | 
						|
directory.
 | 
						|
The interface is thus flexible enough to accommodate filesystems
 | 
						|
that are fully stateful or fully stateless, while avoiding redundant
 | 
						|
operations whenever possible.
 | 
						|
One operation remains problematical, the \fIvn_rename\fP call.
 | 
						|
It is tempting to look up the source of the rename for deletion
 | 
						|
and the target for creation.
 | 
						|
However, filesystems that lock directories during such lookups must avoid
 | 
						|
deadlock if the two paths cross.
 | 
						|
For that reason, the source is translated for LOOKUP only,
 | 
						|
with the WANTPARENT flag set;
 | 
						|
the target is then translated with an operation of CREATE.
 | 
						|
.PP
 | 
						|
In addition to the changes concerned with the \fInameidata\fP interface,
 | 
						|
several other changes were made in the vnode operations.
 | 
						|
The \fIvn_rdrw\fP entry was split into \fIvn_read\fP and \fIvn_write\fP;
 | 
						|
frequently, the read/write entry amounts to a routine that checks
 | 
						|
the direction flag, then calls either a read routine or a write routine.
 | 
						|
The two entries may be identical for any given filesystem;
 | 
						|
the direction flag is contained in the \fIuio\fP given as an argument.
 | 
						|
.PP
 | 
						|
All of the read and write operations use a \fIuio\fP to describe
 | 
						|
the file offset and buffer locations.
 | 
						|
All of these fields must be updated before return.
 | 
						|
In particular, the \fIvn_readdir\fP entry uses this
 | 
						|
to return a new file offset token for its current location.
 | 
						|
.PP
 | 
						|
Several new operations have been added.
 | 
						|
The first, \fIvn_seek\fP, is a concession to record-oriented files
 | 
						|
such as directories.
 | 
						|
It allows the filesystem to verify that a seek leaves a file at a sensible
 | 
						|
offset, or to return a new offset token relative to an earlier one.
 | 
						|
For most filesystems and files, this operation amounts to performing
 | 
						|
simple arithmetic.
 | 
						|
Another new entry point is \fIvn_mmap\fP, for use in mapping device memory
 | 
						|
into a user process address space.
 | 
						|
Its semantics are not yet decided.
 | 
						|
The final additions are the \fIvn_lock\fP and \fIvn_unlock\fP entries.
 | 
						|
These are used to request that the underlying file be locked against
 | 
						|
changes for short periods of time if the filesystem implementation allows it.
 | 
						|
They are used to maintain consistency
 | 
						|
during internal operations such as \fIexec\fP,
 | 
						|
and may not be used to construct atomic operations from other filesystem
 | 
						|
operations.
 | 
						|
.PP
 | 
						|
The attributes of a vnode are not stored in the vnode,
 | 
						|
as they might change with time and may need to be read from a remote
 | 
						|
source.
 | 
						|
Attributes have the form:
 | 
						|
.DS
 | 
						|
.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
 | 
						|
/*
 | 
						|
 * Vnode attributes.  A field value of -1
 | 
						|
 * represents a field whose value is unavailable
 | 
						|
 * (getattr) or which is not to be changed (setattr).
 | 
						|
 */
 | 
						|
struct vattr {
 | 
						|
	enum vtype	va_type;	/* vnode type (for create) */
 | 
						|
	u_short	va_mode;	/* files access mode and type */
 | 
						|
\fB!\fP	uid_t	va_uid;		/* owner user id */
 | 
						|
\fB!\fP	gid_t	va_gid;		/* owner group id */
 | 
						|
	long	va_fsid;	/* file system id (dev for now) */
 | 
						|
\fB!\fP	long	va_fileid;	/* file id */
 | 
						|
	short	va_nlink;	/* number of references to file */
 | 
						|
	u_long	va_size;	/* file size in bytes (quad?) */
 | 
						|
\fB+\fP	u_long	va_size1;	/* reserved if not quad */
 | 
						|
	long	va_blocksize;	/* blocksize preferred for i/o */
 | 
						|
	struct timeval	va_atime;	/* time of last access */
 | 
						|
	struct timeval	va_mtime;	/* time of last modification */
 | 
						|
	struct timeval	va_ctime;	/* time file changed */
 | 
						|
	dev_t	va_rdev;	/* device the file represents */
 | 
						|
	u_long	va_bytes;	/* bytes of disk space held by file */
 | 
						|
\fB+\fP	u_long	va_bytes1;	/* reserved if va_bytes not a quad */
 | 
						|
};
 | 
						|
.DE
 | 
						|
.SH
 | 
						|
Conclusions
 | 
						|
.PP
 | 
						|
The Sun VFS filesystem interface is the most widely used generic
 | 
						|
filesystem interface.
 | 
						|
Of the interfaces examined, it creates the cleanest separation
 | 
						|
between the filesystem-independent and -dependent layers and data structures.
 | 
						|
It has several flaws, but it is felt that certain changes in the interface
 | 
						|
can ameliorate most of them.
 | 
						|
The interface proposed here includes those changes.
 | 
						|
The proposed interface is now being implemented by the Computer Systems
 | 
						|
Research Group at Berkeley.
 | 
						|
If the design succeeds in improving the flexibility and performance
 | 
						|
of the filesystem layering, it will be advanced as a model interface.
 | 
						|
.SH
 | 
						|
Acknowledgements
 | 
						|
.PP
 | 
						|
The filesystem interface described here is derived from Sun's VFS interface.
 | 
						|
It also includes features similar to those of DEC's GFS interface.
 | 
						|
We are indebted to members of the Sun and DEC system groups
 | 
						|
for long discussions of the issues involved.
 | 
						|
.br
 | 
						|
.ne 2i
 | 
						|
.SH
 | 
						|
References
 | 
						|
 | 
						|
.IP Brownbridge82 \w'Satyanarayanan85\0\0'u
 | 
						|
Brownbridge, D.R., L.F. Marshall, B. Randell,
 | 
						|
``The Newcastle Connection, or UNIXes of the World Unite!,''
 | 
						|
\fISoftware\- Practice and Experience\fP, Vol. 12, pp. 1147-1162, 1982.
 | 
						|
 | 
						|
.IP Cole85
 | 
						|
Cole, C.T., P.B. Flinn, A.B. Atlas,
 | 
						|
``An Implementation of an Extended File System for UNIX,''
 | 
						|
\fIUsenix Conference Proceedings\fP,
 | 
						|
pp. 131-150, June, 1985.
 | 
						|
 | 
						|
.IP Kleiman86
 | 
						|
``Vnodes: An Architecture for Multiple File System Types in Sun UNIX,''
 | 
						|
\fIUsenix Conference Proceedings\fP,
 | 
						|
pp. 238-247, June, 1986.
 | 
						|
 | 
						|
.IP Leffler84
 | 
						|
Leffler, S., M.K. McKusick, M. Karels,
 | 
						|
``Measuring and Improving the Performance of 4.2BSD,''
 | 
						|
\fIUsenix Conference Proceedings\fP, pp. 237-252, June, 1984.
 | 
						|
 | 
						|
.IP McKusick84
 | 
						|
McKusick, M.K., W.N. Joy, S.J. Leffler, R.S. Fabry,
 | 
						|
``A Fast File System for UNIX,'' \fITransactions on Computer Systems\fP,
 | 
						|
Vol. 2, pp. 181-197,
 | 
						|
ACM, August, 1984.
 | 
						|
 | 
						|
.IP McKusick85
 | 
						|
McKusick, M.K., M. Karels, S. Leffler,
 | 
						|
``Performance Improvements and Functional Enhancements in 4.3BSD,''
 | 
						|
\fIUsenix Conference Proceedings\fP, pp. 519-531, June, 1985.
 | 
						|
 | 
						|
.IP Rifkin86
 | 
						|
Rifkin, A.P., M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh,
 | 
						|
``RFS Architectural Overview,'' \fIUsenix Conference Proceedings\fP,
 | 
						|
pp. 248-259, June, 1986.
 | 
						|
 | 
						|
.IP Ritchie74
 | 
						|
Ritchie, D.M. and K. Thompson, ``The Unix Time-Sharing System,''
 | 
						|
\fICommunications of the ACM\fP, Vol. 17, pp. 365-375, July, 1974.
 | 
						|
 | 
						|
.IP Rodriguez86
 | 
						|
Rodriguez, R., M. Koehler, R. Hyde,
 | 
						|
``The Generic File System,'' \fIUsenix Conference Proceedings\fP,
 | 
						|
pp. 260-269, June, 1986.
 | 
						|
 | 
						|
.IP Sandberg85
 | 
						|
Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon,
 | 
						|
``Design and Implementation of the Sun Network Filesystem,''
 | 
						|
\fIUsenix Conference Proceedings\fP,
 | 
						|
pp. 119-130, June, 1985.
 | 
						|
 | 
						|
.IP Satyanarayanan85
 | 
						|
Satyanarayanan, M., \fIet al.\fP,
 | 
						|
``The ITC Distributed File System: Principles and Design,''
 | 
						|
\fIProc. 10th Symposium on Operating Systems Principles\fP, pp. 35-50,
 | 
						|
ACM, December, 1985.
 | 
						|
 | 
						|
.IP Walker85
 | 
						|
Walker, B.J. and S.H. Kiser, ``The LOCUS Distributed Filesystem,''
 | 
						|
\fIThe LOCUS Distributed System Architecture\fP,
 | 
						|
G.J. Popek and B.J. Walker, ed., The MIT Press, Cambridge, MA, 1985.
 | 
						|
 | 
						|
.IP Weinberger84
 | 
						|
Weinberger, P.J., ``The Version 8 Network File System,''
 | 
						|
\fIUsenix Conference presentation\fP,
 | 
						|
June, 1984.
 |