NFS: Network File System

29.1 Introduction

In this chapter we describe NFS, the Network File System, another popular application that provides transparent file access for client applications. The building block of NFS is Sun RPC: Remote Procedure Call, which we must describe first.

Nothing special need be done by the client program to use NFS. The kernel detects that the file being accessed is on an NFS server and automatically generates the RPC calls to access the file.

Our interest in NFS is not in all the details on file access, but in its use of the Internet protocols, especially UDP.

29.2 Sun Remote Procedure Call

Most network programming is done by writing application programs that call system-provided functions to perform specific network operations. For example, one function performs a TCP active open, another performs a TCP passive open, another sends data across a TCP connection, another sets specific protocol options (enable TCP's keepalive timer), and so on. In Section 1.15 we mentioned that two popular sets of functions for network programming (called APIs) are sockets and TLI. The API used by the client and the API used by the server can be different, as can the operating systems running on the client and server. It is the communication protocol and application protocol that determine if a given client and server can communicate with each other. A Unix client written in C using sockets and TCP can communicate with a mainframe server written in COBOL using some other API and TCP, if both hosts are connected across a network and both have a TCP/IP implementation.

Typically the client sends commands to the server, and the server sends replies back to the client. All the applications we've looked at so far - Ping, Traceroute, routing daemons, and the clients and servers for the DNS, TFTP, BOOTP, SNMP, Telnet, FTP, and SMTP - are built this way.

RPC, Remote Procedure Call, is a different way of doing network programming. A client program is written that just calls functions in the server program. This is how it appears to the programmer, but the following steps actually take place.

  1. When the client calls the remote procedure, it's really calling a function on the local host that's generated by the RPC package. This function is called the client stub. The client stub packages the procedure arguments into a network message, and sends this message to the server.
  2. A server stub on the server host receives the network message. It takes the arguments from the network message, and calls the server procedure that the application programmer wrote.
  3. When the server function returns, it returns to the server stub, which takes the return values, packages them into a network message, and sends the message back to the client stub.
  4. The client stub takes the return values from the network message and returns to the client application.
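As a concrete illustration of these four steps, the sketch below uses the Sun RPC library routine callrpc(), which acts as a generic client stub: it marshals the argument, sends the call message as a UDP datagram, waits for the reply, and unmarshals the result. The example is ours, not taken from any particular application; it simply calls the null procedure (procedure 0) that every RPC program provides, here on the port mapper (program 100000, version 2) of the local host.

#include <stdio.h>
#include <rpc/rpc.h>      /* Sun RPC library: callrpc(), xdr_void(), clnt_perrno() */

#define PMAP_PROGRAM  100000L   /* port mapper program number */
#define PMAP_VERSION  2L        /* port mapper version */

int
main(void)
{
    enum clnt_stat stat;

    /* callrpc() plays the role of a client stub: it builds the RPC call
     * message, sends it as a UDP datagram, waits for the reply, and
     * decodes it.  The null procedure (procedure 0) takes no arguments
     * and returns nothing, so xdr_void is used in both directions. */
    stat = callrpc("localhost", PMAP_PROGRAM, PMAP_VERSION, 0,
                   (xdrproc_t) xdr_void, (char *) NULL,
                   (xdrproc_t) xdr_void, (char *) NULL);
    if (stat != RPC_SUCCESS) {
        clnt_perrno(stat);          /* print a description of the RPC error */
        return 1;
    }
    printf("port mapper answered\n");
    return 0;
}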

The network programming done by the stubs and the RPC library routines uses an API such as sockets or TLI, but the user application-the client program, and the server procedures called by the client-never deal with this API. The client application just calls the server procedures and all the network programming details are hidden by the RPC package, the client stub, and the server stub. An RPC package provides numerous benefits.

  1. The programming is easier since there is little or no network programming involved. The application programmer just writes a client program and the server procedures that the client calls.
  2. If an unreliable protocol such as UDP is used, details like timeout and retransmission are handled by the RPC package. This simplifies the user application.
  3. The RPC library handles any required data translation for the arguments and return values. For example, if the arguments consist of integers and floating point numbers, the RPC package handles any differences in the way integers and floating point numbers are stored on the client and server. This simplifies coding clients and servers that can operate in heterogeneous environments.

Details of RPC programming are provided in Chapter 18 of [Stevens 1990]. Two popular RPC packages are Sun RPC and the RPC package in the Open Software Foundation's (OSF) Distributed Computing Environment (DCE). Our interest in RPC is to see what the procedure call and procedure return messages look like for the Sun RPC package, since it's used by the Network File System, which we describe in this chapter. Version 2 of Sun RPC is defined in RFC 1057 [Sun Microsystems 1988a].

Sun RPC

Sun RPC comes in two flavors. One version is built using the sockets API and works with TCP and UDP. Another, called TI-RPC (for "transport independent"), is built using the TLI API and works with any transport layer provided by the kernel. From our perspective the two are the same, although we talk only about TCP and UDP in this chapter.

Figure 29.1 shows the format of an RPC procedure call message, when UDP is used.


Figure 29.1 Format of RPC procedure call message as a UDP datagram.

The IP and UDP headers are the standard ones we showed earlier (Figures 3.1 and 11.2). What follows after the UDP header is defined by the RPC package.

The transaction ID (XID) is set by the client and returned by the server. When the client receives a reply it compares the XID returned by the server with the XID of the request it sent. If they don't match, the client discards the message and waits for the next one from the server. Each time the client issues a new RPC, it changes the XID. But if the client retransmits a previously sent RPC (because it hasn't received a reply), the XID does not change.

The call variable is 0 for a call, and 1 for a reply. The current RPC version is 2. The next three variables, program number, version number, and procedure number, identify the specific procedure on the server to be called.

The credentials identify the client. In some instances nothing is sent here, and in other instances the numeric user ID and group IDs of the client are sent. The server can look at the credentials and determine if it will perform the request or not. The verifier is used with Secure RPC, which uses DES encryption. Although the credentials and verifier are variable-length fields, their length is encoded as part of the field.

Following this are the procedure parameters. The format of these depends on the definition of the remote procedure by the application. How does the receiver (the server stub) know the size of the parameters? Since UDP is being used, the size of the UDP datagram, minus the length of all the fields up through the verifier, is the size of the parameters. When TCP is used instead of UDP, there is no inherent length, since TCP is a byte stream protocol, without record boundaries. To handle this, a 4-byte length field appears between the TCP header and the XID, telling the receiver how many bytes comprise the RPC call. This allows the RPC call message to be sent in multiple TCP segments, if necessary. (The DNS uses a similar technique; see Exercise 14.4.)
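To summarize the fixed-size fields that begin every call message, here is a hypothetical C picture of the layout. It is not a declaration taken from any RPC implementation; the variable-length credentials, verifier, and parameters that follow are shown only as comments.

#include <stdint.h>

/* Illustrative layout of the fields in Figure 29.1 that follow the UDP
 * header.  Every field shown is a 4-byte integer, transmitted in
 * big-endian byte order (Section 29.3). */
struct rpc_call_header {
    uint32_t xid;            /* transaction ID, chosen by the client      */
    uint32_t msg_type;       /* 0 = call                                  */
    uint32_t rpc_version;    /* 2                                         */
    uint32_t program;        /* e.g., 100003 for NFS                      */
    uint32_t version;        /* version of the program, e.g., 2           */
    uint32_t procedure;      /* which remote procedure to call            */
    /* credentials: flavor, length, then up to 400 bytes of data          */
    /* verifier:    flavor, length, then up to 400 bytes of data          */
    /* procedure parameters, encoded in XDR, fill the rest of the datagram */
};

/* When TCP is used instead of UDP, a 4-byte length field precedes the
 * XID, giving the number of bytes in the RPC call message.               */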

Figure 29.2 shows the format of an RPC reply. This is sent by the server stub to the client stub, when the remote procedure returns.


Figure 29.2 Format of RPC procedure reply message as a UDP datagram.

The XID in the reply is just copied from the XID in the call. The reply is 1, which we said differentiates this message from a call. The status is 0 if the call message was accepted. (The message can be rejected if the RPC version number isn't 2, or if the server cannot authenticate the client.) The verifier is used with secure RPC to identify the server.

The accept status is 0 on success. A nonzero value can indicate an invalid version number or an invalid procedure number, for example. As with the RPC call message, if TCP is used instead of UDP, a 4-byte length field is sent between the TCP header and the XID.
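The reply has a similar fixed prefix. The sketch below assumes the common case of a null verifier (flavor AUTH_NULL with no data), which makes every field a 4-byte integer; again this is an illustration of Figure 29.2, not a declaration from an actual implementation.

#include <stdint.h>

/* Illustrative layout of an accepted RPC reply (Figure 29.2), assuming
 * a null verifier so that all fields are 4-byte integers. */
struct rpc_reply_header {
    uint32_t xid;            /* copied from the matching call             */
    uint32_t msg_type;       /* 1 = reply                                 */
    uint32_t reply_status;   /* 0 = call message accepted                 */
    uint32_t verf_flavor;    /* 0 = AUTH_NULL (assumed here)              */
    uint32_t verf_length;    /* 0 bytes of verifier data (assumed here)   */
    uint32_t accept_status;  /* 0 = success                               */
    /* procedure results, encoded in XDR, fill the rest of the datagram   */
};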

29.3 XDR: External Data Representation

XDR, External Data Representation, is the standard used to encode the values in the RPC call and reply messages-the RPC header fields (XID, program number, accept status, etc.), the procedure parameters, and the procedure results. Having a standard way of encoding all these values is what lets a client on one system call a procedure on a system with a different architecture. XDR is defined in RFC 1014 [Sun Microsystems 1987].

XDR defines numerous data types and exactly how they are transmitted in an RPC message (bit order, byte order, etc.). The sender must build an RPC message in XDR format, then the receiver converts the XDR format into its native representation. We see, for example, in Figures 29.1 and 29.2, that all the integer values we show (XID, call, program number, etc.) are 4-byte integers. Indeed, all integers occupy 4 bytes in XDR. Other data types supported by XDR include unsigned integers, booleans, floating point numbers, fixed-length arrays, variable-length arrays, and structures.
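As a small example, the sketch below uses the XDR library routines xdrmem_create() and xdr_int(), which Sun RPC implementations supply, to encode an integer into its 4-byte XDR form in a memory buffer and then decode it again; the value and buffer size are arbitrary.

#include <stdio.h>
#include <rpc/rpc.h>            /* XDR library: xdrmem_create(), xdr_int() */

int
main(void)
{
    char buf[4];                /* every XDR integer occupies 4 bytes */
    XDR xenc, xdec;
    int out = 12345, in = 0;

    /* Encode: convert the host's native int into 4 big-endian bytes. */
    xdrmem_create(&xenc, buf, sizeof(buf), XDR_ENCODE);
    if (!xdr_int(&xenc, &out))
        return 1;

    /* Decode: a receiver with a different architecture would run the
     * same routine with XDR_DECODE to get its native representation. */
    xdrmem_create(&xdec, buf, sizeof(buf), XDR_DECODE);
    if (!xdr_int(&xdec, &in))
        return 1;

    printf("encoded %d, decoded %d\n", out, in);
    return 0;
}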

29.4 Port Mapper

The RPC server programs containing the remote procedures use ephemeral ports, not well-known ports. This requires a "registrar" of some form that keeps track of which RPC programs are using which ephemeral ports. In Sun RPC this registrar is called the port mapper.

The term "port" in this name originates from the TCP and UDP port numbers, features of the Internet protocol suite. Since TI-RPC works over any transport layer, and not just TCP and UDP, the name of the port mapper in systems using TI-RPC (SVR4 and Solaris 2.2, for example) has become rpcbind. We'll continue to use the more familiar name of port mapper.

Naturally, the port mapper itself must have a well-known port: UDP port 111 and TCP port 111. The port mapper is also just an RPC server program. It has a program number (100000), a version number (2), a TCP port of 111, and a UDP port of 111. Servers register themselves with the port mapper using RPC calls, and clients query the port mapper using RPC calls. The port mapper provides four server procedures:

  1. PMAPPROC_SET. Called by an RPC server on startup to register a program number, version number, and protocol with a port number.
  2. PMAPPROC_UNSET. Called by an RPC server to remove a previously registered mapping.
  3. PMAPPROC_GETPORT. Called by an RPC client on startup to obtain the port number for a given program number, version number, and protocol.
  4. PMAPPROC_DUMP. Returns all entries (program number, version number, protocol, and port number) in the port mapper database.
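As an example of the third procedure, most Sun RPC implementations provide the library routine pmap_getport(), which sends the PMAPPROC_GETPORT query on the caller's behalf. The sketch below asks a server's port mapper for the UDP port registered by NFS (program 100003, version 2); the surrounding code is ours, and the exact headers can differ between systems.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <rpc/rpc.h>
#include <rpc/pmap_clnt.h>      /* pmap_getport() */

int
main(int argc, char *argv[])
{
    struct hostent *hp;
    struct sockaddr_in addr;
    u_short port;

    if (argc != 2 || (hp = gethostbyname(argv[1])) == NULL) {
        fprintf(stderr, "usage: getport hostname\n");
        exit(1);
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    memcpy(&addr.sin_addr, hp->h_addr, hp->h_length);
    /* addr.sin_port need not be set; pmap_getport() contacts port 111 */

    /* Ask the port mapper: which port is NFS (100003, version 2, UDP)? */
    port = pmap_getport(&addr, 100003, 2, IPPROTO_UDP);
    if (port == 0) {
        fprintf(stderr, "pmap_getport failed\n");
        exit(1);
    }
    printf("NFS is on UDP port %u\n", port);
    return 0;
}

For a server like the one shown in the rpcinfo output later in this section, this would print 2049.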

When an RPC server program starts, and is later called by an RPC client program, the following steps take place.

  1. The port mapper must be started first, normally when the system is bootstrapped. It creates a TCP end point and does a passive open on TCP port 111. It also creates a UDP end point and waits for a UDP datagram to arrive for UDP port 111.
  2. When the RPC server program starts, it creates a TCP end point and a UDP end point for each version of the program that it supports. (A given RPC program can support multiple versions. The client specifies which version it wants when it calls a server procedure.) An ephemeral port number is bound to both end points. (It doesn't matter whether the TCP port number is the same or different from the UDP port number.) The server registers each program, version, protocol, and port number by making a remote procedure call to the port mapper's PMAPPROC_SET procedure.
  3. When the RPC client program starts, it calls the port mapper's PMAPPROC_GETPORT procedure to obtain the ephemeral port number for a given program, version, and protocol.
  4. The client sends an RPC call message to the port number returned in step 3. If UDP is being used, the client just sends a UDP datagram containing an RPC call message (Figure 29.1) to the server's UDP port number. The server responds by sending a UDP datagram containing an RPC reply message (Figure 29.2) back to the client.

    If TCP is being used, the client does an active open to the server's TCP port number, and then sends an RPC call message across the connection. The server responds with an RPC reply message across the connection.
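A hedged sketch of step 2 follows, using the classic Sun RPC library routines svcudp_create(), svc_register(), and svc_run(). Passing a nonzero protocol to svc_register() makes the library issue the PMAPPROC_SET call to the port mapper on the program's behalf. The program number, version, and dispatch routine here are invented for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h>         /* IPPROTO_UDP */
#include <rpc/rpc.h>

#define MYPROG  0x20000099L     /* made-up program number, user-defined range */
#define MYVERS  1L

/* Dispatch routine: the library calls this for each incoming request.
 * A real server would switch on rqstp->rq_proc, decode the arguments,
 * and reply with svc_sendreply(); here we answer only the null procedure. */
static void
my_dispatch(struct svc_req *rqstp, SVCXPRT *transp)
{
    if (rqstp->rq_proc == 0)                        /* null procedure */
        svc_sendreply(transp, (xdrproc_t) xdr_void, (char *) NULL);
    else
        svcerr_noproc(transp);                      /* unknown procedure */
}

int
main(void)
{
    SVCXPRT *transp;

    /* Create a UDP transport on an ephemeral port. */
    transp = svcudp_create(RPC_ANYSOCK);
    if (transp == NULL) {
        fprintf(stderr, "cannot create UDP service\n");
        exit(1);
    }

    /* Register (program, version, protocol, port) with the port mapper. */
    if (!svc_register(transp, MYPROG, MYVERS, my_dispatch, IPPROTO_UDP)) {
        fprintf(stderr, "cannot register with port mapper\n");
        exit(1);
    }

    svc_run();                  /* wait for and dispatch client requests */
    return 1;                   /* svc_run() should never return */
}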

The program rpcinfo(8) prints out the port mapper's current mappings. (It calls the port mapper's PMAPPROC_DUMP procedure.) Here is some typical output:

sun % /usr/etc/rpcinfo -p
   program vers proto   port
    100005    1   tcp    702  mountd          mount daemon for NFS
    100005    1   udp    699  mountd
    100005    2   tcp    702  mountd
    100005    2   udp    699  mountd
    100003    2   udp   2049  nfs             NFS itself
    100021    1   tcp    709  nlockmgr        NFS lock manager
    100021    1   udp   1036  nlockmgr
    100021    2   tcp    721  nlockmgr
    100021    2   udp   1039  nlockmgr
    100021    3   tcp    713  nlockmgr
    100021    3   udp   1037  nlockmgr

We see that some programs do support multiple versions, and each combination of a program number, version number, and protocol has its own port number mapping maintained by the port mapper.

Both versions of the mount daemon are accessed through the same TCP port number (702) and the same UDP port number (699), but each version of the lock manager has its own port number.

29.5 NFS Protocol

NFS provides transparent file access for clients to files and filesystems on a server. This differs from FTP (Chapter 27), which provides file transfer. With FTP a complete copy of the file is made. NFS accesses only the portions of a file that a process references, and a goal of NFS is to make this access transparent. This means that any client application that works with a local file should work with an NFS file, without any program changes whatsoever.

NFS is a client-server application built using Sun RPC. NFS clients access files on an NFS server by sending RPC requests to the server. While this could be done using normal user processes - that is, the NFS client could be a user process that makes explicit RPC calls to the server, and the server could also be a user process - NFS is normally not implemented this way for two reasons. First, accessing an NFS file must be transparent to the client. Therefore the NFS client calls are performed by the client operating system, on behalf of client user processes. Second, NFS servers are implemented within the operating system on the server for efficiency. If the NFS server were a user process, every client request and server reply (including the data being read or written) would have to cross the boundary between the kernel and the user process, which is expensive.

In this section we look at version 2 of NFS, as documented in RFC 1094 [Sun Microsystems 1988b]. A better description of Sun RPC, XDR, and NFS is given in [X/Open 1991]. Details on using and administering NFS are in [Stern 1991]. The specifications for version 3 of the NFS protocol were released in 1993, which we cover in Section 29.7.

Figure 29.3 shows the typical arrangement of an NFS client and an NFS server. There are many subtle points in this figure.


Figure 29.3 Typical arrangement of NFS client and NFS server.
  1. It is transparent to the client whether it's accessing a local file or an NFS file. The kernel determines this when the file is opened. After the tile is opened, the kernel passes all references to local tiles to the box labeled "local file access," and all references to an NFS tile are passed to the "NFS client" box.
  2. The NFS client sends RPC requests to the NFS server through its TCP/IP module. NFS is used predominantly with UDP, but newer implementations can also use TCP.
  3. The NFS server receives client requests as UDP datagrams on port 2049. Although NFS could be made to use the port mapper, allowing the server to use an ephemeral port, UDP port 2049 is hardcoded into most implementations.
  4. When the NFS server receives a client request, the requests are passed to its local file access routines, which access a local disk on the server.
  5. It can take the NFS server a while to handle a client's request. The local filesystem is normally accessed, which can take some time. During this time, the server does not want to block other client requests from being serviced. To handle this, most NFS servers are multithreaded - that is, there are really multiple NFS servers running inside the server kernel. How this is handled depends on the operating system. Since most Unix kernels are not multithreaded, a common technique is to start multiple instances of a user process (often called nfsd) that performs a single system call and remains inside the kernel as a kernel process.
  6. Similarly, it can take the NFS client a while to handle a request from a user process on the client host. An RPC is issued to the server host, and the reply is waited for. To provide more concurrency to the user processes on the client host that are using NFS, there are normally multiple NFS clients running inside the client kernel. Again, the implementation depends on the operating system. Unix systems often use a technique similar to the NFS server technique: a user process named biod that performs a single system call and remains inside the kernel as a kernel process.

Most Unix hosts can operate as either an NFS client, an NFS server, or both. Most PC implementations (MS-DOS) only provide NFS client implementations. Most IBM mainframe implementations only provide NFS server functions.

NFS really consists of more than just the NFS protocol. Figure 29.4 shows the various RPC programs normally used with NFS.

Application         Program number   Version numbers   Number of procedures
port mapper             100000              2                    4
NFS                     100003              2                   15
mount                   100005              1                    5
lock manager            100021            1,2,3                 19
status monitor          100024              1                    6

Figure 29.4 Various RPC programs used with NFS.

The versions we show in this figure are the ones found on systems such as SunOS 4.1.3. Newer implementations are providing newer versions of some of the programs. Solaris 2.2, for example, also supports versions 3 and 4 of the port mapper, and version 2 of the mount daemon. SVR4 also supports version 3 of the port mapper.

The mount daemon is called by the NFS client host before the client can access a filesystem on the server. We discuss this below.

The lock manager and status monitor allow clients to lock portions of files that reside on an NFS server. These two programs are independent of the NFS protocol because locking requires state on both the client and server, and NFS itself is stateless on the server. (We say more about NFS's statelessness later.) Chapters 9, 10, and 11 of [X/Open 1991] document the procedures used by the lock manager and status monitor for file locking with NFS.

File Handles

A fundamental concept in NFS is the file handle. It is an opaque object used to reference a file or directory on the server. The term opaque denotes that the server creates the file handle, passes it back to the client, and then the client uses the file handle when accessing the file. The client never looks at the contents of the file handle-its contents only make sense to the server.

Each time a client process opens a file that is really a file on an NFS server, the NFS client obtains a file handle for that file from the NFS server. Each time the NFS client reads or writes that file for the user process, the file handle is sent back to the server to identify the file being accessed.

Normal user processes never deal with file handles - it is the NFS client code and the NFS server code that pass them back and forth. In version 2 of NFS a file handle occupies 32 bytes, although this increases with version 3 to 64 bytes.

Unix servers normally store the following information in the file handle: the filesystem identifier (the major and minor device numbers of the filesystem), the i-node number (a unique number within a filesystem), and an i-node generation number (a number that changes each time an i-node is reused for a different file).
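A hypothetical picture of how a Unix server might pack this information into the 32-byte version 2 file handle follows; real layouts vary between implementations and, since the handle is opaque, the client never depends on any of it.

#include <stdint.h>

#define NFS_FHSIZE 32           /* version 2 file handles are 32 bytes */

/* Illustrative only: the client treats all 32 bytes as opaque data. */
struct nfs_fhandle {
    uint32_t fsid_major;        /* filesystem ID: major device number   */
    uint32_t fsid_minor;        /* filesystem ID: minor device number   */
    uint32_t inode;             /* i-node number within that filesystem */
    uint32_t generation;        /* i-node generation number             */
    uint8_t  pad[NFS_FHSIZE - 16]; /* unused, padded to 32 bytes        */
};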

Mount Protocol

The client must use the NFS mount protocol to mount a server's filesystem, before the client can access files on that filesystem. This is normally done when the client is bootstrapped. The end result is for the client to obtain a file handle for the server's filesystem.

Figure 29.5 shows the sequence of steps that takes place when a Unix client issues the mount(8) command, specifying an NFS mount.


Figure 29.5 Mount protocol used by Unix mount command.

The following steps take place.

  1. The port mapper is started on the server, normally when the server bootstraps.
  2. The mount daemon (mountd) is started on the server, after the port mapper. It creates a TCP end point and a UDP end point, and assigns an ephemeral port number to each. It then registers these port numbers with the port mapper.
  3. The mount command is executed on the client and it issues an RPC call to the port mapper on the server to obtain the port number of the server's mount daemon. Either TCP or UDP can be used for this client exchange with the port mapper, but UDP is normally used.
  4. The port mapper replies with the port number.
  5. The mount command issues an RPC call to the mount daemon to mount a filesystem on the server. Again, either TCP or UDP can be used, but UDP is typical. The server can now validate the client, using the client's IP address and port number, to see if the server lets this client mount the specified filesystem.
  6. The mount daemon replies with the file handle for the given filesystem.
  7. The mount command issues the mount system call on the client to associate the file handle returned in step 6 with a local mount point on the client. This file handle is stored in the NFS client code, and from this point on any references by user processes to files on that server's filesystem will use that file handle as the starting point.

This implementation technique puts all the mount processing, other than the mount system call on the client, in user processes, instead of the kernel. The three programs we show-the mount command, the port mapper, and the mount daemon-are all user processes. As an example, on our host sun (the NFS client) we execute

sun # mount -t nfs bsdi:/usr /nfs/bsdi/usr

This mounts the directory /usr on the host bsdi (the NFS server) as the local filesystem /nfs/bsdi/usr. Figure 29.6 shows the result.


Figure 29.6 Mounting the bsdi:/usr directory as /nfs/bsdi/usr on the host sun.

When we reference the file /nfs/bsdi/usr/rstevens/hello.c on the client sun we are really referencing the file /usr/rstevens/hello.c on the server bsdi.
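To tie these steps together, here is a hedged sketch of how a user-level mount command could obtain the file handle for bsdi:/usr using the generic Sun RPC library routines clnt_create() and clnt_call(). The program and procedure numbers and the fhstatus layout follow the version 1 mount protocol in RFC 1094, but the declarations are ours and should be checked against a system's <rpcsvc/mount.h> before use.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/time.h>
#include <rpc/rpc.h>

#define MOUNTPROG      100005L  /* mount daemon program number  */
#define MOUNTVERS      1L       /* mount protocol version 1     */
#define MOUNTPROC_MNT  1L       /* "mount this directory"       */
#define FHSIZE         32       /* version 2 file handle size   */

struct fhstatus {
    u_int fhs_status;           /* 0 on success, else an errno value          */
    char  fhs_fh[FHSIZE];       /* opaque file handle, valid when status == 0 */
};

/* XDR routine for the mount daemon's reply (RFC 1094, Appendix A). */
static bool_t
xdr_fhstatus(XDR *xdrs, struct fhstatus *fhp)
{
    if (!xdr_u_int(xdrs, &fhp->fhs_status))
        return FALSE;
    if (fhp->fhs_status == 0)
        return xdr_opaque(xdrs, fhp->fhs_fh, FHSIZE);
    return TRUE;
}

int
main(void)
{
    CLIENT *clnt;
    struct fhstatus fhs;
    struct timeval tv = { 25, 0 };
    char *dirpath = "/usr";     /* the server's filesystem to mount */
    enum clnt_stat stat;

    /* Steps 3 and 4: clnt_create() queries the server's port mapper for
     * the mount daemon's UDP port and builds an RPC client handle. */
    clnt = clnt_create("bsdi", MOUNTPROG, MOUNTVERS, "udp");
    if (clnt == NULL) {
        clnt_pcreateerror("bsdi");
        return 1;
    }

    /* Steps 5 and 6: ask the mount daemon for the file handle of /usr. */
    memset(&fhs, 0, sizeof(fhs));
    stat = clnt_call(clnt, MOUNTPROC_MNT,
                     (xdrproc_t) xdr_wrapstring, (char *) &dirpath,
                     (xdrproc_t) xdr_fhstatus, (char *) &fhs, tv);
    if (stat != RPC_SUCCESS) {
        clnt_perror(clnt, "MOUNTPROC_MNT");
        return 1;
    }
    printf("mount status = %u\n", fhs.fhs_status);
    /* On success, fhs.fhs_fh holds the 32-byte file handle that step 7
     * (the mount system call) hands to the NFS client code in the kernel. */
    return 0;
}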

NFS Procedures

The NFS server provides 15 procedures, which we now describe. (The numbers we use are not the same as the NFS procedure numbers, since we have grouped them according to functionality.) Although NFS was designed to work between different operating systems, and not just Unix systems, some of the procedures provide Unix functionality that might not be supported by other operating systems (e.g., hard links, symbolic links, group owner, execute permission, etc.). Chapter 4 of [Stevens 1992] contains additional information on the properties of Unix filesystems, some of which are assumed by NFS.

  1. GETATTR. Return the attributes of a file; type of file (regular file, directory, etc.), permissions, size of file, owner of file, last-access time, and so on.
  2. SETATTR. Set the attributes of a file. Only a subset of the attributes can be set: permissions, owner, group owner, size, last-access time, and last-modification time.
  3. STATFS. Return the status of a filesystem: amount of available space, optimal size for transfer, and so on. Used by the Unix df command, for example.
  4. LOOKUP. Lookup a file. This is the procedure called by the client each time a user process opens a file that's on an NFS server. A file handle is returned, along with the attributes of the file.
  5. READ. Read from a file. The client specifies the file handle, starting byte offset, and maximum number of bytes to read (up to 8192).
  6. WRITE. Write to a file. The client specifies the file handle, starting byte offset, number of bytes to write, and the data to write.
     NFS writes are required to be synchronous. The server cannot respond OK until it has successfully written the data (and any other file information that gets updated) to disk.
  7. CREATE. Create a file.
  8. REMOVE. Delete a file.
  9. RENAME. Rename a file.
  10. LINK. Make a hard link to a file. A hard link is a Unix concept whereby a given file on disk can have any number of directory entries (i.e., names, also called hard links) that point to the file.
  11. SYMLINK. Create a symbolic link to a file. A symbolic link is a file that contains the name of another file. Most operations that reference the symbolic link (e.g., open) really reference the file pointed to by the symbolic link.
  12. READLINK. Read a symbolic link, that is, return the name of the file to which the symbolic link points.
  13. MKDIR. Create a directory.
  14. RMDIR. Delete a directory.
  15. READDIR. Read a directory. Used by the Unix ls command, for example.

These procedure names actually begin with the prefix NFSPROC_, which we've dropped.

UDP or TCP?

NFS was originally written to use UDP, and that's what all vendors provide. Newer implementations, however, also support TCP. TCP support is provided for use on wide area networks, which are getting faster over time. NFS is no longer restricted to local area use.

The network dynamics can change drastically when going from a LAN to a WAN. The round-trip times can vary widely and congestion is more frequent. These characteristics of WANs led to the algorithms we examined with TCP - slow start and congestion avoidance. Since UDP does not provide anything like these algorithms, either the same algorithms must be put into the NFS client and server or TCP should be used.

NFS Over TCP

The Berkeley Net/2 implementation of NFS supports either UDP or TCP. [Macklem 1991] describes this implementation. Let's look at the differences when TCP is used.

  1. When the server bootstraps, it starts an NFS server that does a passive open on TCP port 2049, waiting for client connection requests. This is usually in addition to the normal NFS UDP server that waits for incoming datagrams to UDP port 2049.
  2. When the client mounts the server's filesystem using TCP, it does an active open to TCP port 2049 on the server. This results in a TCP connection between the client and server for this filesystem. If the same client mounts another filesystem on the same server, another TCP connection is created.
  3. Both the client and server set TCP's keepalive option on their ends of the connection (Chapter 23). This lets either end detect if the other end crashes, or crashes and reboots.
  4. All applications on the client that use this server's filesystem share the single TCP connection for this filesystem. For example, in Figure 29.6 if there were another directory named smith beneath /usr on bsdi, references to files in /nfs/bsdi/usr/rstevens and /nfs/bsdi/usr/smith would share the same TCP connection.
  5. If the client detects that the server has crashed, or crashed and rebooted (by receiving a TCP error of either "connection timed out" or "connection reset by peer"), it tries to reconnect to the server. The client does another active open to reestablish the TCP connection with the server for this filesystem. Any client requests that timed out on the previous connection are reissued on the new connection.
  6. If the client crashes, so do the applications that are running when it crashes. When the client reboots, it will probably remount the server's filesystem using TCP, resulting in another TCP connection to the server. The previous connection between this client and server for this filesystem is half-open (the server thinks it's still open), but since the server set the keepalive option, this half-open connection will be terminated when the next keepalive probe is sent by the server's TCP.
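As a side note on item 3 in this list, in a user process the keepalive option is enabled with a single setsockopt() call; an in-kernel NFS client or server does the equivalent on its end of the TCP connection. A minimal sketch:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Enable SO_KEEPALIVE on a connected TCP socket so that a crashed or
 * rebooted peer is eventually detected, as the NFS client and server
 * do on their ends of the connection. */
int
enable_keepalive(int sockfd)
{
    int on = 1;

    if (setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        perror("setsockopt(SO_KEEPALIVE)");
        return -1;
    }
    return 0;
}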

Over time, additional vendors plan to support NFS over TCP.

29.6 NFS Examples

Let's use tcpdump to see which NFS procedures are invoked by the client for typical file operations. When tcpdump detects a UDP datagram containing an RPC call (call equals 0 in Figure 29.1) with a destination port of 2049, it decodes the datagram as an NFS request. Similarly if the UDP datagram is an RPC reply (reply equals 1 in Figure 29.2) with a source port of 2049, it decodes the datagram as an NFS reply.

Simple Example: Reading a File

Our first example just copies a file to the terminal using the cat(1) command, but the file is on an NFS server:
sun % cat /nfs/bsdi/usr/rstevens/hello.c copy file to terminal
main()
{
        printf("hello, world\n");
}

On the host sun (the NFS client) the filesystem /nfs/bsdi/usr is really the /usr file-system on the host bsdi (the NFS server), as shown in Figure 29.6. The kernel on sun detects this when cat opens the file, and uses NFS to access the file. Figure 29.7 shows the tcpdump output.

 1   0.0                 sun.7aa6 > bsdi.nfs: 104 getattr
 2   0.003587 (0.0036)   bsdi.nfs > sun.7aa6: reply ok 96
 3   0.005390 (0.0018)   sun.7aa7 > bsdi.nfs: 116 lookup "rstevens"
 4   0.009570 (0.0042)   bsdi.nfs > sun.7aa7: reply ok 128
 5   0.011413 (0.0018)   sun.7aa8 > bsdi.nfs: 116 lookup "hello.c"
 6   0.015512 (0.0041)   bsdi.nfs > sun.7aa8: reply ok 128
 7   0.018843 (0.0033)   sun.7aa9 > bsdi.nfs: 104 getattr
 8   0.022377 (0.0035)   bsdi.nfs > sun.7aa9: reply ok 96
 9   0.027621 (0.0052)   sun.7aaa > bsdi.nfs: 116 read 1024 bytes @ 0
10   0.032170 (0.0045)   bsdi.nfs > sun.7aaa: reply ok 140

Figure 29.7 NFS operations to read a file.

When tcpdump decodes an NFS request or reply, it prints the XID field for the client, instead of the port number. The XID field in lines 1 and 2 is 0x7aa6.

The filename /nfs/bsdi/usr/rstevens/hello.c is processed by the open function in the client kernel one element at a time. When it reaches /nfs/bsdi/usr it detects that this is a mount point to an NFS mounted filesystem.

In line 1 the client calls the GETATTR procedure to fetch the attributes of the server's directory that the client has mounted (/usr). This RPC request contains 104 bytes of data, exclusive of the IP and UDP headers. The reply in line 2 has a return value of OK and contains 96 bytes of data, exclusive of the IP and UDP headers. We see in this figure that the minimum NFS message contains around 100 bytes of data.

In line 3 the client calls the LOOKUP procedure for the file rstevens and receives an OK reply in line 4. The LOOKUP specifies the filename rstevens and the file handle that was saved by the kernel when the remote filesystem was mounted. The reply contains a new file handle that is used in the next step.

In line 5 the client does a LOOKUP of hello.c using the file handle from line 4. It receives another file handle in line 6. This new file handle is what the client uses in lines 7 and 9 to reference the file /nfs/bsdi/usr/rstevens/hello.c. We see that the client does a LOOKUP for each component of the pathname that is being opened.

In line 7 the client does another GETATTR, followed by a READ in line 9. The client asks for 1024 bytes, starting at offset 0, but receives less. (After subtracting the sizes of the RPC fields, and the other values returned by the READ procedure, 38 bytes of data are returned in line 10. This is indeed the size of the file hello.c.)

In this example the user process knows nothing about these NFS requests and replies that are being done by the kernel. The application just calls the kernel's open function, which causes 3 requests and 3 replies to be exchanged (lines 1-6), and then calls the kernel's read function, which causes 2 requests and 2 replies (lines 7-10). It is transparent to the client application that the file is on an NFS server.

Simple Example: Creating a Directory

As another simple example we'll change our working directory to a directory that's on an NFS server, and then create a new directory:
sun % cd /nfs/bsdi/usr/rstevens change working directory
sun % mkdir Mail and create a directory

Figure 29.8 shows the tcpdump output.

 1   0.0                 sun.7ad2 > bsdi.nfs: 104 getattr
 2   0.004912 ( 0.0049)  bsdi.nfs > sun.7ad2: reply ok 96
 3   0.007266 ( 0.0024)  sun.7ad3 > bsdi.nfs: 104 getattr
 4   0.010846 ( 0.0036)  bsdi.nfs > sun.7ad3: reply ok 96
 5  35.769875 (35.7590)  sun.7ad4 > bsdi.nfs: 104 getattr
 6  35.773432 ( 0.0036)  bsdi.nfs > sun.7ad4: reply ok 96
 7  35.775236 ( 0.0018)  sun.7ad5 > bsdi.nfs: 112 lookup "Mail"
 8  35.780914 ( 0.0057)  bsdi.nfs > sun.7ad5: reply ok 28
 9  35.782339 ( 0.0014)  sun.7ad6 > bsdi.nfs: 144 mkdir "Mail"
10  35.992354 ( 0.2100)  bsdi.nfs > sun.7ad6: reply ok 128

Figure 29.8 NFS operations for cd to NFS directory, then mkdir.

Changing our directory causes the client to call the GETATTR procedure twice (lines 1-4). When we create the new directory, the client calls the GETATTR procedure (lines 5 and 6), followed by a LOOKUP (lines 7 and 8, to verify that the directory doesn't already exist), followed by a MKDIR to create the directory (lines 9 and 10). The reply of OK in line 8 doesn't mean that the directory exists. It just means the procedure returned. tcpdump doesn't interpret the return values from the NFS procedures; it normally prints OK and the number of bytes of data in the reply.

Statelessness

One of the features of NFS (critics of NFS would call this a wart, not a feature) is that the NFS server is stateless. The server does not keep track of which clients are accessing which files. Notice in the list of NFS procedures shown earlier that there is not an open procedure or a close procedure. The LOOKUP procedure is similar to an open, but the server never knows if the client is really going to reference the file after the client does a LOOKUP.

The reason for a stateless design is to simplify the crash recovery of the server after it crashes and reboots.

Example: Server Crash

In the following example we are reading a file from an NFS server when the server crashes and reboots. This shows how the stateless server approach lets the client "not know" that the server crashes. Other than a time pause while the server crashes and reboots, the client is unaware of the problem, and the client application is not affected.

On the client sun we start a cat of a long file (/usr/share/lib/termcap on the NFS server svr4), disconnect the Ethernet cable during the transfer, shut down and reboot the server, then reconnect the cable. The client was configured to read 1024 bytes per NFS read. Figure 29.9 shows the tcpdump output.

  1   0.0                  sun.7ade > svr4.nfs: 104 getattr
  2   0.007653 ( 0.0077)   svr4.nfs > sun.7ade: reply ok 96
  3   0.009041 ( 0.0014)   sun.7adf > svr4.nfs: 116 lookup "share"
  4   0.017237 ( 0.0082)   svr4.nfs > sun.7adf: reply ok 128
  5   0.018518 ( 0.0013)   sun.7ae0 > svr4.nfs: 112 lookup "lib"
  6   0.026802 ( 0.0083)   svr4.nfs > sun.7ae0: reply ok 128
  7   0.028096 ( 0.0013)   sun.7ae1 > svr4.nfs: 116 lookup "termcap"
  8   0.036434 ( 0.0083)   svr4.nfs > sun.7ae1: reply ok 128
  9   0.038060 ( 0.0016)   sun.7ae2 > svr4.nfs: 104 getattr
 10   0.045821 ( 0.0078)   svr4.nfs > sun.7ae2: reply ok 96
 11   0.050984 ( 0.0052)   sun.7ae3 > svr4.nfs: 116 read 1024 bytes @ 0
 12   0.084995 ( 0.0340)   svr4.nfs > sun.7ae3: reply ok 1124
                           reading continues
128   3.430313 ( 0.0013)   sun.7b22 > svr4.nfs: 116 read 1024 bytes @ 64512
129   3.441828 ( 0.0115)   svr4.nfs > sun.7b22: reply ok 1124
130   4.125031 ( 0.6832)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
131   4.868593 ( 0.7436)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
132   4.993021 ( 0.1244)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
133   5.732217 ( 0.7392)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
134   6.732084 ( 0.9999)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
135   7.472098 ( 0.7400)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
136  10.211964 ( 2.7399)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
137  10.951960 ( 0.7400)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
138  17.171767 ( 6.2198)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
139  17.911762 ( 0.7400)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
140  31.092136 (13.1804)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
141  31.831432 ( 0.7393)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
142  51.090854 (19.2594)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
143  51.830939 ( 0.7401)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
144  71.090305 (19.2594)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
145  71.830155 ( 0.7398)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
                           retransmissions continue
167 291.824285 ( 0.7400)   sun.7b24 > svr4.nfs: 116 read 1024 bytes @ 73728
168 311.083676 (19.2594)   sun.7b23 > svr4.nfs: 116 read 1024 bytes @ 65536
                           server reboots
169 311.149476 ( 0.0658)   arp who-has sun tell svr4
170 311.150004 ( 0.0005)   arp reply sun is-at 8:0:20:3:f6:42
171 311.154852 ( 0.0048)   svr4.nfs > sun.7b23: reply ok 1124
172 311.156671 ( 0.0018)   sun.7b25 > svr4.nfs: 116 read 1024 bytes @ 66560
173 311.168926 ( 0.0123)   svr4.nfs > sun.7b25: reply ok 1124
                           reading continues

Figure 29.9 Client reading a file when an NFS server crashes and reboots.

Lines 1-10 correspond to the client opening the file. The operations are similar to those shown in Figure 29.7. In line 11 we see the first READ of the file, with 1024 bytes of data returned in line 12. This continues (a READ of 1024 followed by a reply of OK) through line 129.

In lines 130 and 131 we see two requests that time out and are retransmitted in lines 132 and 133. The first question is why are there two read requests, one starting at offset 65536 and the other starting at 73728? The client kernel has detected that the client application is performing sequential reads, and is trying to prefetch data blocks. (Most Unix kernels do this read-ahead.) The client kernel is also running multiple NFS block I/O daemons (biod processes) that try to generate multiple RPC requests on behalf of clients. One daemon is reading 8192 bytes starting at 65536 (in 1024-byte chunks) and the other is performing the read-ahead of 8192 bytes starting at 73728.

Client retransmissions occur in lines 130-168. In line 169 we see the server has rebooted, and it sends an ARP request before it can reply to the client's NFS request in line 168. The response to line 168 is sent in line 171. The client READ requests continue.

The client application never knows that the server crashes and reboots, and except for the 5-minute pause between lines 129 and 171, this server crash is transparent to the client.

To examine the timeout and retransmission interval in this example, realize that there are two client daemons with their own timeouts. The intervals for the first daemon (reading at offset 65536), rounded to two decimal places, are: 0.68, 0.87, 1.74, 3.48, 6.96, 13.92, 20.0, 20.0, 20.0, and so on. The intervals for the second daemon (reading at offset 73728) are the same (to two decimal places). It appears that these NFS clients are using a timeout that is a multiple of 0.875 seconds with an upper bound of 20 seconds. After each timeout the retransmission interval is doubled: 0.875, 1.75, 3.5, 7.0, and 14.0.
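A few lines of C reproduce this doubling schedule; the 0.875-second base and the 20-second ceiling are values inferred from this particular trace, not constants specified by NFS.

#include <stdio.h>

int
main(void)
{
    /* Values inferred from the trace in Figure 29.9: the timeout starts
     * at 0.875 seconds, doubles after each retransmission, and is capped
     * at 20 seconds. */
    double timeout = 0.875;
    int i;

    for (i = 0; i < 10; i++) {
        printf("retransmission %2d: timeout %.3f seconds\n", i + 1, timeout);
        timeout *= 2;
        if (timeout > 20.0)
            timeout = 20.0;
    }
    return 0;
}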

How long does the client retransmit? The client has two options that affect this. First, if the server's filesystem is mounted hard, the client retransmits forever, but if the server's filesystem is mounted soft, the client gives up after a fixed number of retransmissions. Also, with a hard mount the client has an option of whether to let the user interrupt the infinite retransmissions or not. If the client host specifies interruptibility when it mounts the server's filesystem, then instead of waiting 5 minutes for the server to reboot after it crashes, we can type our interrupt key to abort the client application.

Idempotent Procedures

An RPC procedure is called idempotent if it can be executed more than once by the server and still return the same result. For example, the NFS read procedure is idempotent. As we saw in Figure 29.9, the client just reissues a given READ call until it gets a response. In our example the reason for the retransmission was that the server had crashed. If the server hasn't crashed, and the RPC reply message is lost (since UDP is unreliable), the client just retransmits and the server performs the same READ again. The same portion of the same file is read again and sent back to the client.

This works because each READ request specifies the starting offset of the read. If there were an NFS procedure asking the server to read the next N bytes of a file, this wouldn't work. Unless the server is made stateful (as opposed to stateless), if a reply is lost and the client reissues the READ for the next N bytes, the result is different. This is why the NFS READ and WRITE procedures have the client specify the starting offset. The client maintains the state (the current offset of each file), not the server.

Unfortunately, not all filesystem operations are idempotent. For example, consider the following steps: the client NFS issues the REMOVE request to delete a file; the server NFS deletes the file and responds OK; the server's response is lost; the client NFS times out and retransmits the request; the server NFS can't find the file and responds with an error; the client application receives an error saying the file doesn't exist. This error return to the client application is wrong-the file did exist and was deleted.

The NFS operations that are idempotent are: GETATTR, STATFS, LOOKUP, READ, WRITE, READLINK, and READDIR. The procedures that are not idempotent are: CREATE, REMOVE, RENAME, LINK, SYMLINK, MKDIR, and RMDIR. SETATTR is normally idempotent, unless it's being used to truncate a file.

Since lost responses can always happen with UDP, NFS servers need a way to handle the nonidempotent operations. Most servers implement a recent-reply cache in which they store recent replies for the nonidempotent operations. Each time the server receives a request, it first checks this cache, and if a match is found, returns the previous reply instead of calling the NFS procedure again. [Juszczak 1989] provides details on this type of cache.
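A toy sketch of the idea behind a recent-reply cache follows. A real cache, such as the one described in [Juszczak 1989], also keys on the client's address, port, program, and procedure number and ages entries out; this version keys only on the XID and uses made-up sizes.

#include <string.h>
#include <stdint.h>

#define CACHE_SLOTS 64
#define MAX_REPLY   8800        /* arbitrary reply-size limit for this sketch */

struct cache_entry {
    int      valid;
    uint32_t xid;               /* XID of a request already answered */
    size_t   len;               /* length of the saved reply */
    char     reply[MAX_REPLY];  /* the reply we sent the first time */
};

static struct cache_entry cache[CACHE_SLOTS];

/* If this XID was answered recently, copy the old reply into 'reply'
 * and return its length; the caller resends it instead of re-executing
 * a nonidempotent procedure such as REMOVE.  Returns 0 on a miss. */
size_t
cache_lookup(uint32_t xid, char *reply)
{
    struct cache_entry *e = &cache[xid % CACHE_SLOTS];

    if (e->valid && e->xid == xid) {
        memcpy(reply, e->reply, e->len);
        return e->len;
    }
    return 0;
}

/* Remember the reply we are about to send for this XID. */
void
cache_insert(uint32_t xid, const char *reply, size_t len)
{
    struct cache_entry *e = &cache[xid % CACHE_SLOTS];

    if (len > MAX_REPLY)
        return;                 /* too big to cache in this sketch */
    e->valid = 1;
    e->xid   = xid;
    e->len   = len;
    memcpy(e->reply, reply, len);
}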

This concept of idempotent server procedures applies to any UDP-based application, not just NFS. The DNS, for example, provides an idempotent service. A DNS server can execute a resolver's request any number of times with no ill effects (other than wasted network resources).

29.7 NFS Version 3

During 1993 the specifications for version 3 of the NFS protocol were released [Sun Microsystems 1994]. Implementations are expected to become available during 1994.

Here we summarize the major differences between versions 2 and 3. We'll refer to the two as V2 and V3.

  1. The file handle in V2 is a fixed-size array of 32 bytes. With V3 it becomes a variable-length array up to 64 bytes. A variable-length array in XDR is encoded with a 4-byte count, followed by the actual bytes. This reduces the size of the file handle on implementations such as Unix that only need about 12 bytes, but allows non-Unix implementations to maintain additional information.
  2. V2 limits the number of bytes per READ or WRITE RPC to 8192 bytes. This limit is removed in V3, meaning an implementation over UDP is limited only by the IP datagram size (65535 bytes). This allows larger read and write packets on faster networks.
  3. File sizes and the starting byte offsets for the READ and WRITE procedures are extended from 32 to 64 bits, allowing larger file sizes.
  4. A file's attributes are returned on every call that affects the attributes. This reduces the number of GETATTR calls required by the client.
  5. WRITEs can be asynchronous, instead of the synchronous WRITEs required by V2. This can improve WRITE performance.
  6. One procedure was deleted (STATFS) and seven were added: ACCESS (check file access permissions), MKNOD (create a Unix special file), READDIRPLUS (returns names of files in a directory along with their attributes), FSINFO (returns the static information about a filesystem), FSSTAT (returns the dynamic information about a filesystem), PATHCONF (returns the POSIX.1 information about a file), and COMMIT (commit previous asynchronous writes to stable storage).

29.8 Summary

RPC is a way to build a client-server application so that it appears that the client just calls server procedures. All the networking details are hidden in the client and server stubs, which are generated for an application by the RPC package, and in the RPC library routines. We showed the format of the RPC call and reply messages, and mentioned that XDR is used to encode the values, allowing RPC clients and servers to run on machines with different architectures.

One of the most widely used RPC applications is Sun's NFS, a heterogeneous file access protocol that is widely implemented on hosts of all sizes. We looked at NFS and the way that it uses UDP and TCP. Fifteen procedures define the NFS Version 2 protocol.

A client's access to an NFS server starts with the mount protocol, returning a file handle to the client. The client can then access files on the server's filesystem using that file handle. Filenames are looked up on the server one element at a time, returning a new file handle for each element. The end result is a file handle for the file being referenced, which is used in subsequent reads and writes.

NFS tries to make all its procedures idempotent, so that the client can just reissue a request if the response gets lost. We saw an example of this with a client reading a file while the server crashed and rebooted.

Exercises

29.1 In Figure 29.7 we saw that tcpdump interpreted the packets as NFS requests and replies, printing the XID. Can tcpdump do this for any RPC request or reply?

29.2 On a Unix system, why do you think RPC server programs use ephemeral ports and not well-known ports?

29.3 An RPC client calls two server procedures. The first server procedure takes 5 seconds to execute, and the second procedure takes 1 second to execute. The client has a timeout of 4 seconds. Draw a time line of what's exchanged between the client and server. (Assume it takes no time for messages from the client to the server, and vice versa.)

29.4 What would happen in the example shown in Figure 29.9 if, while the NFS server were down, its Ethernet card were replaced?

29.5 When the server reboots in Figure 29.9, it handles the request starting at byte offset 65536 (lines 168 and 171), and then handles the next request starting at offset 66560 (lines 172 and 173). What happened to the request starting at offset 73728 (line 167)?

29.6 When we described idempotent NFS procedures we gave an example of a REMOVE reply being lost in the network. What happens in this case if TCP is used, instead of UDP?

29.7 If the NFS server used an ephemeral port instead of 2049, what would happen to an NFS client when the server crashes and reboots?

29.8 Reserved port numbers (Section 1.9) are scarce, since there are a maximum of 1023 per host. If an NFS server requires its clients to have reserved ports (which is common) and an NFS client using TCP mounts N filesystems on N different servers, does the client need a different reserved port number for each connection?