[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1406872036.24270.7.camel@perseus.fritz.box>
Date: Fri, 01 Aug 2014 13:47:16 +0800
From: Ian Kent <raven@...maw.net>
To: NeilBrown <neilb@...e.de>
Cc: autofs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH/RFC] autofs: the documentation I wanted to read
On Tue, 2014-07-29 at 12:00 +1000, NeilBrown wrote:
>
> This documents autofs from the perspective of what the module actually
> supports rather than how automount is expected to use it.
> It is based mostly on code review and very little on testing so it
> may be inaccurate in some places.
>
> The document assumes the functionality added by the RCU-walk patches
> that I posted recently.
>
> It is formatted using "markdown" and works best with Markdown.pl
> (markdown_py doesn't like some constructs).
>
>
> Signed-off-by: NeilBrown <neilb@...e.de>
Acked-by: Ian Kent <raven@...maw.net>
There are a couple of places that might need more work but it's better
to have this now and to work with it in future than to hold it up.
Especially since I can't quite nail down what it was that didn't quite
sound right when reading it.
Excellent job Neil, thanks very much.
Ian
>
> diff --git a/Documentation/filesystems/autofs4.txt b/Documentation/filesystems/autofs4.txt
> new file mode 100644
> index 000000000000..45f67c83d713
> --- /dev/null
> +++ b/Documentation/filesystems/autofs4.txt
> @@ -0,0 +1,503 @@
> +<head>
> +<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
> +</head>
> +
> +autofs - how it works
> +=====================
> +
> +Purpose
> +-------
> +
> +The goal of autofs is to provide on-demand mounting and race free
> +automatic unmounting of various other filesystems. This provides two
> +key advantages:
> +
> +1. There is no need to delay boot until all filesystems that
> + might be needed are mounted. Processes that try to access those
> + slow filesystems might be delayed but other processes can
> + continue freely. This is particularly important for
> + network filesystems (e.g. NFS) or filesystems stored on
> + media with a media-changing robot.
> +
> +2. The names and locations of filesystems can be stored in
> + a remote database and can change at any time. The content
> + in that data base at the time of access will be used to provide
> + a target for the access. The interpretation of names in the
> + filesystem can even be programatic rather than database-backed,
> + allowing wildcards for example, and can vary based on the user who
> + first accessed a name.
> +
> +Context
> +-------
> +
> +The "autofs4" filesystem module is only one part of an autofs system.
> +There also needs to be a user-space program which looks up names
> +and mounts filesystems. This will often be the "automount" program,
> +though other tools including "systemd" can make use of "autofs4".
> +This document describes only the kernel module and the interactions
> +required with any user-space program. Subsequent text refers to this
> +as the "automount daemon" or simply "the daemon".
> +
> +"autofs4" is a Linux kernel module with provides the "autofs"
> +filesystem type. Several "autofs" filesystems can be mounted and they
> +can each be managed separately, or all managed by the same daemon.
> +
> +Content
> +-------
> +
> +An autofs filesystem can contain 3 sorts of objects: directories,
> +symbolic links and mount traps. Mount traps are directories with
> +extra properties as described in the next section.
> +
> +Objects can only be created by the automount daemon: symlinks are
> +created with a regular `symlink` systemcall, while directories and
> +mount traps are created with `mkdir`. The determination of whether a
> +directory should be a mount trap or not is quite _ad hoc_, largely for
> +historical reasons, and is determined in part the
> +*direct*/*indirect*/*offset* mount options, and the *maxproto* mount option.
> +
> +If neither the *direct* or *offset* mount options are given (so the
> +mount is considered to be *indirect*), then the root directory is
> +always a regular directory, otherwise it is a mount trap when it is
> +empty and a regular directory when not empty. Note that *direct* and
> +*offset* are treated identically so a concise summary is that the root
> +directory is a mount trap only if the filesystem is mounted *direct*
> +and the root is empty.
> +
> +Directories created in the root directory are mount traps only if the
> +filesystem is mounted *indirect* and they are empty.
> +
> +Directories further down the tree depend on the *max_proto* mount
> +option and particularly whether it is less than five or not.
> +When *max_proto* is five, no directories further down the
> +tree are ever mount traps, they are always regular directories. When
> +the *max_proto* is four (or three), these directories are mount traps
> +precisely when they are empty.
> +
> +So: non-empty (i.e. non-leaf) directories are never mount traps. Empty
> +directories are sometimes mount traps, and sometimes not depending on
> +where in the tree they are (root, top level, or lower) the *maxproto*,
> +and whether the mount was *indirect* or not.
> +
> +Mount Traps
> +---------------
> +
> +A core element of the implementation of autofs is the Mount Traps
> +which are provided by the Linux VFS. Any directory provided by a
> +filesystem can be designated as a trap. This involves two separate
> +features that work together to allow autofs to do its job.
> +
> +**DCACHE_NEED_AUTOMOUNT**
> +
> +If a dentry has the DCACHE_NEED_AUTOMOUNT flag set (which gets set if
> +the inode has S_AUTOMOUNT set, or can be set directly) then it is
> +(potentially) a mount trap. Any access to this directory beyond a
> +"`stat`" will (normally) cause the `d_op->d_automount()` dentry operation
> +to be called. The task of this method is to find the filesystem that
> +should be mounted on the directory and to return it. The VFS is
> +responsibly for actually mounting the root of this filesystem on the
> +directory.
> +
> +autofs doesn't find the filesystem itself but sends a message to the
> +automount daemon asking it to find and mount the filesystem. The
> +autofs `d_automount` method then waits for the daemon to report that
> +everything is ready. It will then return "`NULL`" indicating that the
> +mount has already happened. The VFS doesn't try to mount anything but
> +follows down the mount that is already there.
> +
> +This functionality is sufficient for some users of mount traps such
> +as NFS which creates traps so that mountpoints on the server can be
> +reflected on the client. However it is not sufficient for autofs. As
> +mounting onto a directory is considered to be "beyond a `stat`", the
> +automount daemon would not be able to mount a filesystem on the 'trap'
> +directory without some way to avoid getting caught in the trap. For
> +that purpose there is another flag.
> +
> +**DCACHE_MANAGE_TRANSIT**
> +
> +If a dentry has DCACHE_MANAGE_TRANSIT set then two very different but
> +related behaviors are invoked, both using the `d_op->d_manage()`
> +dentry operation.
> +
> +Firstly, before checking to see if any filesystem is mounted on the
> +directory, d_manage() will be called with the `rcu_walk` parameter set
> +to `false`. It may return one of three things:
> +
> +- A return value of zero indicates that there is nothing special
> + about this dentry and normal checks for mounts and automounts
> + should proceed.
> +
> + autofs normally returns zero, but first waits for any
> + expiry (automatic unmounting of the mounted filesystem) to
> + complete. This avoids races.
> +
> +- A return value of `-EISDIR` tells the VFS to ignore any mounts
> + on the directory and to not consider calling `->d_automount()`.
> + This effectively disables the **DCACHE_NEED_AUTOMOUNT** flag
> + causing the directory not be a mount trap after all.
> +
> + autofs returns this if it detects that the process performing the
> + lookup is the automount daemon and that the mount has been
> + requested but has not yet completed. How it determines this is
> + discussed later. This allows the automount daemon not to get
> + caught in the mount trap.
> +
> + There is a subtlety here. It is possibly that a second autofs
> + filesystem can be mounted below the first and for both of them to
> + be managed by the same daemon. For the daemon to be able to mount
> + something on the second it must be able to "walk" down past the
> + first. This means that d_manage cannot *always* return -EISDIR for
> + the automount daemon. It must only return it when a mount has
> + been requested, but has not yet completed.
> +
> + `d_manage` also returns `-EISDIR` if the dentry shouldn't be a
> + mount trap, either because it is a symbolic link or because it is
> + not empty.
> +
> +- Any other negative value is treated as an error and returned
> + to the caller.
> +
> + autofs can return
> +
> + - -ENOENT if the automount daemon failed to mount anything,
> + - -ENOMEM if it ran out of memory,
> + - -EINTR if a signal arrived while waiting for expiry to
> + complete
> + - or any other error sent down by the automount daemon.
> +
> +
> +The second use case only occurs during an "RCU-walk" and so `rcu_walk`
> +will be set.
> +
> +An RCU-walk is a fast and light weight process for walking down a
> +filename path (i.e. it is like running on tip-toes). RCU-walk cannot
> +cope with all situations so when it finds a difficulty it falls back
> +to "REF-walk", which is slower but more robust.
> +
> +RCU-walk will never call `->d_automount`, the filesystems must already
> +be mounted or RCU-walk cannot handle the path.
> +To determine if a mount-trap is safe for RCU-walk mode it calls
> +`->d_manage()` with `rcu_walk` set to `true`.
> +
> +In this case `d_manage()` must avoid blocking and should avoid taking
> +spinlocks if at all possible. Its sole purpose is to determine if it
> +would be safe to follow down into any mounted directory and the only
> +reason that it might not be is if an expiry of the mount is
> +underway.
> +
> +In the `rcu_walk` case, `d_manage()` cannot return -EISDIR to tell the
> +VFS that this is a directory that doesn't require d_automount. If
> +`rcu_walk` sees a dentry with DCACHE_NEED_AUTOMOUNT set but nothing
> +mounted, it *will* fall back to REF-walk. `d_manage()` cannot make the
> +VFS remain in RCU-walk mode, but can only tell it to get out of
> +RCU-walk mode by returning `-ECHILD`.
> +
> +So `d_manage()`, when called with `rcu_walk` set, should either return
> +-ECHILD if there is any reason to believe it is unsafe to end the
> +mounted filesystem, and otherwise should return 0.
> +
> +autofs will return `-ECHILD` if an expiry of the filesystem has been
> +initiated or is being considered, otherwise it returns 0.
> +
> +
> +Mountpoint expiry
> +-----------------
> +
> +The VFS has a mechansim for automatically expiring unused mounts,
> +much as it can expire any unused dentry information from the dcache.
> +This is guided by the MNT_SHRINKABLE flag. This only applies to
> +mounts that were created by `d_automount()` returning a filesystem to be
> +mounted. As autofs doesn't return such a filesystem be leaves the
> +mounting to the automount daemon, it must involve the automount daemon
> +in unmounting as well. This also means that autofs has more control
> +of expiry.
> +
> +The VFS also supports "expiry" of mounts using the MNT_EXPIRE flag to
> +the `umount` systemcall. Unmounting with MNT_EXPIRE will fail unless
> +a previous attempt had been made, and the filesystem was been inactive
> +and untouched since that previous attempt. autofs4 does not depend on
> +this but has its own internal tracking of whether filesystems were
> +recently used. This allows individual names in the autofs directory
> +to expire separately.
> +
> +With version 4 of the protocol, the automount daemon can try to
> +unmount any filesystems mounted on the autofs filesystem or remove any
> +symbolic links or empty directories any time it likes. If the unmount
> +or removal is successful the filesystem will be returned to the state
> +it was before the mount or creation, so that any access of the name
> +will trigger normal auto-mount processing. In particlar, `rmdir` and
> +`unlink` do not leave negative entries in the dcache as a normal
> +filesystem would, so an attempt to access a recently-removed object is
> +passed to autofs for handling.
> +
> +With version 5, this is not safe except for unmounting from top-level
> +directories. As lower-level directories are never mount traps, other
> +processes will see an empty directory as soon as the filesystem is
> +unmounted. So it is generally safest to use the autofs expiry
> +protocol described below.
> +
> +Normally the daemon only wants to remove entries which haven't been
> +used for a while. For this purpose autofs maintains a "`last_used`"
> +time stamp on each directory or symlink. For symlinks it genuinely
> +does record the last time the symlink was "used" or followed to find
> +out where it points to. For directories the field is a slight
> +misnomer. It actually records the last time that autofs checked if
> +the directory or one of its descendents was busy and found that it
> +was. This is just as useful and doesn't require updating the field so
> +often.
> +
> +The daemon is able to ask autofs if anything is due to be expired,
> +using an `ioctl` as discussed later. For a *direct* mount, autofs
> +considers if the entire mount-tree can be unmounted or not. For an
> +*indirect* mount, autofs considers each of the names in the top level
> +directory to determine if any of those can be unmounted and cleaned
> +up.
> +
> +There is an option with indirect mounts to consider each of the leaves
> +that has been mounted on instead of considering the top-level names.
> +This is intended for compatability with version 4 of autofs and should
> +be considered as deprecated.
> +
> +When autofs considers a directory it checks the `last_used` time and
> +compares it with the "timeout" value set when the filesystem was
> +mounted, though this check is ignored in some cases. It also checks if
> +the directory or anything below it is in use. For symbolic links,
> +only the `last_used` time is ever considered.
> +
> +If both appear to support expiring the directory or symlink, an action
> +is taken.
> +
> +There are two ways to ask autofs to consider expiry. The first is to
> +use the **AUTOFS_IOC_EXPIRE** ioctl. This only works for indirect
> +mounts. If it finds something in the root directory to expire it will
> +return the name of that thing. Once a name has been returned the
> +automount daemon needs to unmount any filesystems mounted below the
> +name normally. As described above, this is unsafe for non-toplevel
> +mounts in a version-5 autofs. For this reason the current `automountd`
> +does not use this ioctl.
> +
> +The second mechanism uses either the **AUTOFS_DEV_IOCTL_EXPIRE_CMD** or
> +the **AUTOFS_IOC_EXPIRE_MULTI** ioctl. This will work for both direct and
> +indirect mounts. If it selects an object to expire, it will notify
> +the daemon using the notification mechanism described below. This
> +will block until the daemon acknowledges the expiry notification.
> +This implies that the "`EXPIRE`" ioctl must be sent from a different
> +thread than the one which handles notification.
> +
> +While the ioctl is blocking, the entry is marked as "expiring" and
> +`d_manage` will block until the daemon affirms that the unmount has
> +completed (together with removing any directories that might have been
> +necessary), or has been aborted.
> +
> +Communicating with autofs: detecting the daemon
> +-----------------------------------------------
> +
> +There are several forms of communication between the automount daemon
> +and the filesystem. As we have already seen, the daemon can create and
> +remove directories and symlinks using normal filesystem operations.
> +autofs knows whether a process requesting some operation is the daemon
> +or not based on it's process-group id number (see getpgid(1)).
> +
> +When an autofs filesystem it mounted the pgid of the mounting
> +processes is recorded unless that "pgrp=" option is given, in which
> +case that number is recorded instead. Any request arriving from a
> +process in that process group is considered to come from the daemon.
> +If the daemon ever has to be stopped and restarted a new pgid can be
> +provided through an ioctl as will be described below.
> +
> +Communicating with autofs: the event pipe
> +-----------------------------------------
> +
> +When an autofs filesystem is mounted, the 'write' end of a pipe must
> +be passed using the 'fd=' mount option. autofs will write
> +notification messages to this pipe for the daemon to respond to.
> +For version 5, the format of the message is:
> +
> + struct autofs_v5_packet {
> + int proto_version; /* Protocol version */
> + int type; /* Type of packet */
> + autofs_wqt_t wait_queue_token;
> + __u32 dev;
> + __u64 ino;
> + __u32 uid;
> + __u32 gid;
> + __u32 pid;
> + __u32 tgid;
> + __u32 len;
> + char name[NAME_MAX+1];
> + };
> +
> +where the type is one of
> +
> + autofs_ptype_missing_indirect
> + autofs_ptype_expire_indirect
> + autofs_ptype_missing_direct
> + autofs_ptype_expire_direct
> +
> +so messages can indicate that a name is missing (something tried to
> +access it but it isn't there) or that it has been selected for expiry.
> +
> +The pipe will be set to "packet mode" (equivalent to passing
> +`O_DIRECT`) to _pipe2(2)_ so that a read from the pipe will return at
> +most one packet, and any unread portion of a packet will be discarded.
> +
> +The `wait_queue_token` is a unique number which can identify a
> +particular request to be acknowledged. When a message is sent over
> +the pipe the affected dentry is marked as either "active" or
> +"expiring" and other accesses to it block until the messages is
> +acknowledged using one of the ioctls below and the relevant
> +`wait_queue_token`.
> +
> +Communicating with autofs: root directory ioctls
> +------------------------------------------------
> +
> +The root directory of an autofs filesystem will respond to a number of
> +ioctls. The process issuing the ioctl must have the CAP_SYS_ADMIN
> +capability, or must be the automount daemon.
> +
> +The available ioctl commands are:
> +
> +- **AUTOFS_IOC_READY**: a notification has been handled. The argument
> + to the ioctl command is the "wait_queue_token" number
> + corresponding to the notification being acknowledged.
> +- **AUTOFS_IOC_FAIL**: similar to above, but indicates failure with
> + the error code `ENOENT`.
> +- **AUTOFS_IOC_CATATONIC**: Causes the autofs to enter "catatonic"
> + mode meaning that it stops sending notifications to the daemon.
> + This mode is also entered if a write to the pipe fails.
> +- **AUTOFS_IOC_PROTOVER**: This returns the protocol version in use.
> +- **AUTOFS_IOC_PROTOSUBVER**: Returns the protocol sub-version which
> + is really a version number for the implementation. It is
> + currently 2.
> +- **AUTOFS_IOC_SETTIMEOUT**: This passes a pointer to an unsigned
> + long. The value is used to set the timeout for expiry, and
> + the current timeout value is stored back through the pointer.
> +- **AUTOFS_IOC_ASKUMOUNT**: Returns, in the pointed-to `int`, 1 if
> + the filesystem could be unmounted. This is only a hint as
> + the situation could change at any instant. This call can be
> + use to avoid a more expensive full unmount attempt.
> +- **AUTOFS_IOC_EXPIRE**: as described above, this asks if there is
> + anything suitable to expire. A pointer to a packet:
> +
> + struct autofs_packet_expire_multi {
> + int proto_version; /* Protocol version */
> + int type; /* Type of packet */
> + autofs_wqt_t wait_queue_token;
> + int len;
> + char name[NAME_MAX+1];
> + };
> +
> + is required. This is filled in with the name of something
> + that can be unmounted or removed. If nothing can be expired,
> + `errno` is set to `EAGAIN`. Even though a `wait_queue_token`
> + is present in the structure, not "wait queue" is established
> + and no acknowledgment is needed.
> +- **AUTOFS_IOC_EXPIRE_MULTI**: This is similar to
> + **AUTOFS_IOC_EXPIRE** except that it causes notification to be
> + sent to the daemon, and it blocks until the daemon acknowledges.
> + The argument is an integer which can contain two different flags.
> +
> + **AUTOFS_EXP_IMMEDIATE** causes `last_used` time to be ignored
> + and objects are expired if the are not in use.
> +
> + **AUTOFS_EXP_LEAVES** will select a leaf rather than a top-level
> + name to expire. This is only safe when *maxproto* is 4.
> +
> +Communicating with autofs: char-device ioctls
> +---------------------------------------------
> +
> +It is not always possible to open the root of an autofs filesystem,
> +particularly a *direct* mounted filesystem. If the automount daemon
> +is restarted there is no way for it to regain control of existing
> +mounts using any of the above communication channels. To address this
> +need there is a "miscellaneous" character device (major 10, minor 235)
> +which can be used to communicate directly with the autofs filesystem.
> +It requires CAP_SYS_ADMIN for access.
> +
> +The `ioctl`s that can be used on this device a described in a separate
> +document `autofs4-mount-control.txt`, and are summarized briefly here.
> +Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure:
> +
> + struct autofs_dev_ioctl {
> + __u32 ver_major;
> + __u32 ver_minor;
> + __u32 size; /* total size of data passed in
> + * including this struct */
> + __s32 ioctlfd; /* automount command fd */
> +
> + __u32 arg1; /* Command parameters */
> + __u32 arg2;
> +
> + char path[0];
> + };
> +
> +For the **OPEN_MOUNT** and **IS_MOUNTPOINT** commands, the target
> +filesystem is identified by the `path`. All other commands identify
> +the filesystem by the `ioctlfd` which is a file descriptor open on the
> +root, and which can be returned by **OPEN_MOUNT**.
> +
> +The `ver_major` and `ver_minor` are in/out parameters which check that
> +the requested version is supported, and report the maximum version
> +that the kernel module can support.
> +
> +Commands are:
> +
> +- **AUTOFS_DEV_IOCTL_VERSION_CMD**: does nothing, except validate and
> + set version numbers.
> +- **AUTOFS_DEV_IOCTL_OPENMOUNT_CMD**: return an open file descriptor
> + on the root of an autofs filesystem. The filesystem is identified
> + by name a device number, which is stored in `arg1`. Device
> + numbers for existing filesystems can be found in
> + `/proc/self/mountinfo`.
> +- **AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD**: same as `close(ioctlfd)`.
> +- **AUTOFS_DEV_IOCTL_SETPIPEFD_CMD**: if the filesystem is in
> + catatonic mode, this can provide the write end of a new pipe
> + in `arg1` to re-establish communication with a daemon. The
> + process group of the calling process is used to identify the
> + daemon.
> +- **AUTOFS_DEV_IOCTL_REQUESTER_CMD**: `path` should be a
> + name within the filesystem that as been auto-mounted on.
> + arg1 is the dev number of the underlying autofs. On successful
> + return, `arg1` and `arg2` will be the UID and GID of the process
> + which triggered that mount.
> +
> +- **AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD**: Check if path is a
> + mountpoint of a particular type - see separate documentation for
> + details.
> +
> +- **AUTOFS_DEV_IOCTL_PROTOVER_CMD**:
> +- **AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD**:
> +- **AUTOFS_DEV_IOCTL_READY_CMD**:
> +- **AUTOFS_DEV_IOCTL_FAIL_CMD**:
> +- **AUTOFS_DEV_IOCTL_CATATONIC_CMD**:
> +- **AUTOFS_DEV_IOCTL_TIMEOUT_CMD**:
> +- **AUTOFS_DEV_IOCTL_EXPIRE_CMD**:
> +- **AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD**: These all have the same
> + function as the similarly named **AUTOFS_IOC** ioctls, except
> + that **FAIL** can be given an explicit error number in `arg1`
> + instead of assuming `ENOENT`, and this **EXPIRE** command
> + corresponds to **AUTOFS_IOC_EXPIRE_MULTI**.
> +
> +Catatonic mode
> +--------------
> +
> +As mentioned, an autofs mount can enter "catatonic" mode. This
> +happens if a write to the notification pipe fails, or if it is
> +explicitly requested by an `ioctl`.
> +
> +When entering catatonic mode, the pipe is closed and any pending
> +notifications are acknowledged with the error `ENOENT`.
> +
> +Once in catatonic mode attempts to access non-existing names will
> +result in `ENOENT` while attempts to access existing directories will
> +be treated in the same way is if they came from the daemon, so mount
> +traps will not fire.
> +
> +When the filesystem is mounted a _uid_ and _gid_ can be given which
> +set the ownership of directories and symbolic links. When the
> +filesystem is in catatonic mode, any process with a matching UID can
> +create directories or symlink in the root directory, but not in other
> +directories.
> +
> +Catatonic mode can only be left via the
> +**AUTOFS_DEV_IOCTL_OPENMOUNT_CMD** ioctl on the `/dev/autofs`.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists