linux-kernel - Re: [PATCH] Documentation: add new description of path-name lookup.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55B5A880.8040500@infradead.org>
Date:	Sun, 26 Jul 2015 20:41:52 -0700
From:	Randy Dunlap <rdunlap@...radead.org>
To:	NeilBrown <neilb@...e.com>, Jonathan Corbet <corbet@....net>
CC:	Al Viro <viro@...IV.linux.org.uk>,
	lkml <linux-kernel@...r.kernel.org>,
	"J. Bruce Fields" <bfields@...ldses.org>, linux-doc@...r.kernel.org
Subject: Re: [PATCH] Documentation: add new description of path-name lookup.

On 07/24/15 17:28, NeilBrown wrote:
> 

Hi Neil,

Some edits for you to consider.


> 
> diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.md
> new file mode 100644
> index 000000000000..eedf31c737ed
> --- /dev/null
> +++ b/Documentation/filesystems/path-lookup.md
> @@ -0,0 +1,1297 @@
> +<head>
> +<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
> +</head>
> +
> +Pathname lookup in Linux.
> +=========================
> +
> +This write-up is based on three articles published at lwn.net:
> +
> +- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
> +- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
> +- <https://lwn.net/Articles/650786/> A walk among the symlinks
> +
> +Written by Neil Brown with help from Al Viro and Jon Corbet.
> +
> +Introduction
> +------------


> +Bringing it together with `struct nameidata`
> +--------------------------------------------
> +

> +### `struct path root` ###
> +
> +This is used to hold a reference to the effective root of the
> +filesystem.  Often that reference won't be needed, so this field is
> +only assigned the first time it is used, or when a non-standard root
> +is requested.  Keeping a reference in the `nameidata` ensures that
> +only one root is in effect for the entire path walk, even if it races
> +with a `chroot()` system call.
> +
> +The root is needed when either of two conditions holds: (1) either the
> +pathname or a symbolic link starts with a "'/'", or (2) a "`..`"
> +component is being handled, since "`..`" from the root must always stay
> +at the root.  The value used is usually the current root directory of
> +the calling process.  An alternate root can be provided as when
> +`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call
> +`mount_subtree()`.  In each case a pathname is being looked up in a very
> +specific part of the filesystem, and the lookup must not be allowed to
> +escape that subtree.  It works a bit like a local `chroot()`.
> +
> +Ignoring the handling of symbolic links, we can now describe the
> +"`link_path_walk()`" function, which handles the lookup of everything
> +except the final component as:
> +
> +>  Given a path (`name`) and a nameidata structure (`nd`), check that the
> +>  current directory has execute permission and then advance `name`
> +>  over one component while updating `last_type` and `last`.  If that
> +>  was the final component, then return, otherwise call
> +>  `walk_component()` and repeat from the top.
> +
> +`walk_component()` is even easier.  If the component is `LAST_DOTS`,
> +it calls `handle_dots()` which does the necessary locking as already
> +described.  If it finds a `LAST_NORM` component it first calls
> +"`lookup_fast()`" which only looks in the dcache, but will ask the
> +filesystem to revalidate the result if it is that sort of filesystem.
> +If that doesn't get a good result, it calls "`lookup_slow()`" which
> +takes the `i_mutex`, rechecks the cache, and then asks the filesystem
> +to find a definitive answer.  Each of these will call
> +`follow_managed()` (as described below) to handle any mount points.
> +
> +In the absence of symbolic links, `walk_component()` creates a new
> +`struct path` containing a counted reference to the new dentry and a
> +reference to the new `vfsmount` which is only counted if it is
> +different from the previous `vfsmount`.  It then calls
> +`path_to_nameidata()` to install the new `struct path` in the
> +`struct nameidate` and drop the unneeded references.

           nameidata

> +
> +This "hand-over-hand" sequencing of getting a reference to the new
> +dentry before dropping the reference to the previous dentry may
> +seem obvious, but is worth pointing out so that we will recognize its
> +analogue in the "RCU-walk" version.
> +
> +Handling the final component.
> +-----------------------------
> +
> +`link_path_walk()` only walks as far as setting `nd->last` and
> +`nd->last_type` to refer to the final component of the path.  It does
> +not call `walk_component()` that last time.  Handling that final
> +component remains for the caller to sort out. Those callers are
> +`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and
> +`path_openat()` each of which handles the differing requirements of
> +different system calls.
> +
> +`path_parentat()` is clearly the simplest - it just wraps a little bit
> +of housekeeping around `link_path_walk()` and returns the parent
> +directory and final component to the caller.  The caller will be either
> +aiming to create a name (via `filename_create()`) or remove or rename
> +a name (in which case `user_path_parent()` is used).  They will use
> +`i_mutex` to exclude other changes while they validate and then
> +perform their operation.
> +
> +`path_lookupat()` is nearly as simple - it is used when an existing
> +object is wanted such as by `stat()` or `chmod()`.  It essentially just
> +calls `walk_component()` on the final component through a call to
> +`lookup_last()`.  `path_lookupat()` returns just the final dentry.
> +
> +`path_mountpoint()` handles the special case of unmounting which must
> +not try to revalidate the mounted filesystem.  It effectively
> +contains, through a call to `mountpoint_last()`, an alternate
> +implementation of `lookup_slow()` which skips that step.  This is
> +important when unmounting a filesystem that is inaccessible, such as
> +one provided by a dead NFS server.
> +
> +Finally `path_openat()` is used for the `open()` system call; it
> +contains, in support functions starting with "`do_last()`", all the
> +complexity needed to handle the different subtleties of O_CREAT (with
> +or without O_EXCL) final "`/`" characters, and trailing symbolic

missing comma?      , final

> +links.  We will revisit this in the final part of this series, which
> +focuses on those symbolic links.  "`do_last()`" will sometimes, but
> +not always, take `i_mutex`, depending on what it finds.
> +
> +Each of these, or the functions which call them, need to be alert to
> +the possibility that the final component is not `LAST_NORM`.  If the
> +goal of the lookup is to create something, then any value for
> +`last_type` other than `LAST_NORM` will result in an error.  For
> +example if `path_parentat()` reports `LAST_DOTDOT`, then the caller
> +won't try to create that name.  They also check for trailing slashes
> +by testing `last.name[last.len]`.  If there is any character beyond
> +the final component, it must be a trailing slash.
> +
> +Revalidation and automounts
> +---------------------------
> +
> +Apart from symbolic links, there are only two parts of the "REF-walk"
> +process not yet covered.  One is the handling of stale cache entries
> +and the other is automounts.
> +
> +On filesystems that require it, the lookup routines will call the
> +`->d_revalidate()` dentry method to ensure that the cached information
> +is current.  This will often confirm validity or update a few details
> +from a server.  In some cases it may find that there has been change
> +further up the path and that something that was thought to be valid
> +previously isn't really.  When this happens the lookup of the whole
> +path is aborted and retried with the "`LOOKUP_REVAL`" flag set.  This
> +forces revalidation to be more thorough.  We will see more details of
> +this retry process in the next article.
> +
> +Automount points are locations in the filesystem where an attempt to
> +lookup a name can trigger changes to how that lookup should be
> +handled, in particular by mounting a filesystem there.  These are
> +covered in greater detail in autofs4.txt in the Linux documentation
> +tree, but a few notes specifically related to path lookup are in order
> +here.
> +
> +The Linux VFS has a concept of "managed" dentries which is reflected
> +in function names such as "`follow_managed()`".  There are three
> +potentially interesting things about these dentries corresponding
> +to three different flags that might be set in `dentry->d_flags`:
> +


> +RCU-walk - faster pathname lookup in Linux
> +==========================================
> +
> +RCU-walk is another algorithm for performing pathname lookup in Linux.
> +It is in many ways similar to REF-walk and the two share quite a bit
> +of code.  The significant difference in RCU-walk is how it allows for
> +the possibility of concurrent access.
> +
> +We noted that REF-walk is complex because there are numerous details
> +and special cases.  RCU-walk reduces this complexity by simply
> +refusing to handle a number of cases -- it instead falls back to
> +REF-walk.  The difficulty with RCU-walk comes from a different
> +direction: unfamiliarity.  The locking rules when depending on RCU are
> +quite different to traditional locking, so we will spend a little extra

I would say:       from

> +time when we come to those.
> +
> +Clear demarcation of roles
> +--------------------------
> +
> +The easiest way to manage concurrency is to forcibly stop any other
> +thread from changing the data structures that a given thread is
> +looking at.  In cases where no other thread would even think of
> +changing the data and lots of different threads want to read at the
> +same time, this can be very costly.  Even when using locks that permit
> +multiple concurrent readers, the simple act of updating the count of
> +the number of current readers can impose an unwanted cost.  So the
> +goal when reading a shared data structure that no other process is
> +changing, is to avoid writing anything to memory at all.  Take no

dubious comma^

> +locks, increment no counts, leave no footprints.
> +

> +RCU and seqlocks: fast and light
> +--------------------------------
> +

> +However, there is a little bit more to seqlocks than that.  If
> +RCU-walk accesses two different fields in a seqlock-protected
> +structure, or accesses the same field twice, there is no a-priori

                                             no hyphen:      ^

> +guarantee of any consistency between those accesses.  When consistency
> +is needed - which it usually is - RCU-walk must take a copy and then
> +use `read_seqcount_retry()` to validate that copy.
> +


> +`unlazy walk()` and `complete_walk()`
> +-------------------------------------
> +

> +A pair of patterns
> +------------------
> +
> +In various places in the details of REF-walk and RCU-walk, and also in
> +the big picture, there are a couple of related patterns that are worth
> +being aware of.
> +
> +The first is "try quickly and check, if that fails try slowly".  We
> +can see that in the high-level approach of first trying RCU-walk and
> +then trying REF-walk, and in places were `unlazy_walk()` is used to

                                       where

> +switch to REF-walk for the rest of the path.  We also saw it earlier
> +in `dget_parent()` when following a "`..`" link.  It tries a quick way
> +to get a reference, then falls back to taking locks if needed.
> +

> +A walk among the symlinks
> +=========================
> +

> +
> +Storage and lifetime of cached symlinks
> +---------------------------------------
> +

> +
> +When neither of these are suitable, the next most likely scenario is

                         is

> +that the filesystem will allocate some temporary memory and copy or
> +construct the symlink content into that memory whenever it is needed.
> +


> +Updating the access time
> +------------------------
> +
> +We previously said of RCU-walk that it would "take no locks, increment
> +no counts, leave no footprints."  We have since seen that some
> +"footprints" can be needed when handling symlinks as a counted
> +reference (or even a memory allocation) may be needed.  But these
> +footprints are best kept to a minimum.
> +
> +One other place where walking down a symlink can involve leaving
> +footprints in a way that doesn't affect directories is in updating access times.
> +In Unix (and Linux) every filesystem object has a "last accessed
> +time", or "`atime`".  Passing through a directory to access a file
> +within, is not considered to be an access for the purposes of

drop     ^ comma?

> +`atime`; only listing the contents of a directory can update its `atime`.
> +Symlinks are different it seems.  Both reading a symlink (with `readlink()`)
> +and looking up a symlink on the way to some other destination can
> +update the atime on that symlink.
> +



-- 
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/