linux-kernel - Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

Message-ID: <52483.1597190733@warthog.procyon.org.uk>
Date:   Wed, 12 Aug 2020 01:05:33 +0100
From:   David Howells <dhowells@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     dhowells@...hat.com, Miklos Szeredi <miklos@...redi.hu>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Al Viro <viro@...iv.linux.org.uk>, Karel Zak <kzak@...hat.com>,
        Jeff Layton <jlayton@...hat.com>,
        Miklos Szeredi <mszeredi@...hat.com>,
        Nicolas Dichtel <nicolas.dichtel@...nd.com>,
        Christian Brauner <christian@...uner.io>,
        Lennart Poettering <lennart@...ttering.net>,
        Linux API <linux-api@...r.kernel.org>,
        Ian Kent <raven@...maw.net>,
        LSM <linux-security-module@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]

Well, the start of it was my proposal of an fsinfo() system call.  That at its
simplest takes an object reference (eg. a path) and an integer attribute ID (it
could use a string instead, I suppose, but it would mean a bunch of strcmps
instead of integer comparisons) and returns the value of the attribute.  But I
allow you to do slightly more interesting things than that too.

Miklós seems dead-set against adding a system call specifically for this -
though he's proposed extending open in various ways and also proposed an
additional syscall, readfile(), that does the open+read+close all in one step.

I think also at some point, he (or maybe James?) proposed adding a new magic
filesystem mounted somewhere on proc (reflecting an open fd) that then had a
bunch of symlinks to somewhere in sysfs (reflecting a mount).  The idea being
that you did something like:

	fd = open("/path/to/object", O_PATH);
	sprintf(name, "/proc/self/fds/%u/attr1", fd);
	attrfd = open(name, O_RDONLY);
	read(attrfd, buf1, sizeof(buf1));
	close(attrfd);
	sprintf(name, "/proc/self/fds/%u/attr2", fd);
	attrfd = open(name, O_RDONLY);
	read(attrfd, buf2, sizeof(buf2));
	close(attrfd);

or:

	sprintf(name, "/proc/self/fds/%u/attr1", fd);
	readfile(name, buf1, sizeof(buf1));
	sprintf(name, "/proc/self/fds/%u/attr2", fd);
	readfile(name, buf2, sizeof(buf2));

and then "/proc/self/fds/12/attr2" might then be a symlink to, say,
"/sys/mounts/615/mount_attr".

Miklós's justification for this was that it could then be operated from a shell
script without the need for a utility - except that bash, at least, can't do
O_PATH opens.

James has proposed making fsconfig() able to retrieve attributes (though I'd
prefer to give it a sibling syscall that does the retrieval rather than making
fsconfig() do that too).

>   {
>      int fd, attrfd;
>
>      fd = open(path, O_PATH);
>      attrfd = openat(fd, name, O_ALT);
>      close(fd);
>      read(attrfd, value, size);
>      close(attrfd);
>   }

Please don't go down this path.  You're proposing five syscalls - including
creating two file descriptors - to do what fsinfo() does in one.

Do you have a particular objection to adding a syscall specifically for
retrieving filesystem/VFS information?

-~-

Anyway, in case you're interested in what I want to get out of this - which is
the reason for it being posted in the first place:

 (*) The ability to retrieve various attributes of a filesystem/superblock,
     including information on:

	- Filesystem features: Does it support things like hard links, user
          quotas, direct I/O.

	- Filesystem limits: What's the maximum size of a file, an xattr, a
          directory; how many files can it support.

	- Supported API features: What FS_IOC_GETFLAGS does it support?  Which
          can be set?  Does it have Windows file attributes available?  What
          statx attributes are supported?  What do the timestamps support?
          What sort of case handling is done on filenames?

     Note that for a lot of cases, this stuff is fixed and can just be memcpy'd
     from rodata.  Some of this is variable, however, in things like ext4 and
     xfs, depending on, say, mkfs configuration.  The situation is even more
     complex with network filesystems as this may depend on the server they're
     talking to.

     But note also that some of this stuff might change file-to-file, even
     within a superblock.

 (*) The ability to retrieve attributes of a mount point, including information
     on the flags, propagation settings and child lists.

 (*) The ability to quickly retrieve a list of accessible mount point IDs,
     with change event counters to permit userspace (eg. systemd) to quickly
     determine if anything changed in the event of an overrun.

 (*) The ability to find mounts/superblocks by mount ID.  Paths are not unique
     identifiers for mountpoints.  You can stack multiple mounts on the same
     directory, but a path only sees the top one.

 (*) The ability to look inside a different mount namespace - one to which you
     have a reference fd.  This would allow a container manager to look inside
     the container it is managing.

 (*) The ability to expose filesystem-specific attributes.  Network filesystems
     can expose lists of servers and server addresses, for instance.

 (*) The ability to use the object referenced to determine the namespace
     (particularly the network namespace) to look in.  The problem with looking
     in, say, /proc/net/... is that it looks at current's net namespace -
     whether or not the object of interest is in the same one.

 (*) The ability to query the context attached to the fd obtained from
     fsopen().  Such a context may not have a superblock attached to it yet or
     may not be mounted yet.

     The aim is to allow a container manager to supervise a mount being made in
     a container.  It kind of pairs with fsconfig() in that respect.

 (*) The ability to query mount and superblock event counters to help a
     watching process handle overrun in the notifications queue.
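
For comparison, mount IDs can already be obtained today by parsing the text
in /proc/self/mountinfo, where the first two fields of each line are the
mount ID and the parent mount ID.  The following is a rough sketch of that
text-based route, not part of the fsinfo() proposal; the function names are
made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Parse the first two fields of a /proc/self/mountinfo line: the mount ID
 * and the parent mount ID.  Returns 0 on success, -1 if the line does not
 * start with two integers.
 */
int parse_mount_ids(const char *line, int *mount_id, int *parent_id)
{
	return sscanf(line, "%d %d", mount_id, parent_id) == 2 ? 0 : -1;
}

/*
 * Print every mount ID visible in the caller's mount namespace.
 * Returns the number of mounts printed, or -1 if the file can't be opened.
 */
int list_mount_ids(FILE *out)
{
	FILE *f = fopen("/proc/self/mountinfo", "r");
	char *line = NULL;
	size_t cap = 0;
	int id, parent, count = 0;

	if (!f)
		return -1;
	while (getline(&line, &cap, f) != -1) {
		if (parse_mount_ids(line, &id, &parent) == 0) {
			fprintf(out, "mount %d (parent %d)\n", id, parent);
			count++;
		}
	}
	free(line);
	fclose(f);
	return count;
}
```

This reads the whole file and decodes text for every mount, and it only sees
the caller's own mount namespace.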


What I've done with fsinfo() is:

 (*) Provided a number of ways to refer to the object to be queried (path,
     dirfd+path, fd, mount ID - with others planned).

 (*) Made it so that attributes are referenced by a numeric ID to keep search
     time minimal.  Numeric IDs must be declared in uapi/linux/fsinfo.h.

 (*) Made it so that the core does most of the work.  Filesystems are given an
     in-kernel buffer to copy into and don't get to see any userspace pointers.

 (*) Made it so that values are not, by and large, encoded as text if it can be
     avoided.  Backward and forward compatibility on binary structs is handled
     by the core.  The filesystem just fills in the values in the UAPI struct
     in the buffer.  The core will zero-pad or truncate the data to match what
     userspace asked for.

     The UAPI struct must be declared in uapi/linux/fsinfo.h.

 (*) Made it so that, for some attributes, the core will fill in the data as
     best it can from what's available in the superblock, mount struct or mount
     namespace.  The filesystem can then amend this if it wants to.

 (*) Made it so that attributes are typed.  The types are few: string, struct,
     list of struct, opaque.  Structs are extensible: the length is the
     version, a new version is required to be a superset of the old version and
     any excess space requested is simply cleared by the kernel.

     Information about the type of an attribute can be queried by fsinfo().
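
The zero-pad-or-truncate rule for extensible structs can be sketched in plain
C as below.  This is an illustration under assumptions, not the fsinfo()
code: the struct names, the fields and copy_versioned() are all hypothetical,
and returning the full source size is a choice made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute struct: version 1, and version 2 as a superset. */
struct example_limits_v1 {
	unsigned long long max_file_size;
};

struct example_limits_v2 {
	unsigned long long max_file_size;
	unsigned long long max_xattr_size;	/* added in v2 */
};

/*
 * Copy a value of src_len bytes into a caller buffer of dst_len bytes:
 * truncate if the caller asked for less, zero-fill the tail if it asked for
 * more.  Returns src_len so the caller can tell whether truncation happened
 * (an assumption of this sketch).
 */
size_t copy_versioned(void *dst, size_t dst_len,
		      const void *src, size_t src_len)
{
	size_t n = dst_len < src_len ? dst_len : src_len;

	memcpy(dst, src, n);
	if (dst_len > n)
		memset((char *)dst + n, 0, dst_len - n);
	return src_len;
}
```

An old caller passing sizeof(struct example_limits_v1) gets just the fields
it knows about.  A newer caller talking to an older producer sees the extra
fields as zero.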


What I want to avoid:

 (*) Adding another magic filesystem.

 (*) Adding symlinks from proc to sysfs.

 (*) Having to use open to get an attribute.

 (*) Having to use multiple opens to get an attribute.

 (*) Having to pathwalk to get to the attribute from the object being queried.

 (*) Allocating another O_ open flag for this.

 (*) Avoidable text encoding and decoding.

 (*) Letting the filesystem access the userspace buffer.

Note that I'm not against splitting fsinfo() into a set of sibling syscalls if
that makes it more palatable, or even against using strings for the attribute
IDs, though I'd prefer to avoid the strcmps.

David