linux-kernel - Re: [RFC PATCH 2/3] add statmnt(2) syscall

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJfpegt-VPZP3ou-TMQFs1Xupj_iWA5ttC2UUFKh3E43EyCOQQ@mail.gmail.com>
Date:   Fri, 15 Sep 2023 10:56:58 +0200
From:   Miklos Szeredi <miklos@...redi.hu>
To:     Christian Brauner <brauner@...nel.org>
Cc:     Miklos Szeredi <mszeredi@...hat.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-api@...r.kernel.org, linux-man@...r.kernel.org,
        linux-security-module@...r.kernel.org, Karel Zak <kzak@...hat.com>,
        Ian Kent <raven@...maw.net>,
        David Howells <dhowells@...hat.com>,
        Al Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <christian@...uner.io>,
        Amir Goldstein <amir73il@...il.com>
Subject: Re: [RFC PATCH 2/3] add statmnt(2) syscall

On Thu, 14 Sept 2023 at 17:27, Christian Brauner <brauner@...nel.org> wrote:
>
> On Thu, Sep 14, 2023 at 12:13:54PM +0200, Miklos Szeredi wrote:
> No worries, I think the discussion touching on this starts at:
> https://youtu.be/j3fp2MtRr2I?si=f-YBg6uWq80dV3VC&t=1603
> (with David talking quietly without a microphone for some parts
> unfortunately...)

(Thanks for digging that out.)

That discussion touched on two aspects of using a single call vs.
multiple calls:

 - atomicity
 - marshalling

Atomicity of getting a snapshot of the current mount tree with all of
its attributes was never guaranteed, although reading
/proc/self/mountinfo into a sufficiently large buffer would work that
way.   However, I don't see why mount trees would require stronger
guarantees than dentry trees (for which we have basically none).

Marshalling/demashalling of arbitrary structures is indeed ugly.  I
think what Linus suggested, and what this interface was based on is
much less than that.  Also see my suggestion below: it doesn't need
demashalling at all due to the fact that the kernel can fill in the
pointers.   And yes, this could be used for arbitrary structures
without compromising type safety, but at the cost of adding more
complexity to the kernel (at least ascii strings are just one type).

Even more type clean interface:

struct statmnt *statmnt(u64 mnt_id, u64 mask, void *buf, size_t
bufsize, unsigned int flags);

Kernel would return a fully initialized struct with the numeric as
well as string fields filled.  That part is trivial for userspace to
deal with.

For sizing the buffer and versioning the struct see discussion below.

> > What I'm thinking is making it even simpler for userspace:
> >
> > struct statmnt {
> >   ...
> >   char *mnt_root;
> >   char *mountpoint;
> >   char *fs_type;
> >   u32 num_opts;
> >   char *opts;
> > };
> >
> > I'd still just keep options nul delimited.
> >
> > Is there a good reason not to return pointers (pointing to within the
> > supplied buffer obviously) to userspace?
>
> It's really unpleasant to program with. Yes, I think you pointed out
> before that it often doesn't matter much as long as the system call is
> really only relevant to some special purpose userspace.
>
> But statmount() will be used pretty extensively pretty quickly for the
> purpose of finding out mount options on a mount (Querying a whole
> sequences of mounts via repeated listmount() + statmount() calls on the
> other hand will be rarer.).
>
> And there's just so many tools that need this: libmount, systemd, all
> kinds of container runtimes, path lookup libraries such as libpathrs,
> languages like go and rust that expose and wrap these calls and so on.
>
> Most of these tools don't need to know about filesystem mount options
> and if they do they can just query that through an extra system call. No
> harm in doing that.

Just pass sizeof(struct statmnt) as the buffer size, and it will work that way.

> The agreement we came to to split out listing submounts into a separate
> system call was exactly to avoid having to have a variable sized pointer
> at the end of the struct statmnt (That's also part of the video above
> btw.) and to make it as simple as possible.
>
> Plus, the format for how to return arbitrary filesystem mount options
> warrants a separate discussion imho as that's not really vfs level
> information.

Okay.   Let's take fs options out of this.

That leaves:

 - fs type and optionally subtype
 - root of mount within fs
 - mountpoint path

The type and subtype are naturally limited to sane sizes, those are
not an issue.

For paths the evolution of the relevant system/library calls was:

  char *getwd(char buf[PATH_MAX]);
  char *getcwd(char *buf, size_t size);
  char *get_current_dir_name(void);

It started out using a fixed size buffer, then a variable sized
buffer, then an automatically allocated buffer by the library, hiding
the need to resize on overflow.

The latest style is suitable for the statmnt() call as well, if we
worry about pleasantness of the API.

>
> > > This will also allow us to turn statmnt() into an extensible argument
> > > system call versioned by size just like we do any new system calls with
> > > struct arguments (e.g., mount_setattr(), clone3(), openat2() and so on).
> > > Which is how we should do things like that.
> >
> > The mask mechanism also allow versioning of the struct.
>
> Yes, but this is done with reserved space which just pushes away the
> problem and bloats the struct for the sake of an unknown future. If we
> were to use an extensible argument struct we would just version by size.
> The only requirement is that you extend by 64 bit (see struct
> clone_args) which had been extended.

No need for reserved space in fact.  Versioning would still work, as
long as userspace is strictly checking the return mask.  I.e. newly
added fields will come after the old buffer, as assumed by the kernel.
But the kernel will never set the mask bits for these fields, so
userspace should not ever look at them.  Note: the interface does have
a bufsize parameter, so no possibility of memory corruption in any
event.

I added the reserved space so that userspace would be protected from
rubbish at the end of the struct if the kernel was older.  A library
wrapper could work around that issue (move the variable part beyond
the end of the new struct), but it would require code update in the
wrapper, not just updating the struct.

But in fact it's much simpler to just add ample reserved space and be
done with it forever, no need to worry about versioning at all.

> > > numbers for sub types as well. So we don't need to use strings here.
> >
> > Ugh.
>
> Hm, idk. It's not that bad imho. We'll have to make some ugly tradeoffs.

Subtype is a fuse thing (e.g. sshfs would show up as  fuse.sshfs
/proc/self/mountinfo.  Forcing each fuse filesystem to invent a magic
number... please no.

Thanks,
Miklos