linux-kernel - Re: [GIT PULL] pidfd updates

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20230425135429.GQ3390869@ZenIV>
Date:   Tue, 25 Apr 2023 14:54:29 +0100
From:   Al Viro <viro@...iv.linux.org.uk>
To:     Christian Brauner <brauner@...nel.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] pidfd updates

On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote:

> It is rife with misunderstandings just looking at what we did in
> kernel/fork.c earlier:
> 
> 	retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
>         [...]
>         pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
>                                      O_RDWR | O_CLOEXEC);
> 
> seeing where both get_unused_fd_flags() and both *_getfile() take flag
> arguments. Sure, for us this is pretty straightforward since we've seen
> that code a million times. For others it's confusing why there's two
> flag arguments. Sure, we could use one flags argument but it still be
> weird to look at.

First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and
anon_inode_getfile() - about O_CLOEXEC.  Duplicating the expression in
places like that is cargo-culting.

> But with this api we also force all users to remember that they need to
> cleanup the fd and the file - but definitely one - depending on where
> they fail.
> 
> But conceptually the fd and the file belong together. After all it's the
> file we're about to make that reserved fd refer to.

Why?  The common pattern is
	* reserve the descriptor
	* do some work that might fail
	* get struct file set up (which also might fail)
	* do more work (possibly including copyout, etc.)
	* install file into descriptor table

We want to reserve the descriptor early, since it's about the easiest
thing to undo.  Why bother doing it next to file setup?  That can be
pretty deep in call chain and doing it that way comes with headache
about passing the damn thing around.  And cleanup rules with your
variant end up being more complicated.

Keep the manipulations of descriptor table as close to the surface
as possible.  The real work is with file references; descriptors
belong in userland and passing them around kernel-side is asking
for headache.