[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250523-vfs-pidfs-aa1d59a1e9b3@brauner>
Date: Fri, 23 May 2025 14:40:47 +0200
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [GIT PULL for v6.16] vfs pidfs
Hey Linus,
This pull request serves as the base for the coredump pull request I'm
sending separately:
https://lore.kernel.org/20250523-vfs-coredump-66643655f2fe@brauner
The reason is simply that this branch has been in -next for such a long
time that clearly delineating both topics by moving things around would
have caused more churn for little gain since the relationship is tight
enough that it fits into both categories.
/* Summary */
This contains pidfs updates for this cycle:
Features:
- Allow to hand out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option.
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used to
safely authenticate clients in userspace avoiding security bugs due to
pid recycling races (dbus, polkit, systemd, etc.).
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In this
case it currently returns EINVAL. Userspace still wants to get a pidfd
for a reaped process to have a stable handle it can pass on. This is
especially useful now that it is possible to retrieve exit information
through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag.
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space must
> already be prepared that the task underlying a pidfd is gone at the time
> they get their hands on the pidfd. For instance, resolving the pidfd to
> a PID via the fdinfo must be prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several kernel
> APIs currently add another layer that checks for this. In particular,
> SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
> but returns a stale pidfd if the task is reaped immediately after the
> respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways to
> check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
> particular reason to distinguish both cases. This also propagates
> through user-space APIs, which pass on pidfds. They must be prepared to
> pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
> stale pidfd from the kernel.
> Userspace must already deal with a pidfd referring to a reaped task as
> the task may exit and get reaped at any time will there are still many
> pidfds referring to it.
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure
that PIDFD_INFO_EXIT information is available whenever a pidfd for a
reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped
pidfds are only handed out if it is guaranteed that the caller sees the
exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs entry
for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is
stashed and drop it when the socket is destroyed. This guarantees that
exit information will always be recorded for the sk->sk_peer_pid task
and we can hand out pidfds for reaped processes.
- Hand a pidfd to the coredump usermode helper process.
Give userspace a way to instruct the kernel to install a pidfd for the
crashing process into the process started as a usermode helper. There's
still tricky race-windows that cannot be easily or sometimes not closed
at all by userspace. There's various ways like looking at the start time
of a process to make sure that the usermode helper process is started
after the crashing process but it's all very very brittle and fraught
with peril.
The crashed-but-not-reaped process can be killed by userspace before
coredump processing programs like systemd-coredump have had time to
manually open a PIDFD from the PID the kernel provides them, which means
they can be tricked into reading from an arbitrary process, and they run
with full privileges as they are usermode helper processes.
Even if that specific race-window wouldn't exist it's still the safest
and cleanest way to let the kernel provide the pidfd directly instead of
requiring userspace to do it manually. In parallel with this commit we
already have systemd adding support for this in [1].
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table so
it's available to the exec'd program.
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is empty
and can thus always use three as the file descriptor number.
Note, that we'll install a pidfd for the thread-group leader even if a
subthread is calling do_coredump(). We know that task linkage hasn't
been removed yet and even if this @current isn't the actual thread-group
leader we know that the thread-group leader cannot be reaped until
@current has exited.
- Allow to tell when a task has not been found from finding the wrong
task when creating a pidfd.
We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:
(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL.
So allow userspace to reliably distinguish between (1) and (2).
- Make it possible to detect when a pidfs entry would outlive the struct
pid it pinned.
- Add a range of new selftests.
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct pid.
- Avoid pointless reference count bump during release_task().
Fixes:
- Various fixes to the pidfd and coredump selftests.
- Fix error handling for replace_fd() when spawning coredump usermode helper.
/* Testing */
gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
No known conflicts.
The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:
Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)
are available in the Git repository at:
git@...olite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.16-rc1.pidfs
for you to fetch changes up to db56723ceaec87aa5cf871e623f464934b266228:
pidfs: detect refcount bugs (2025-05-06 13:59:00 +0200)
Please consider pulling these changes from the signed vfs-6.16-rc1.pidfs tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.16-rc1.pidfs
----------------------------------------------------------------
Christian Brauner (20):
selftests/pidfd: adapt to recent changes
pidfd: remove unneeded NULL check from pidfd_prepare()
pidfd: improve uapi when task isn't found
selftest/pidfd: add test for thread-group leader pidfd open for thread
Merge patch series "pidfd: improve uapi when task isn't found"
exit: move wake_up_all() pidfd waiters into __unhash_process()
pidfs: ensure consistent ENOENT/ESRCH reporting
Merge patch series "pidfs: ensure consistent ENOENT/ESRCH reporting"
net, pidfd: report EINVAL for ESRCH
pidfs: register pid in pidfs
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid"
Merge patch series "selftests: coredump: Some bug fixes"
pidfs: move O_RDWR into pidfs_alloc_file()
coredump: fix error handling for replace_fd()
coredump: hand a pidfd to the usermode coredump helper
Merge patch series "coredump: hand a pidfd to the usermode coredump helper"
pidfs: detect refcount bugs
Nam Cao (3):
selftests: coredump: Properly initialize pointer
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Raise timeout to 2 minutes
Oleg Nesterov (1):
release_task: kill the no longer needed get/put_pid(thread_pid)
fs/coredump.c | 65 +++++++++++++++--
fs/pidfs.c | 82 ++++++++++++++++++---
include/linux/coredump.h | 1 +
include/linux/pid.h | 2 +-
include/linux/pidfs.h | 3 +
include/uapi/linux/pidfd.h | 2 +-
kernel/exit.c | 10 ++-
kernel/fork.c | 88 ++++++++++-------------
kernel/pid.c | 6 +-
net/core/sock.c | 12 +++-
net/unix/af_unix.c | 85 +++++++++++++++++++---
tools/testing/selftests/coredump/stackdump_test.c | 10 ++-
tools/testing/selftests/pidfd/pidfd_info_test.c | 13 ++--
13 files changed, 286 insertions(+), 93 deletions(-)
Powered by blists - more mailing lists