lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250523-vfs-pidfs-aa1d59a1e9b3@brauner>
Date: Fri, 23 May 2025 14:40:47 +0200
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
	linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: [GIT PULL for v6.16] vfs pidfs

Hey Linus,

This pull request serves as the base for the coredump pull request I'm
sending separately:

https://lore.kernel.org/20250523-vfs-coredump-66643655f2fe@brauner

The reason is simply that this branch has been in -next for such a long
time that clearly delineating both topics by moving things around would
have caused more churn for little gain since the relationship is tight
enough that it fits into both categories.

/* Summary */

This contains pidfs updates for this cycle:

Features:

- Allow to hand out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
  socket option.

  SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
  the process that called connect() or listen(). This is heavily used to
  safely authenticate clients in userspace avoiding security bugs due to
  pid recycling races (dbus, polkit, systemd, etc.).

  SO_PEERPIDFD currently doesn't support handing out pidfds if the
  sk->sk_peer_pid thread-group leader has already been reaped. In this
  case it currently returns EINVAL. Userspace still wants to get a pidfd
  for a reaped process to have a stable handle it can pass on. This is
  especially useful now that it is possible to retrieve exit information
  through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag.

  Another summary has been provided by David Rheinsberg:

  > A pidfd can outlive the task it refers to, and thus user-space must
  > already be prepared that the task underlying a pidfd is gone at the time
  > they get their hands on the pidfd. For instance, resolving the pidfd to
  > a PID via the fdinfo must be prepared to read `-1`.
  >
  > Despite user-space knowing that a pidfd might be stale, several kernel
  > APIs currently add another layer that checks for this. In particular,
  > SO_PEERPIDFD returns `EINVAL` if the peer-task was already reaped,
  > but returns a stale pidfd if the task is reaped immediately after the
  > respective alive-check.
  >
  > This has the unfortunate effect that user-space now has two ways to
  > check for the exact same scenario: A syscall might return
  > EINVAL/ESRCH/... *or* the pidfd might be stale, even though there is no
  > particular reason to distinguish both cases. This also propagates
  > through user-space APIs, which pass on pidfds. They must be prepared to
  > pass on `-1` *or* the pidfd, because there is no guaranteed way to get a
  > stale pidfd from the kernel.
  > Userspace must already deal with a pidfd referring to a reaped task as
  > the task may exit and get reaped at any time will there are still many
  > pidfds referring to it.

  In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure
  that PIDFD_INFO_EXIT information is available whenever a pidfd for a
  reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped
  pidfds are only handed out if it is guaranteed that the caller sees the
  exit information:

  TEST_F(pidfd_info, success_reaped)
  {
          struct pidfd_info info = {
                  .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
          };

          /*
           * Process has already been reaped and PIDFD_INFO_EXIT been set.
           * Verify that we can retrieve the exit status of the process.
           */
          ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
          ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
          ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
          ASSERT_TRUE(WIFEXITED(info.exit_code));
          ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
  }

  To hand out pidfds for reaped processes we thus allocate a pidfs entry
  for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is
  stashed and drop it when the socket is destroyed. This guarantees that
  exit information will always be recorded for the sk->sk_peer_pid task
  and we can hand out pidfds for reaped processes.

- Hand a pidfd to the coredump usermode helper process.

  Give userspace a way to instruct the kernel to install a pidfd for the
  crashing process into the process started as a usermode helper. There's
  still tricky race-windows that cannot be easily or sometimes not closed
  at all by userspace. There's various ways like looking at the start time
  of a process to make sure that the usermode helper process is started
  after the crashing process but it's all very very brittle and fraught
  with peril.

  The crashed-but-not-reaped process can be killed by userspace before
  coredump processing programs like systemd-coredump have had time to
  manually open a PIDFD from the PID the kernel provides them, which means
  they can be tricked into reading from an arbitrary process, and they run
  with full privileges as they are usermode helper processes.

  Even if that specific race-window wouldn't exist it's still the safest
  and cleanest way to let the kernel provide the pidfd directly instead of
  requiring userspace to do it manually. In parallel with this commit we
  already have systemd adding support for this in [1].

  When the usermode helper process is forked we install a pidfd file
  descriptor three into the usermode helper's file descriptor table so
  it's available to the exec'd program.

  Since usermode helpers are either children of the system_unbound_wq
  workqueue or kthreadd we know that the file descriptor table is empty
  and can thus always use three as the file descriptor number.

  Note, that we'll install a pidfd for the thread-group leader even if a
  subthread is calling do_coredump(). We know that task linkage hasn't
  been removed yet and even if this @current isn't the actual thread-group
  leader we know that the thread-group leader cannot be reaped until
  @current has exited.

- Allow to tell when a task has not been found from finding the wrong
  task when creating a pidfd.

  We currently report EINVAL whenever a struct pid has no tasked attached
  anymore thereby conflating two concepts:

  (1) The task has already been reaped.
  (2) The caller requested a pidfd for a thread-group leader but the pid
      actually references a struct pid that isn't used as a thread-group
      leader.

  This is causing issues for non-threaded workloads as in where they
  expect ESRCH to be reported, not EINVAL.

  So allow userspace to reliably distinguish between (1) and (2).

- Make it possible to detect when a pidfs entry would outlive the struct
  pid it pinned.

- Add a range of new selftests.

Cleanups:

- Remove unneeded NULL check from pidfd_prepare() for passed struct pid.

- Avoid pointless reference count bump during release_task().

Fixes:

- Various fixes to the pidfd and coredump selftests.

- Fix error handling for replace_fd() when spawning coredump usermode helper.

/* Testing */

gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3)

No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

No known conflicts.

The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:

  Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)

are available in the Git repository at:

  git@...olite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.16-rc1.pidfs

for you to fetch changes up to db56723ceaec87aa5cf871e623f464934b266228:

  pidfs: detect refcount bugs (2025-05-06 13:59:00 +0200)

Please consider pulling these changes from the signed vfs-6.16-rc1.pidfs tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.16-rc1.pidfs

----------------------------------------------------------------
Christian Brauner (20):
      selftests/pidfd: adapt to recent changes
      pidfd: remove unneeded NULL check from pidfd_prepare()
      pidfd: improve uapi when task isn't found
      selftest/pidfd: add test for thread-group leader pidfd open for thread
      Merge patch series "pidfd: improve uapi when task isn't found"
      exit: move wake_up_all() pidfd waiters into __unhash_process()
      pidfs: ensure consistent ENOENT/ESRCH reporting
      Merge patch series "pidfs: ensure consistent ENOENT/ESRCH reporting"
      net, pidfd: report EINVAL for ESRCH
      pidfs: register pid in pidfs
      net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
      pidfs: get rid of __pidfd_prepare()
      net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
      Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid"
      Merge patch series "selftests: coredump: Some bug fixes"
      pidfs: move O_RDWR into pidfs_alloc_file()
      coredump: fix error handling for replace_fd()
      coredump: hand a pidfd to the usermode coredump helper
      Merge patch series "coredump: hand a pidfd to the usermode coredump helper"
      pidfs: detect refcount bugs

Nam Cao (3):
      selftests: coredump: Properly initialize pointer
      selftests: coredump: Fix test failure for slow machines
      selftests: coredump: Raise timeout to 2 minutes

Oleg Nesterov (1):
      release_task: kill the no longer needed get/put_pid(thread_pid)

 fs/coredump.c                                     | 65 +++++++++++++++--
 fs/pidfs.c                                        | 82 ++++++++++++++++++---
 include/linux/coredump.h                          |  1 +
 include/linux/pid.h                               |  2 +-
 include/linux/pidfs.h                             |  3 +
 include/uapi/linux/pidfd.h                        |  2 +-
 kernel/exit.c                                     | 10 ++-
 kernel/fork.c                                     | 88 ++++++++++-------------
 kernel/pid.c                                      |  6 +-
 net/core/sock.c                                   | 12 +++-
 net/unix/af_unix.c                                | 85 +++++++++++++++++++---
 tools/testing/selftests/coredump/stackdump_test.c | 10 ++-
 tools/testing/selftests/pidfd/pidfd_info_test.c   | 13 ++--
 13 files changed, 286 insertions(+), 93 deletions(-)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ