Message-ID: <20250513000654.70344-1-kuniyu@amazon.com>
Date: Mon, 12 May 2025 17:06:50 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <brauner@...nel.org>
CC: <alexander@...alicyn.com>, <bluca@...ian.org>, <daan.j.demeyer@...il.com>,
<daniel@...earbox.net>, <davem@...emloft.net>, <david@...dahead.eu>,
<edumazet@...gle.com>, <horms@...nel.org>, <jack@...e.cz>,
<jannh@...gle.com>, <kuba@...nel.org>, <kuniyu@...zon.com>,
<lennart@...ttering.net>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-security-module@...r.kernel.org>,
<me@...dnzj.com>, <netdev@...r.kernel.org>, <oleg@...hat.com>,
<pabeni@...hat.com>, <viro@...iv.linux.org.uk>, <zbyszek@...waw.pl>
Subject: Re: [PATCH v6 4/9] coredump: add coredump socket
From: Christian Brauner <brauner@...nel.org>
Date: Mon, 12 May 2025 10:55:23 +0200
> Coredumping currently supports two modes:
>
> (1) Dumping directly into a file somewhere on the filesystem.
> (2) Dumping into a pipe connected to a usermode helper process
> spawned as a child of the system_unbound_wq or kthreadd.
>
> For simplicity I'm mostly ignoring (1). There are probably still some
> users of (1) out there, but processing coredumps in this way can be
> considered adventurous, especially in the face of set*id binaries.
>
> The most common option should be (2) by now. It works by allowing
> userspace to put a string into /proc/sys/kernel/core_pattern like:
>
> |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
>
> The "|" at the beginning indicates to the kernel that a pipe must be
> used. The path following the pipe indicator is a path to a binary that
> will be spawned as a usermode helper process. Any additional parameters
> pass information about the task that is generating the coredump to the
> binary that processes the coredump.
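
For illustration only (not part of the patch): a minimal Python sketch of how a pipe-style core_pattern splits into the helper path and its template arguments. The function name parse_pipe_pattern and the SPECIFIERS table are hypothetical; the specifier meanings are the ones documented in core(5).

```python
# Meanings of the template specifiers used in the example pattern,
# as documented in core(5).
SPECIFIERS = {
    "%P": "PID of the dumped process in the initial PID namespace",
    "%u": "real UID of the dumped process",
    "%g": "real GID of the dumped process",
    "%s": "number of the signal causing the dump",
    "%t": "time of dump, in seconds since the Epoch",
    "%c": "core file size soft resource limit of the crashing process",
    "%h": "hostname",
}


def parse_pipe_pattern(pattern):
    """Split a '|helper args...' core_pattern into (helper_path, args).

    Hypothetical helper for illustration; the kernel does its own parsing.
    """
    if not pattern.startswith("|"):
        raise ValueError("not a pipe core_pattern")
    helper, *args = pattern[1:].split()
    return helper, args


if __name__ == "__main__":
    helper, args = parse_pipe_pattern(
        "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h"
    )
    print(helper)
    for arg in args:
        print(arg, "->", SPECIFIERS.get(arg, "unknown specifier"))
```
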
>
> In the example core_pattern shown above systemd-coredump is spawned as a
> usermode helper. There are various conceptual consequences of this
> (non-exhaustive list):
>
> - systemd-coredump is spawned with file descriptor number 0 (stdin)
> connected to the read-end of the pipe. All other file descriptors are
> closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> already caused bugs because userspace assumed that this cannot happen
> (whether or not this is a sane assumption is irrelevant).
>
> - systemd-coredump will be spawned as a child of system_unbound_wq. So
> it is not a child of any userspace process and specifically not a
> child of PID 1. It cannot be waited upon and is a weird hybrid
> upcall, which is difficult for userspace to control correctly.
>
> - systemd-coredump is spawned with full kernel privileges. This
> necessitates all kinds of weird privilege dropping exercises in
> userspace to make this safe.
>
> - A new usermode helper has to be spawned for each crashing process.
>
> This series adds a new mode:
>
> (3) Dumping into an abstract AF_UNIX socket.
>
> Userspace can set /proc/sys/kernel/core_pattern to:
>
> @address SO_COOKIE
>
> The "@" at the beginning indicates to the kernel that the abstract
> AF_UNIX coredump socket will be used to process coredumps. The address
> is given by @address and must be followed by the socket cookie of the
> coredump listening socket.
>
> The socket cookie is used to verify the socket connection. If the
> coredump server restarts or crashes and someone recycles the socket
> address the kernel will detect that the address has been recycled as the
> socket cookie will have necessarily changed and refuse to connect.
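
For illustration only (not part of the patch): a rough Python sketch of how a coredump server might bind an abstract AF_UNIX listener, read its socket cookie, and build the "@address SO_COOKIE" string described above. Several things here are assumptions: the address name, the helper names, the numeric fallback 57 for SO_COOKIE (the asm-generic value; some architectures differ), and the decimal formatting of the cookie, which the text above does not specify.

```python
import socket
import struct

# Python's socket module may not export SO_COOKIE; 57 is the asm-generic
# value on Linux (assumption: not valid on every architecture).
SO_COOKIE = getattr(socket, "SO_COOKIE", 57)


def open_coredump_listener(name):
    """Bind and listen on the abstract AF_UNIX address `name` (Linux only)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # A leading NUL byte selects the abstract namespace.
    sock.bind(b"\0" + name.encode())
    sock.listen(16)
    return sock


def socket_cookie(sock):
    """Return the kernel-assigned 64-bit cookie of `sock`."""
    raw = sock.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
    return struct.unpack("Q", raw)[0]


def make_core_pattern(name, cookie):
    """Format '@address SO_COOKIE'; decimal cookie is an assumption."""
    return f"@{name} {cookie}"


if __name__ == "__main__":
    listener = open_coredump_listener("example-coredump")  # hypothetical name
    print(make_core_pattern("example-coredump", socket_cookie(listener)))
```

Because a freshly created socket gets a new cookie, a server that restarts and rebinds the same address would have to write an updated pattern before the kernel would connect to it again.
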
>
> The coredump socket is located in the initial network namespace. When a
> task coredumps it opens a client socket in the initial network namespace
> and connects to the coredump socket.
>
> - The coredump server uses SO_PEERPIDFD to get a stable handle on the
> connected crashing task. The retrieved pidfd will provide a stable
> reference even if the crashing task gets SIGKILLed while generating
> the coredump.
>
> - By setting core_pipe_limit non-zero userspace can guarantee that the
> crashing task cannot be reaped behind its back and thus process all
> necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> detect whether /proc/<pid> still refers to the same process.
>
> The core_pipe_limit isn't used to rate-limit connections to the
> socket. This can simply be done via AF_UNIX sockets directly.
>
> - The pidfd for the crashing task will grow new information about how the
> task coredumps.
>
> - The coredump server should mark itself as non-dumpable.
>
> - A container coredump server in a separate network namespace can simply
> bind to another well-known address and systemd-coredump forwards
> coredumps to the container.
>
> - Coredumps could in the future also be handled via per-user/session
> coredump servers that run only with that user's privileges.
>
> The coredump server listens on the coredump socket and accepts a
> new coredump connection. It then retrieves SO_PEERPIDFD for the
> client, inspects uid/gid and hands the accepted client to the user's
> own coredump handler, which runs with the user's privileges only
> (it must of course pay close attention to not forward crashing suid
> binaries).
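
For illustration only (not part of the patch): a minimal Python sketch of what the server side might do with an accepted connection to identify its peer. Using SO_PEERCRED to obtain uid/gid is an assumption (the text only says the server "inspects uid/gid"), as are the function name peer_identity and the numeric fallback 77 for SO_PEERPIDFD (the asm-generic value). On kernels without SO_PEERPIDFD the sketch returns None for the pidfd.

```python
import os
import socket
import struct

# Fallback is the asm-generic value (assumption: differs on some arches).
SO_PEERPIDFD = getattr(socket, "SO_PEERPIDFD", 77)


def peer_identity(conn):
    """Return (pid, uid, gid, pidfd) for the peer of a connected AF_UNIX socket.

    pidfd is None if the running kernel does not support SO_PEERPIDFD.
    The caller owns the returned pidfd and must close it.
    """
    # struct ucred is { pid_t pid; uid_t uid; gid_t gid; }
    fmt = "iII"
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                          struct.calcsize(fmt))
    pid, uid, gid = struct.unpack(fmt, raw)
    try:
        pidfd = conn.getsockopt(socket.SOL_SOCKET, SO_PEERPIDFD)
    except OSError:
        pidfd = None
    return pid, uid, gid, pidfd


if __name__ == "__main__":
    # A socketpair stands in for an accepted coredump connection.
    a, b = socket.socketpair()
    pid, uid, gid, pidfd = peer_identity(a)
    print(pid, uid, gid, pidfd)
    if pidfd is not None:
        os.close(pidfd)
```
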
>
> The new coredump socket will allow userspace to not have to rely on
> usermode helpers for processing coredumps and provides a safer way to
> handle them instead of relying on super-privileged coredumping helpers
> that have caused and continue to cause significant CVEs.
>
> This will also be significantly more lightweight since no fork()+exec()
> for the usermode helper is required for each crashing process. The
> coredump server in userspace can, e.g., just keep a worker pool.
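
For illustration only (not part of the patch): a small Python sketch of the worker-pool idea, where one long-lived server accepts connections and hands each to a pool thread instead of spawning a helper per crash. The names drain and serve are hypothetical, and a real server would store or analyze the stream rather than only count bytes.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def drain(conn):
    """Read one coredump stream to EOF and return its size in bytes."""
    total = 0
    with conn:
        while chunk := conn.recv(65536):
            total += len(chunk)
    return total


def serve(listener, pool, max_conns):
    """Accept `max_conns` connections and process each in the worker pool.

    Returns the per-connection byte counts in accept order.
    """
    futures = []
    for _ in range(max_conns):
        conn, _addr = listener.accept()
        futures.append(pool.submit(drain, conn))
    return [f.result() for f in futures]


if __name__ == "__main__":
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(b"\0example-coredump-pool")  # hypothetical abstract address
    listener.listen(16)
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(serve(listener, pool, max_conns=1))
```
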
>
> Signed-off-by: Christian Brauner <brauner@...nel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
Thanks!