linux-kernel - Re: [PATCH RFC v2 3/6] coredump: support AF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAG48ez1w+25tbSPPU6=z1rWRm3ZXuGq0ypq4jffhzUva9Bwazw@mail.gmail.com>
Date: Fri, 2 May 2025 16:04:32 +0200
From: Jann Horn <jannh@...gle.com>
To: Christian Brauner <brauner@...nel.org>
Cc: Eric Dumazet <edumazet@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>, 
	Oleg Nesterov <oleg@...hat.com>, linux-fsdevel@...r.kernel.org, 
	"David S. Miller" <davem@...emloft.net>, Alexander Viro <viro@...iv.linux.org.uk>, 
	Daan De Meyer <daan.j.demeyer@...il.com>, David Rheinsberg <david@...dahead.eu>, 
	Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>, 
	Lennart Poettering <lennart@...ttering.net>, Luca Boccassi <bluca@...ian.org>, Mike Yuan <me@...dnzj.com>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	Zbigniew Jędrzejewski-Szmek <zbyszek@...waw.pl>, 
	linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets

On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@...nel.org> wrote:
> diff --git a/fs/coredump.c b/fs/coredump.c
[...]
> @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 }
>                 break;
>         }
> +       case COREDUMP_SOCK: {
> +               struct file *file __free(fput) = NULL;
> +#ifdef CONFIG_UNIX
> +               ssize_t addr_size;
> +               struct sockaddr_un unix_addr = {
> +                       .sun_family = AF_UNIX,
> +               };
> +               struct sockaddr_storage *addr;
> +
> +               /*
> +                * TODO: We need to really support core_pipe_limit to
> +                * prevent the task from being reaped before userspace
> +                * had a chance to look at /proc/<pid>.
> +                *
> +                * I need help from the networking people (or maybe Oleg
> +                * also knows?) how to do this.
> +                *
> +                * IOW, we need to wait for the other side to shutdown
> +                * the socket/terminate the connection.
> +                *
> +                * We could just read but then userspace could sent us
> +                * SCM_RIGHTS and we just shouldn't need to deal with
> +                * any of that.
> +                */

I don't think userspace can send you SCM_RIGHTS if you don't do a
recvmsg() with a control data buffer?

> +               if (WARN_ON_ONCE(core_pipe_limit)) {
> +                       retval = -EINVAL;
> +                       goto close_fail;
> +               }
> +
> +               retval = strscpy(unix_addr.sun_path, cn.corename, sizeof(unix_addr.sun_path));
> +               if (retval < 0)
> +                       goto close_fail;
> +               addr_size = offsetof(struct sockaddr_un, sun_path) + retval + 1,
> +
> +               file = __sys_socket_file(AF_UNIX, SOCK_STREAM, 0);
> +               if (IS_ERR(file))
> +                       goto close_fail;
> +
> +               /*
> +                * It is possible that the userspace process which is
> +                * supposed to handle the coredump and is listening on
> +                * the AF_UNIX socket coredumps. This should be fine
> +                * though. If this was the only process which was
> +                * listen()ing on the AF_UNIX socket for coredumps it
> +                * obviously won't be listen()ing anymore by the time it
> +                * gets here. So the __sys_connect_file() call will
> +                * often fail with ECONNREFUSED and the coredump.

Why will the server not be listening anymore? Have the task's file
descriptors already been closed by the time we get here?

(Maybe just get rid of this comment, I agree with the following
comment saying we should let userspace deal with this.)

> +                * In general though, userspace should just mark itself
> +                * non dumpable and not do any of this nonsense. We
> +                * shouldn't work around this.
> +                */
> +               addr = (struct sockaddr_storage *)(&unix_addr);
> +               retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);

Have you made an intentional decision on whether you want to connect
to a unix domain socket with a path relative to current->fs->root (so
that containers can do their own core dump handling) or relative to
the root namespace root (so that core dumps always reach the init
namespace's core dumping even if a process sandboxes itself with
namespaces or such)? Also, I think this connection attempt will be
subject to restrictions imposed by (for example) Landlock or AppArmor,
I'm not sure if that is desired here (since this is not actually a
connection that the process in whose context the call happens decided
to make, it's something the system administrator decided to do, and
especially with Landlock, policies are controlled by individual
applications that may not know how core dumps work on the system).

I guess if we keep the current behavior where the socket path is
namespaced, then we also need to keep the security checks, since an
unprivileged user could probably set up a namespace and chroot() to a
place where the socket path (indirectly, through a symlink) refers to
an arbitrary socket...

An alternative design might be to directly register the server socket
on the userns/mountns/netns or such in some magic way, and then have
the core dumping walk up the namespace hierarchy until it finds a
namespace that has opted in to using its own core dumping socket, and
connect to that socket bypassing security checks. (A bit like how
namespaced binfmt_misc works.) Like, maybe userspace with namespaced
CAP_SYS_ADMIN could bind() to some magic UNIX socket address, or use
some new setsockopt() on the socket or such, to become the handler of
core dumps? This would also have the advantage that malicious
userspace wouldn't be able to send fake bogus core dumps to the
server, and the server would provide clear consent to being connected
to without security checks at connection time.

> +               if (retval)
> +                       goto close_fail;
> +
> +               /* The peer isn't supposed to write and we for sure won't read. */
> +               retval =  __sys_shutdown_sock(sock_from_file(file), SHUT_RD);
> +               if (retval)
> +                       goto close_fail;
> +
> +               cprm.limit = RLIM_INFINITY;
> +#endif
> +               cprm.file = no_free_ptr(file);
> +               break;
> +       }
>         default:
>                 WARN_ON_ONCE(true);
>                 retval = -EINVAL;
> @@ -818,7 +925,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                  * have this set to NULL.
>                  */
>                 if (!cprm.file) {
> -                       coredump_report_failure("Core dump to |%s disabled", cn.corename);
> +                       if (cn.core_type == COREDUMP_PIPE)
> +                               coredump_report_failure("Core dump to |%s disabled", cn.corename);
> +                       else
> +                               coredump_report_failure("Core dump to :%s disabled", cn.corename);
>                         goto close_fail;
>                 }
>                 if (!dump_vma_snapshot(&cprm))
> @@ -839,8 +949,25 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>                 file_end_write(cprm.file);
>                 free_vma_snapshot(&cprm);
>         }
> -       if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit)
> -               wait_for_dump_helpers(cprm.file);
> +
> +       if (core_pipe_limit) {
> +               switch (cn.core_type) {
> +               case COREDUMP_PIPE:
> +                       wait_for_dump_helpers(cprm.file);
> +                       break;
> +               case COREDUMP_SOCK: {
> +                       /*
> +                        * TODO: Wait for the coredump handler to shut
> +                        * down the socket so we prevent the task from
> +                        * being reaped.
> +                        */

Hmm, I'm no expert but maybe you could poll for the POLLRDHUP event...
though that might require writing your own helper with a loop that
does vfs_poll() and waits for a poll wakeup, since I don't think there
is a kernel helper analogous to a synchronous poll() syscall yet.

> +                       break;
> +               }
> +               default:
> +                       break;
> +               }
> +       }
> +
>  close_fail:
>         if (cprm.file)
>                 filp_close(cprm.file, NULL);