Message-ID: <CAG48ez3xYzzazbxcHKEFzj9DDMOrnVf1cfjNpwE_FAY-YhtHmw@mail.gmail.com>
Date: Fri, 2 May 2025 22:23:44 +0200
From: Jann Horn <jannh@...gle.com>
To: Christian Brauner <brauner@...nel.org>
Cc: Eric Dumazet <edumazet@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>,
Oleg Nesterov <oleg@...hat.com>, linux-fsdevel@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>, Alexander Viro <viro@...iv.linux.org.uk>,
Daan De Meyer <daan.j.demeyer@...il.com>, David Rheinsberg <david@...dahead.eu>,
Jakub Kicinski <kuba@...nel.org>, Jan Kara <jack@...e.cz>,
Lennart Poettering <lennart@...ttering.net>, Luca Boccassi <bluca@...ian.org>, Mike Yuan <me@...dnzj.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Zbigniew Jędrzejewski-Szmek <zbyszek@...waw.pl>,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH RFC v2 3/6] coredump: support AF_UNIX sockets
On Fri, May 2, 2025 at 10:11 PM Christian Brauner <brauner@...nel.org> wrote:
> On Fri, May 02, 2025 at 04:04:32PM +0200, Jann Horn wrote:
> > On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@...nel.org> wrote:
> > > diff --git a/fs/coredump.c b/fs/coredump.c
> > [...]
> > > @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > > }
> > > break;
> > > }
> > > + case COREDUMP_SOCK: {
> > > + struct file *file __free(fput) = NULL;
> > > +#ifdef CONFIG_UNIX
> > > + ssize_t addr_size;
> > > + struct sockaddr_un unix_addr = {
> > > + .sun_family = AF_UNIX,
> > > + };
> > > + struct sockaddr_storage *addr;
> > > +
> > > + /*
> > > + * TODO: We need to really support core_pipe_limit to
> > > + * prevent the task from being reaped before userspace
> > > + * had a chance to look at /proc/<pid>.
> > > + *
> > > + * I need help from the networking people (or maybe Oleg
> > > + * also knows?) how to do this.
> > > + *
> > > + * IOW, we need to wait for the other side to shutdown
> > > + * the socket/terminate the connection.
> > > + *
> > > + * We could just read, but then userspace could send us
> > > + * SCM_RIGHTS and we just shouldn't need to deal with
> > > + * any of that.
> > > + */
> >
> > I don't think userspace can send you SCM_RIGHTS if you don't do a
> > recvmsg() with a control data buffer?
>
> Oh hm, then maybe just a regular read at the end would work. As soon as
> userspace sends us anything or we get a close event, we just disconnect.
>
> But btw, I think we really need a recvmsg() flag that allows a receiver
> to refuse SCM_RIGHTS/file descriptors being sent to it. IIRC, right
> now this is a real issue that systemd works around by always calling its
> cmsg_close_all() helper after each recvmsg() to ensure that no one sent
> it file descriptors it didn't want. The problem there is that someone
> could have sent it an fd to a hanging NFS server or something and then
> it would hang in close() even though it never even wanted any file
> descriptors in the first place.
Would a recvmsg() flag really solve that aspect of NFS hangs? By the
time you read from the socket, the file is already attached to an SKB
queued up on the socket, and cleaning up the file is your task's
responsibility either way: either the kernel does it for you if you
don't read it into a control message, or userspace has to do it if the
fd was handed off through a control message. The process that sent the
file to you might already be gone; it can't be on the hook for cleaning
up the file anymore.

I think the thorough fix would probably be to introduce a socket
option (controlled via setsockopt()) that rejects SCM_RIGHTS already
at the peer's sendmsg().
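
For reference, the userspace workaround you describe looks roughly like
the sketch below (minimal and untested; the helper name and the 16-fd
control buffer size are made up, the syscalls and cmsg macros are the
standard ones):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t recv_dropping_fds(int sock, void *buf, size_t len)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        char cbuf[CMSG_SPACE(sizeof(int) * 16)];
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = cbuf,
                .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg;
        ssize_t n;

        n = recvmsg(sock, &msg, MSG_CMSG_CLOEXEC);
        if (n < 0)
                return n;

        /* Close any descriptors the peer passed via SCM_RIGHTS. */
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                if (cmsg->cmsg_level != SOL_SOCKET ||
                    cmsg->cmsg_type != SCM_RIGHTS)
                        continue;

                int nfds = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
                int fds[16];

                memcpy(fds, CMSG_DATA(cmsg), nfds * sizeof(int));
                for (int i = 0; i < nfds; i++)
                        close(fds[i]);  /* this close() is where a dead-NFS fd hangs */
        }
        return n;
}

Note that by the time this runs, the receiving process already owns
those files, which is why I think the rejection would have to happen at
the sender's sendmsg() instead.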
> > > + * In general though, userspace should just mark itself
> > > + * non-dumpable and not do any of this nonsense. We
> > > + * shouldn't work around this.
> > > + */
> > > + addr = (struct sockaddr_storage *)(&unix_addr);
> > > + retval = __sys_connect_file(file, addr, addr_size, O_CLOEXEC);
> >
> > Have you made an intentional decision on whether you want to connect
> > to a unix domain socket with a path relative to current->fs->root (so
> > that containers can do their own core dump handling) or relative to
> > the root namespace root (so that core dumps always reach the init
> > namespace's core dump handler even if a process sandboxes itself with
> > namespaces or such)? Also, I think this connection attempt will be
>
> Fsck no. :) I just jotted this down as an RFC. Details below.
>
> > subject to restrictions imposed by (for example) Landlock or AppArmor;
> > I'm not sure if that is desired here (since this is not actually a
> > connection that the process in whose context the call happens decided
> > to make but something the system administrator decided to do, and
> > especially with Landlock, policies are controlled by individual
> > applications that may not know how core dumps work on the system).
> >
> > I guess if we keep the current behavior where the socket path is
> > namespaced, then we also need to keep the security checks, since an
> > unprivileged user could probably set up a namespace and chroot() to a
> > place where the socket path (indirectly, through a symlink) refers to
> > an arbitrary socket...
> >
> > An alternative design might be to directly register the server socket
> > on the userns/mountns/netns or such in some magic way, and then have
> > the core dump code walk up the namespace hierarchy until it finds a
> > namespace that has opted in to using its own core dump socket, and
> > connect to that socket bypassing security checks. (A bit like how
> > namespaced binfmt_misc works.) Like, maybe userspace with namespaced
>
> Yeah, I namespaced that thing. :)
Oh, hah, sorry, I forgot that was you.
> > CAP_SYS_ADMIN could bind() to some magic UNIX socket address, or use
> > some new setsockopt() on the socket or such, to become the handler of
> > core dumps? This would also have the advantage that malicious
> > userspace wouldn't be able to send bogus core dumps to the
> > server, and the server would provide clear consent to being connected
> > to without security checks at connection time.
>
> I think that's policy that I absolutely don't want the kernel to get
> involved in unless absolutely necessary. A few days ago I just discussed
> this at length with Lennart and the issue is that systemd would want to
> see all coredumps on the system independent of the namespace they're
> created in. To have a per-namespace (userns/mountns/netns) coredump
> socket would invalidate that one way or the other and end up hiding
> coredumps from the administrator unless there's some elaborate scheme
> where it doesn't.
>
> systemd-coredump (and Apport fwiw) has infrastructure to forward
> coredumps to individual services and containers and it's already based
> on AF_UNIX afaict. And I really like that it's the job of userspace to
> deal with this instead of the kernel having to get involved in that
> mess.
>
> So all of this should be relative to the initial namespace. I want a
Ah, sounds good.
> separate security hook though so that an LSM can be used to prevent
> processes from connecting to the coredump socket.
>
> My idea has been that systemd-coredump could use a bpf lsm program that
> would allow aborting a coredump before the crashing process connects to
> the socket, again making this a userspace policy issue.
I don't understand this part. Why would you need an LSM to prevent a
crashing process from connecting? Can't the coredumping server process
apply whatever filtering it wants in userspace?
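
Just to make sure I'm picturing the same thing: I assume the bpf side
would be shaped vaguely like this? (Purely an illustrative sketch; it
abuses the existing socket_connect LSM hook as a stand-in for the
dedicated hook you mention, and the socket path and the policy knob are
invented.)

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define AF_UNIX 1
#define EPERM   1

char LICENSE[] SEC("license") = "GPL";

/* Flipped by the userspace loader to veto coredump connections. */
const volatile __u32 coredumps_disabled;

/* Invented path of the coredump socket. */
static const char coredump_path[] = "/run/host/coredump.socket";

SEC("lsm/socket_connect")
int BPF_PROG(gate_coredump_connect, struct socket *sock,
             struct sockaddr *address, int addrlen, int ret)
{
        char path[sizeof(coredump_path)] = {};

        if (ret)        /* an earlier program already denied this */
                return ret;
        if (address->sa_family != AF_UNIX)
                return 0;

        bpf_probe_read_kernel_str(path, sizeof(path),
                                  ((struct sockaddr_un *)address)->sun_path);
        for (int i = 0; i < sizeof(coredump_path); i++)
                if (path[i] != coredump_path[i])
                        return 0;

        return coredumps_disabled ? -EPERM : 0;
}

Because if that's roughly the shape of it, the server could get the same
effect by just refusing or closing the connection in userspace.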