linux-kernel - Re: [PATCH v3 6/7] seccomp: allow nested listeners


Message-ID: <CALCETrXbTjD4ChWnhH8tRZ8MSGQqfueQPwv9EvT_aQ8gnnfaEw@mail.gmail.com>
Date: Wed, 21 Jan 2026 09:59:32 -0800
From: Andy Lutomirski <luto@...capital.net>
To: Aleksa Sarai <cyphar@...har.com>
Cc: Andrei Vagin <avagin@...il.com>, 
	Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>, kees@...nel.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, bpf@...r.kernel.org, 
	Will Drewry <wad@...omium.org>, Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>, 
	Tycho Andersen <tycho@...ho.pizza>, Christian Brauner <brauner@...nel.org>, 
	Stéphane Graber <stgraber@...raber.org>, 
	Alexander Mikhalitsyn <alexander@...alicyn.com>
Subject: Re: [PATCH v3 6/7] seccomp: allow nested listeners

On Wed, Jan 21, 2026 at 7:43 AM Aleksa Sarai <cyphar@...har.com> wrote:
>
> On 2026-01-20, Andrei Vagin <avagin@...il.com> wrote:
> > On Thu, Dec 11, 2025 at 4:46 AM Alexander Mikhalitsyn
> > <aleksandr.mikhalitsyn@...onical.com> wrote:
> > >
> > > Now everything is ready to get rid of the "only one listener per
> > > tree" limitation.
> > >
> > > Let's introduce a new uAPI flag
> > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly
> > > allow nested listeners when installing a listener.
> >
> > I am not sure we really need SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS.
> > If nested listeners are completely functional, why would we want to
> > implicitly allow or disallow someone from using them?
>
> It can be quite easy to deadlock a process using seccomp-notify (even
> in the single-notifier case) so especially in the case of container
> managers I can see the argument for wanting this to be an opt-in thing
> once container runtimes have verified their notifier won't break
> nesting.

Is the deadlock such that a process and its manager can deadlock in a
way that's hard to kill?  Or is there some problem that could
adversely affect an outer manager?  It would be nice for these
features to be automatic instead of opt in.

(I just wasted half an hour yesterday removing a use of
unshare(CLONE_FILES) from a program that wouldn't run under a container
manager that, for some reason, treated it as a sensitive syscall.)

--Andy

>
> > Actually, even the current behavior of SECCOMP_RET_USER_NOTIF looks a
> > bit illogical. I think the following behavior would be more expected:
> > instead of running all filters and picking the most restrictive result,
> > the kernel should execute them one by one (most recent first). If a filter
> > returns USER_NOTIF, the kernel pauses immediately to let the listener
> > handle the call. If that listener then issues "CONTINUE", the kernel
> > resumes by running the remaining older filters in the chain.
>
> I guess there is a philosophical argument that earlier filters are "more
> trusted" but the seccomp security model has always been that the
> strictest filter return wins and I don't really see a strong argument
> for deviating from that for USER_NOTIF.
>
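
For reference, the one-by-one evaluation proposed in the quoted text can
be sketched as a toy user-space model in Python. This is not kernel
code. The function name, the (filter, listener) pairing, the "CONTINUE"
string and the returned tuples are all made up for illustration, and
stopping at any non-ALLOW, non-USER_NOTIF result is an assumption the
proposal does not spell out.

```python
# Toy model only; the two action values are taken from <linux/seccomp.h>.
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000


def run_sequentially(filters, syscall):
    """Evaluate filters one at a time, most recent (innermost) first.

    filters: list of (filter_fn, listener_fn) pairs (hypothetical shape).
      filter_fn(syscall)   -> a seccomp return value
      listener_fn(syscall) -> "CONTINUE" or an emulated return value
    """
    for filter_fn, listener_fn in filters:
        ret = filter_fn(syscall)
        action = ret & SECCOMP_RET_ACTION_FULL
        if action == SECCOMP_RET_ALLOW:
            continue  # this filter has no objection; ask the next older one
        if action == SECCOMP_RET_USER_NOTIF:
            # Pause here and let this filter's listener decide.
            reply = listener_fn(syscall)
            if reply == "CONTINUE":
                continue  # resume with the remaining older filters
            return ("emulated", reply)  # older filters never see the call
        # Assumption: any other action ends evaluation immediately.
        return ("action", ret)
    return ("allow", None)
```

In this model an inner listener that emulates the syscall hides it from
every outer filter, while an inner listener that replies "CONTINUE"
hands the call on to them.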

I don't know if I agree with that philosophy.  I would think the best
philosophy is that, when filters are nested, the innermost filter +
filtered task combination acts as a unit that is filtered by the outer
filter.

Without notifiers and without filters that overwrite errno, I think
strictest-wins is a decent approximation -- the choices are kill or
allow, although one might quibble about the various forms of "kill".
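
As a point of reference, strictest-wins can be sketched as a toy
user-space model in Python. This is not kernel code. The action values
are the ones from <linux/seccomp.h>; the helper names are invented here,
and the signed comparison mirrors, as far as I understand it, how the
kernel ranks actions so that KILL_PROCESS sorts below everything else.

```python
# Toy model only; action values as defined in <linux/seccomp.h>.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000


def action_only(ret):
    """Drop the data bits and reinterpret the action as a signed 32-bit int."""
    action = ret & SECCOMP_RET_ACTION_FULL
    return action - (1 << 32) if action & 0x80000000 else action


def strictest_wins(results):
    """results: return values of all attached filters, in any order."""
    winner = SECCOMP_RET_ALLOW
    for ret in results:
        # A numerically lower (signed) action is the more restrictive one.
        if action_only(ret) < action_only(winner):
            winner = ret
    return winner
```

Every filter runs, and only the most restrictive result takes effect;
the order in which the filters were installed plays no role in the
outcome.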

With SECCOMP_RET_ERRNO, I would argue that the behavior would be
superior if we just stopped processing filters after an inner filter
returned SECCOMP_RET_ERRNO.  After all, the effect is to do no syscall
at all, and having a process that didn't do a syscall get killed
because it tried a bad syscall is kind of weird.
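
To illustrate the difference, here is a toy user-space model in Python
comparing the two policies. This is not kernel code. The function names
are invented for illustration, and "stop at the first inner ERRNO" is
only one possible reading of the idea above.

```python
# Toy model only; action values as defined in <linux/seccomp.h>.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000
EPERM = 1


def action_only(ret):
    """Drop the data bits and reinterpret the action as a signed 32-bit int."""
    action = ret & SECCOMP_RET_ACTION_FULL
    return action - (1 << 32) if action & 0x80000000 else action


def strictest_wins(results):
    """Current policy: every filter runs, the lowest signed action wins."""
    winner = SECCOMP_RET_ALLOW
    for ret in results:
        if action_only(ret) < action_only(winner):
            winner = ret
    return winner


def stop_at_inner_errno(results):
    """Hypothetical policy. results must be ordered innermost filter first."""
    winner = SECCOMP_RET_ALLOW
    for ret in results:
        if action_only(ret) < action_only(winner):
            winner = ret
        if (ret & SECCOMP_RET_ACTION_FULL) == SECCOMP_RET_ERRNO:
            break  # no syscall will happen, so outer filters are not consulted
    return winner


inner = SECCOMP_RET_ERRNO | EPERM   # inner filter: fail the call with EPERM
outer = SECCOMP_RET_KILL_PROCESS    # outer filter: kill on this syscall
print(hex(strictest_wins([inner, outer])))       # 0x80000000 (killed)
print(hex(stop_at_inner_errno([inner, outer])))  # 0x50001 (EPERM returned)
```

Under the hypothetical policy the task simply sees EPERM; under
strictest-wins it is killed for a syscall that would never have run.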

With notifiers, this is all rather more complex.  Notifiers can
emulate syscalls, and having an outer notifier somehow process the
syscall that was replaced by an inner notifier seems rather weird.  Or
suppose that an outer filter wants to prevent some operation, but an
inner system wants to emulate it in a way that doesn't do the
offending syscall, why not allow it?

So I'd argue for considering changing the behavior for everything,
maybe optionally?  I'm not really sure where TRACE fits in.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC