linux-kernel - Re: [PATCH v3 6/7] seccomp: allow nested listeners

Message-ID: <CANaxB-yqvHx7rWK4Efq8869ai6TEd3SCLDCLrHq5rhXGPNV-1g@mail.gmail.com>
Date: Thu, 22 Jan 2026 22:26:55 -0800
From: Andrei Vagin <avagin@...il.com>
To: Aleksa Sarai <cyphar@...har.com>
Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>, kees@...nel.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, bpf@...r.kernel.org, 
	Will Drewry <wad@...omium.org>, Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>, 
	Tycho Andersen <tycho@...ho.pizza>, Christian Brauner <brauner@...nel.org>, 
	Stéphane Graber <stgraber@...raber.org>, 
	Alexander Mikhalitsyn <alexander@...alicyn.com>, Andy Lutomirski <luto@...capital.net>
Subject: Re: [PATCH v3 6/7] seccomp: allow nested listeners

On Wed, Jan 21, 2026 at 9:59 AM Andy Lutomirski <luto@...capital.net> wrote:
>
> On Wed, Jan 21, 2026 at 7:43 AM Aleksa Sarai <cyphar@...har.com> wrote:
> >
> > On 2026-01-20, Andrei Vagin <avagin@...il.com> wrote:
> > > On Thu, Dec 11, 2025 at 4:46 AM Alexander Mikhalitsyn
> > > <aleksandr.mikhalitsyn@...onical.com> wrote:
> > > >
> > > > Now everything is ready to get rid of "only one listener per tree"
> > > > limitation.
> > > >
> > > > Let's introduce a new uAPI flag
> > > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly
> > > > allow nested listeners when installing a listener.
> > >
> > > I am not sure we really need SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS.
> > > If nested listeners are completely functional, why would we want to
> > > implicitly allow or disallow someone from using them?
> >
> > It can be quite easy to deadlock a process using seccomp-notify (even
> > in the single-notifier case) so especially in the case of container
> > managers I can see the argument for wanting this to be an opt-in thing
> > once container runtimes have verified their notifier won't break
> > nesting.
>
> Is the deadlock such that a process and its manager can deadlock in a
> way that's hard to kill?  Or is there some problem that could
> adversely affect an outer manager?  It would be nice for these
> features to be automatic instead of opt in.

Both a process and its manager can always be killed with SIGKILL.
I’m not sure I follow the specific deadlock Aleksa is referring to here.
In my view, an outer manager should not care about syscalls that the
processes it manages intercept and handle themselves. The outer manager
should be triggered only when a syscall is going to be executed "natively".
This overlaps with the second part of the discussion below.

BTW: If a user wants to prevent the usage of seccomp notify, they can
always install a seccomp filter that rejects the seccomp syscall called
with SECCOMP_FILTER_FLAG_NEW_LISTENER.
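
As a rough illustration of the check such a filter would perform, here is a
small Python model of the decision logic. It is only a sketch of the
predicate, not an installable BPF program. The function name `decide` and the
tuple return value are made up for this example, and the syscall number is the
x86_64 one.

```python
import errno

# Values from the Linux UAPI headers. NR_SECCOMP is the x86_64 syscall
# number; other architectures use different numbers.
NR_SECCOMP = 317
SECCOMP_SET_MODE_FILTER = 1
SECCOMP_FILTER_FLAG_NEW_LISTENER = 1 << 3


def decide(nr, args):
    """Model of the filter verdict for one syscall (hypothetical helper).

    seccomp(2) takes (operation, flags, args), so args[0] is the operation
    and args[1] is the flags word.
    """
    if (nr == NR_SECCOMP
            and args[0] == SECCOMP_SET_MODE_FILTER
            and args[1] & SECCOMP_FILTER_FLAG_NEW_LISTENER):
        # Refuse only filter installs that ask for a notification fd.
        return ("ERRNO", errno.EPERM)
    return ("ALLOW", 0)
```

A real filter would also have to check the architecture field of
`struct seccomp_data` before trusting the syscall number.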

>
> (I just wasted half an hour yesterday removing use of
> unshare(CLONE_FILES) from a program that didn't run under a container
> manager that, for some reason, thought that was a sensitive syscall.)
>
> --Andy
>
> >
> > > Actually, even the current behavior of SECCOMP_RET_USER_NOTIF looks a
> > > bit illogical. I think the following behavior would be more expected:
> > > instead of running all filters and picking the most restrictive result,
> > > the kernel should execute them one by one (most recent first). If a filter
> > > returns USER_NOTIF, the kernel pauses immediately to let the listener
> > > handle the call. If that listener then issues "CONTINUE", the kernel
> > > resumes by running the remaining older filters in the chain.
> >
> > I guess there is a philosophical argument that earlier filters are "more
> > trusted" but the seccomp security model has always been that the
> > strictest filter return wins and I don't really see a strong argument
> > for deviating from that for USER_NOTIF.
> >
>
> I don't know if I agree with that philosophy.  I would think the best
> philosophy is that, when filters are nested, the innermost filter +
> filtered task combination acts as a unit that is filtered by the outer
> filter.
>
> Without notifiers and without filters that overwrite errno, I think
> strictest-wins is a decent approximation -- the choices are kill or
> allow, although one might quibble about the various forms of "kill".
>
> With SECCOMP_RET_ERRNO, I would argue that the behavior would be
> superior if we just stopped processing filters after an inner filter
> returned SECCOMP_RET_ERRNO.  After all, the effect is to do no syscall
> at all, and having a process that didn't do a syscall get killed
> because it tried a bad syscall is kind of weird.
>
> With notifiers, this is all rather more complex.  Notifiers can
> emulate syscalls, and having an outer notifier somehow process the
> syscall that was replaced by an inner notifier seems rather weird.  Or
> suppose that an outer filter wants to prevent some operation, but an
> inner system wants to emulate it in a way that doesn't do the
> offending syscall, why not allow it?
>
> So I'd argue for considering changing the behavior for everything,
> maybe optionally?  I'm not really sure where TRACE fits in.
>

gVisor (a user-mode kernel similar to User-Mode Linux) is a real-world
example that is impacted by the current seccomp behavior. The gVisor
systrap platform uses seccomp to intercept guest syscalls so they can
be handled by the Sentry (the gVisor kernel). All guest syscalls are
managed by the Sentry and are never executed natively.
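
The difference between the two evaluation orders discussed in this thread can
be sketched with a toy Python model. Nothing here is kernel code. The names
`strictest_wins`, `newest_first`, `inner` and `outer` are invented for the
example, and the rule that any other non-ALLOW action ends the newest-first
walk is a simplifying assumption, not something the thread specifies.

```python
from enum import IntEnum


class Action(IntEnum):
    # Lower value = higher precedence, following the order in seccomp(2).
    KILL_PROCESS = 0
    KILL_THREAD = 1
    TRAP = 2
    ERRNO = 3
    USER_NOTIF = 4
    TRACE = 5
    LOG = 6
    ALLOW = 7


def strictest_wins(filters, syscall):
    """Current behavior: run every filter, keep the highest-precedence result."""
    return min((f(syscall) for f in filters), default=Action.ALLOW)


def newest_first(filters, listeners, syscall):
    """Proposed order: walk (name, filter) pairs from newest to oldest.

    listeners[name](syscall) returns True for CONTINUE and False when the
    listener emulated the syscall. Returns (outcome, filters consulted).
    """
    consulted = []
    for name, f in filters:
        consulted.append(name)
        action = f(syscall)
        if action is Action.USER_NOTIF:
            if listeners[name](syscall):
                continue  # CONTINUE: fall through to the older filters
            return ("emulated by " + name, consulted)
        if action is not Action.ALLOW:
            return (action.name, consulted)  # assumption: stop here
    return ("ALLOW", consulted)


# A gVisor-like inner filter traps everything; an outer manager rejects unshare.
def inner(syscall):
    return Action.USER_NOTIF


def outer(syscall):
    return Action.ERRNO if syscall == "unshare" else Action.ALLOW
```

With these two filters, `strictest_wins([inner, outer], "unshare")` yields
`Action.ERRNO`, so the inner listener never sees the call. With
`newest_first` and an inner listener that emulates the call, the outer filter
is not consulted at all.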

Thanks,
Andrei