linux-kernel - Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aDmvpOMlaAZOXrji@google.com>
Date: Fri, 30 May 2025 15:16:20 +0200
From: "Günther Noack" <gnoack@...gle.com>
To: "Mickaël Salaün" <mic@...ikod.net>
Cc: "Günther Noack" <gnoack3000@...il.com>, Paul Moore <paul@...l-moore.com>, sergeh@...nel.org, 
	David Howells <dhowells@...hat.com>, Kees Cook <keescook@...omium.org>, 
	linux-security-module@...r.kernel.org, 
	Konstantin Meskhidze <konstantin.meskhidze@...wei.com>, Jann Horn <jannh@...gle.com>, 
	linux-kernel@...r.kernel.org, Peter Newman <peternewman@...gle.com>, 
	Andy Lutomirski <luto@...capital.net>, Will Drewry <wad@...omium.org>
Subject: Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()

On Sun, May 18, 2025 at 09:57:32PM +0200, Mickaël Salaün wrote:
> On Sun, May 18, 2025 at 09:40:05AM +0200, Günther Noack wrote:
> > On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> > > On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:

> > > > Approach 1: Use the creds API thread-by-thread (implemented here)
> > > > 
> > > >   * Each task calls prepare_creds() and commit_creds() on its own, in
> > > >     line with the way the API is designed to be used (from a single
> > > >     task).
> > > >   * Task work gets scheduled with a pseudo-signal and the task that
> > > >     invoked the syscall is waiting for all of them to return.
> > > >   * Task work can fail at the beginning due to prepare_creds(), in
> > > >     which case all tasks have to abort_creds(). Additional
> > > >     synchronization is needed for that.
> > > > 
> > > >   Drawback: We need to grab the system-global task lock to prevent new
> > > >   thread creation and also grab the per-process signal lock to prevent
> > > >   races with other creds accesses, for the entire time as we wait for
> > > >   each task to do the task work.
> > > 
> > > In other words, this approach blocks all threads from the same process.
> > 
> > It does, but that is still an improvement over the current
> > libpsx-based implementation in userspace.  That existing
> > implementation does not block, but it is running the risk that
> > prepare_creds() might fail on one of the threads (e.g. allocation
> > failure), which would leave the processes' threads in an inconsistent
> > state.
> > 
> > Another upside that the in-kernel implementation has is that the
> > implementation of that is hidden behind an API, so if we can
> > eventually find a better approach, we can migrate to it.  It gives us
> > flexibility.
> 
> > I guess a possible variant (approach 1B) would be to do the equivalent
> > to what userspace does today, and not make all threads wait for the
> > possible error of prepare_creds() on the other threads.
> 
> This 1B variant is not OK because it would remove the guarantee that the
> whole process is restricted.

👍 Agreed.


> > > > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > > > 
> > > >   * Would use an API similar to what keyctl uses for the
> > > >     parent-process update.
> > > >   * This side-steps the credentials update API as it is documented in
> > > >     Documentation, using the cred_alloc_blank() helper and replicating
> > > >     some prepare_creds() logic.
> > > > 
> > > >   Drawback: This would introduce another use of the cred_alloc_blank()
> > > >   API (and the cred_transfer LSM hook), which would otherwise be
> > > >   reasonable to delete if we can remove the keyctl use case.
> > > >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> > > 
> > > cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> > > a required property for this Landlock TSYNC feature (i.e. atomic and
> > > consistent synchronization).
> > 
> > Remark on the side, I suspect that the error handling in nptl(7)
> > probably also does not guarantee that, also for setuid(2) and friends.
> > 
> > 
> > > I think it would make sense to replace most of the
> > > key_change_session_keyring() code with a new cred_transfer() helper that
> > > will memcpy the old cred to the new, increment the appropriate ref
> > > counters, and call security_transfer_creds().  We could then use this
> > > helper in Landlock too.
> > > 
> > > To properly handle race conditions with a thread changing its own
> > > credentials, we would need a new LSM hook called by commit_creds().
> > > For the Landlock implementation, this hook would check if the process is
> > > being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> > > The newly created task_work would then be free to update each thread's
> > > credentials while only blocking the calling thread (which is also a
> > > required feature).
> > > 
> > > Alternatively, instead of a new LSM hook, commit_creds() could check
> > > itself a new group leader's flag set if all the credentials from the
> > > calling process are being updated, and return -ERESTARTNOINTR in this
> > > case.
> > 
> > commit_creds() is explicitly documented to never return errors.
> > It returns a 0 integer so that it lends itself for tail calls,
> > and some of those usages might also rely on it always working.
> > There are ~15 existing calls where the return value is discarded.
> 
> Indeed, commit_creds() should always return 0.  My full proposal does
> not look safe enough, but the cred_transfer() helper can still be
> useful.
> 
> > 
> > If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
> > that the task_work would retry the prepare-and-commit when
> > encountering that?
> > 
> > We would have to store the fact that the process is being
> > Landlock+TSYNC'd in a central place (e.g. group leader flag set).
> > When that is done, don't we need more synchronization mechanisms to
> > access that (which RCU was meant to avoid)?
> > 
> > I am having a hard time wrapping my head around these synchronization
> > schemes, I feel this is getting too complicated for what it is trying
> > to do and might become difficult to maintain if we implemented it.
> 
> Fair. ERESTARTNOINTR should only be used by a syscall implementation.
> 
> > 
> > > > Approach 3: Store Landlock domains outside of credentials altogether
> > > > 
> > > >   * We could also store a task's Landlock domain as a pointer in the
> > > >     per-task security blob, and refcount these.  We would need to make
> > > >     sure that they get newly referenced and updated in the same
> > > >     scenarios as they do within struct cred today.
> > > >   * We could then guard accesses to a task's Landlock domain with a
> > > >     more classic locking mechanism.  This would make it possible to
> > > >     update the Landlock domain of all tasks in a process without
> > > >     having to go through pseudo-signals.
> > > > 
> > > >   Drawbacks:
> > > >   * Would have to make sure that the Landlock domain the task's LSM
> > > >     blob behaves exactly the same as before in the struct cred.
> > > >   * Potentially slower to access Landlock domains that are guarded by
> > > >     a mutex.
> > > 
> > > This would not work because the kernel (including LSM hooks) uses
> > > credentials to check access.
> > 
> > It's unclear to me what you mean by that.
> > 
> > Do you mean that it is hard to replicate for Landlock the cases where
> > the pointer would have to be copied, because the LSM hooks are not
> > suited for it?
> 
> struct cred is used to check if a task subject can access a task object.
> Landlock's metadata must stay in struct cred to be available when
> checking access to any kernel object.  The LSM hooks reflect this
> rationale by only passing struct cred when checking a task (e.g.
> security_task_kill()'s cred).
> 
> seccomp only cares about filtering raw syscalls, and the seccomp filters
> are just ignored when the kernel (with an LSM or not) checks task's
> permission to access another task.
> 
> The per-task security blob could store some state though, e.g. to
> identify if a domain needs to be updated, but I don't see a use case
> here.

(Side remark on the idea of storing "pending domain updates" in the task blob:

I have pondered such an idea as well, where we do not store the Landlock domain
itself in the task blob, but only a "pending" update that we need to do to the
Landlock domain in creds, and then to apply that opportunistically/lazily as
part of other Landlock LSM calls.

I believe in this approach, it becomes hard to control whether that update can
actually ever get applied.  So to be sure, we would always have to run under the
assumption that it does not get applied, and then we might as well store the
Landlock domain directly in the task blob.

I also don't think this makes sense.)


> > Here is another possible approach which a colleague suggested in a
> > discussion:
> > 
> > Approach 4: Freeze-and re-enforce the Landlock ruleset
> > 
> > Another option would be to have a different user space API for this,
> > with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
> > domain.
> > 
> > On first usage of landlock_restrict_self() with the flag, the enforced
> > ruleset would be frozen and linked to the Landlock domain which was
> > enforced at the end.
> > 
> > Subsequent attempts to add rules to the ruleset would fail when the
> > ruleset is frozen.  The ruleset FD is now representing the created
> > domain including all its nesting.
> > 
> > Subsequent usages of landlock_restrict_self() on a frozen ruleset would:
> > 
> > (a) check that the ruleset's domain is a narrower (nested) domain of
> >     the current thread's domain (so that we retain the property of
> >     only locking in a task further than it was before).
> > 
> > (b) set the task's domain to the domain attached to the ruleset
> > 
> > This way, we would keep a per-thread userspace API, avoiding the
> > issues discussed before.  It would become possible to use ruleset file
> > descriptors as handles for entering Landlock domains and pass them
> > around between processes.
> > 
> > The only drawback I can see is that it has the same issues as libpsx
> > and nptl(7) in that the syscall can fail on individual threads due to
> > ENOMEM.
> 
> Right. This approach is interesting, but it does not solve the main
> issue here.

It doesn't?

In my mind, the main goal of the patch set is that we can enable Landlock in
multithreaded processes like in Go programs or in multithreaded C(++).

With Approach 4, we would admittedly still have to do some work in userspace,
and it would not have the nice all-or-nothing semantics, but at least, it would
be possible to get all threads joining the same Landlock domain.  (And after
all, setuid(0) also does not have the all-or-nothing semantics, from what I can
tell.)


> Anyway, being able to enter a Landlock domain would definitely be
> useful. I would prefer using a pidfd to refer to a task's Landlock
> domain, which would avoid race condition and make the API clearer.  It
> would be nice to be able to pass a pidfd (instead of a ruleset) to
> landlock_restrict_self().  If we want to directly deal with a domain, we
> should create a dedicated domain FD type.

Fair enough, a different FD type for that would also be possible.


> > If we can not find a solution for "TSYNC", it seems that this might be
> > a viable alternative.  For multithreaded applications enforcing a
> > Landlock policy, it would become an application of libpsx with the
> > LANDLOCK_RESTRICT_SELF_ENTER flag.
> > 
> > Let me know what you think.
> > 
> > –Günther
> 
> Thinking more about this feature, it might actually make sense to
> synchronize all threads from the same process without checking other
> threads' Landlock domain. The rationale are:
> 1. Linux threads are not security boundaries and it is allowed for a
>    thread to control other threads' memory, which means changing their
>    code flow.  In other words, thread's permissions are the union of all
>    thread's permissions in the same process.
> 2. libpsx and libc's set*id() ignore other thread's credentials and just
>    blindly execute the same code on all threads.
> 3. It would be simpler and would avoid another error case.

+1, agreed.  That would let us skip the check for the pre-existing domain on
these threads.


> An issue could happen if a Landlock domain restricting a test thread is
> replaced.

You mean for Landlock's selftests?  I thought these were running in their own
forked-off subprocess?  I'm probably misunderstanding you here. :)


> I don't think the benefit of avoiding this issue is worth it
> compared to the guarantee we get when forcing the sandboxing of a full
> process without error.
> 



> We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
> clear what it does.
> 
> The remaining issues are still the potential memory allocation failures.
> There are two things:
> 
> 1. We should try as much as possible to limit useless credential
>    duplications by not creating a new struct cred if parent credentials
>    are the same.
> 
> 2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
>    landlock_restrict_self(2) should handle memory allocation and
>    transition the process from a known state to another known state.
> 
> What about this approach:
> - "Freeze" all threads of the current process (not ideal but simple) to
>   make sure their credentials don't get updated.
> - Create a new blank credential for the calling thread.
> - Walk through all threads and create a new blank credential for all
>   threads with a different cred than the caller.
> - Inject a task work that will call cred_transfer() for all threads with
>   either the same new credential used by the caller (incrementing the
>   refcount), or it will populate and use a blank one if it has different
>   credentials than the caller.
> 
> This may not efficiently deduplicate credentials for all threads but it
> is a simple deduplication approach that should be useful in most cases.
> 
> The difficult part is mainly in the "fleezing". It would be nice to
> change the cred API to avoid that but I'm not sure how.

I don't see an option how we could freeze the credentials of other threads:

To freeze a task's credentials, we would have to inhibit that commit_creds()
succeeds on that task, and I don't see how that would be done - we can not
prevent these tasks from calling commit_creds() [1], and when commit_creds()
gets called, it is guaranteed to work.

So in my mind, we have to somehow deal with the possibility that a task has a
new and not-previously-seen struct creds, by the time that its task_work gets
called.  As a consequence, I think a call to prepare_creds() would then be
unavoidable in the task_work?


—Günther


[1] We might be able to keep cred_prepare() and maybe cred_alloc_blank() from
    succeeding, but that does not mean that no one can call commit_creds() -
    there is still the possibility that commit_creds() gets called with a struct
    cred* that was acquired before decided to freeze.