linux-kernel - Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250518.xeevoom3kieY@digikod.net>
Date: Sun, 18 May 2025 21:57:32 +0200
From: Mickaël Salaün <mic@...ikod.net>
To: Günther Noack <gnoack3000@...il.com>
Cc: Paul Moore <paul@...l-moore.com>, sergeh@...nel.org, 
	David Howells <dhowells@...hat.com>, Kees Cook <keescook@...omium.org>, 
	linux-security-module@...r.kernel.org, Konstantin Meskhidze <konstantin.meskhidze@...wei.com>, 
	Jann Horn <jannh@...gle.com>, linux-kernel@...r.kernel.org, 
	Peter Newman <peternewman@...gle.com>, Andy Lutomirski <luto@...capital.net>, 
	Will Drewry <wad@...omium.org>
Subject: Re: [RFC 1/2] landlock: Multithreading support for
 landlock_restrict_self()

On Sun, May 18, 2025 at 09:40:05AM +0200, Günther Noack wrote:
> Hello!
> 
> On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> > On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:
> > > On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > > > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > > > Hello!
> > > > > 
> > > > > Thanks for the review!
> > > > > 
> > > > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > > > help us and suggest a appropriate way to update the struct cred across
> > > > > multiple threads?
> > > 
> > > Paul and Serge, since you are volunteering to take ownership of
> > > credentials, maybe you can advise on what is the best approach here?
> > > 
> > > To summarize the approaches that I have been discussing with Mickaël:
> > > 
> > > Approach 1: Use the creds API thread-by-thread (implemented here)
> > > 
> > >   * Each task calls prepare_creds() and commit_creds() on its own, in
> > >     line with the way the API is designed to be used (from a single
> > >     task).
> > >   * Task work gets scheduled with a pseudo-signal and the task that
> > >     invoked the syscall is waiting for all of them to return.
> > >   * Task work can fail at the beginning due to prepare_creds(), in
> > >     which case all tasks have to abort_creds(). Additional
> > >     synchronization is needed for that.
> > > 
> > >   Drawback: We need to grab the system-global task lock to prevent new
> > >   thread creation and also grab the per-process signal lock to prevent
> > >   races with other creds accesses, for the entire time as we wait for
> > >   each task to do the task work.
> > 
> > In other words, this approach blocks all threads from the same process.
> 
> It does, but that is still an improvement over the current
> libpsx-based implementation in userspace.  That existing
> implementation does not block, but it is running the risk that
> prepare_creds() might fail on one of the threads (e.g. allocation
> failure), which would leave the processes' threads in an inconsistent
> state.
> 
> Another upside that the in-kernel implementation has is that the
> implementation of that is hidden behind an API, so if we can
> eventually find a better approach, we can migrate to it.  It gives us
> flexibility.



> 
> I guess a possible variant (approach 1B) would be to do the equivalent
> to what userspace does today, and not make all threads wait for the
> possible error of prepare_creds() on the other threads.

This 1B variant is not OK because it would remove the guarantee that the
whole process is restricted.

> 
> 
> > > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > > 
> > >   * Would use an API similar to what keyctl uses for the
> > >     parent-process update.
> > >   * This side-steps the credentials update API as it is documented in
> > >     Documentation, using the cred_alloc_blank() helper and replicating
> > >     some prepare_creds() logic.
> > > 
> > >   Drawback: This would introduce another use of the cred_alloc_blank()
> > >   API (and the cred_transfer LSM hook), which would otherwise be
> > >   reasonable to delete if we can remove the keyctl use case.
> > >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> > 
> > cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> > a required property for this Landlock TSYNC feature (i.e. atomic and
> > consistent synchronization).
> 
> Remark on the side, I suspect that the error handling in nptl(7)
> probably also does not guarantee that, also for setuid(2) and friends.
> 
> 
> > I think it would make sense to replace most of the
> > key_change_session_keyring() code with a new cred_transfer() helper that
> > will memcpy the old cred to the new, increment the appropriate ref
> > counters, and call security_transfer_creds().  We could then use this
> > helper in Landlock too.
> > 
> > To properly handle race conditions with a thread changing its own
> > credentials, we would need a new LSM hook called by commit_creds().
> > For the Landlock implementation, this hook would check if the process is
> > being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> > The newly created task_work would then be free to update each thread's
> > credentials while only blocking the calling thread (which is also a
> > required feature).
> > 
> > Alternatively, instead of a new LSM hook, commit_creds() could check
> > itself a new group leader's flag set if all the credentials from the
> > calling process are being updated, and return -ERESTARTNOINTR in this
> > case.
> 
> commit_creds() is explicitly documented to never return errors.
> It returns a 0 integer so that it lends itself for tail calls,
> and some of those usages might also rely on it always working.
> There are ~15 existing calls where the return value is discarded.

Indeed, commit_creds() should always return 0.  My full proposal does
not look safe enough, but the cred_transfer() helper can still be
useful.

> 
> If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
> that the task_work would retry the prepare-and-commit when
> encountering that?
> 
> We would have to store the fact that the process is being
> Landlock+TSYNC'd in a central place (e.g. group leader flag set).
> When that is done, don't we need more synchronization mechanisms to
> access that (which RCU was meant to avoid)?
> 
> I am having a hard time wrapping my head around these synchronization
> schemes, I feel this is getting too complicated for what it is trying
> to do and might become difficult to maintain if we implemented it.

Fair. ERESTARTNOINTR should only be used by a syscall implementation.

> 
> > > Approach 3: Store Landlock domains outside of credentials altogether
> > > 
> > >   * We could also store a task's Landlock domain as a pointer in the
> > >     per-task security blob, and refcount these.  We would need to make
> > >     sure that they get newly referenced and updated in the same
> > >     scenarios as they do within struct cred today.
> > >   * We could then guard accesses to a task's Landlock domain with a
> > >     more classic locking mechanism.  This would make it possible to
> > >     update the Landlock domain of all tasks in a process without
> > >     having to go through pseudo-signals.
> > > 
> > >   Drawbacks:
> > >   * Would have to make sure that the Landlock domain the task's LSM
> > >     blob behaves exactly the same as before in the struct cred.
> > >   * Potentially slower to access Landlock domains that are guarded by
> > >     a mutex.
> > 
> > This would not work because the kernel (including LSM hooks) uses
> > credentials to check access.
> 
> It's unclear to me what you mean by that.
> 
> Do you mean that it is hard to replicate for Landlock the cases where
> the pointer would have to be copied, because the LSM hooks are not
> suited for it?

struct cred is used to check if a task subject can access a task object.
Landlock's metadata must stay in struct cred to be available when
checking access to any kernel object.  The LSM hooks reflect this
rationale by only passing struct cred when checking a task (e.g.
security_task_kill()'s cred).

seccomp only cares about filtering raw syscalls, and the seccomp filters
are just ignored when the kernel (with an LSM or not) checks task's
permission to access another task.

The per-task security blob could store some state though, e.g. to
identify if a domain needs to be updated, but I don't see a use case
here.

> 
> 
> Here is another possible approach which a colleague suggested in a
> discussion:
> 
> Approach 4: Freeze-and re-enforce the Landlock ruleset
> 
> Another option would be to have a different user space API for this,
> with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
> domain.
> 
> On first usage of landlock_restrict_self() with the flag, the enforced
> ruleset would be frozen and linked to the Landlock domain which was
> enforced at the end.
> 
> Subsequent attempts to add rules to the ruleset would fail when the
> ruleset is frozen.  The ruleset FD is now representing the created
> domain including all its nesting.
> 
> Subsequent usages of landlock_restrict_self() on a frozen ruleset would:
> 
> (a) check that the ruleset's domain is a narrower (nested) domain of
>     the current thread's domain (so that we retain the property of
>     only locking in a task further than it was before).
> 
> (b) set the task's domain to the domain attached to the ruleset
> 
> This way, we would keep a per-thread userspace API, avoiding the
> issues discussed before.  It would become possible to use ruleset file
> descriptors as handles for entering Landlock domains and pass them
> around between processes.
> 
> The only drawback I can see is that it has the same issues as libpsx
> and nptl(7) in that the syscall can fail on individual threads due to
> ENOMEM.

Right. This approach is interesting, but it does not solve the main
issue here.

Anyway, being able to enter a Landlock domain would definitely be
useful. I would prefer using a pidfd to refer to a task's Landlock
domain, which would avoid race condition and make the API clearer.  It
would be nice to be able to pass a pidfd (instead of a ruleset) to
landlock_restrict_self().  If we want to directly deal with a domain, we
should create a dedicated domain FD type.

> 
> If we can not find a solution for "TSYNC", it seems that this might be
> a viable alternative.  For multithreaded applications enforcing a
> Landlock policy, it would become an application of libpsx with the
> LANDLOCK_RESTRICT_SELF_ENTER flag.
> 
> Let me know what you think.
> 
> –Günther

Thinking more about this feature, it might actually make sense to
synchronize all threads from the same process without checking other
threads' Landlock domain. The rationale are:
1. Linux threads are not security boundaries and it is allowed for a
   thread to control other threads' memory, which means changing their
   code flow.  In other words, thread's permissions are the union of all
   thread's permissions in the same process.
2. libpsx and libc's set*id() ignore other thread's credentials and just
   blindly execute the same code on all threads.
3. It would be simpler and would avoid another error case.

An issue could happen if a Landlock domain restricting a test thread is
replaced.  I don't think the benefit of avoiding this issue is worth it
compared to the guarantee we get when forcing the sandboxing of a full
process without error.

We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
clear what it does.

The remaining issues are still the potential memory allocation failures.
There are two things:

1. We should try as much as possible to limit useless credential
   duplications by not creating a new struct cred if parent credentials
   are the same.

2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
   landlock_restrict_self(2) should handle memory allocation and
   transition the process from a known state to another known state.

What about this approach:
- "Freeze" all threads of the current process (not ideal but simple) to
  make sure their credentials don't get updated.
- Create a new blank credential for the calling thread.
- Walk through all threads and create a new blank credential for all
  threads with a different cred than the caller.
- Inject a task work that will call cred_transfer() for all threads with
  either the same new credential used by the caller (incrementing the
  refcount), or it will populate and use a blank one if it has different
  credentials than the caller.

This may not efficiently deduplicate credentials for all threads but it
is a simple deduplication approach that should be useful in most cases.

The difficult part is mainly in the "fleezing". It would be nice to
change the cred API to avoid that but I'm not sure how.