linux-kernel - Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict

Message-ID: <20250518.be040c48937c@gnoack.org>
Date: Sun, 18 May 2025 09:40:05 +0200
From: Günther Noack <gnoack3000@...il.com>
To: Mickaël Salaün <mic@...ikod.net>
Cc: Paul Moore <paul@...l-moore.com>, sergeh@...nel.org,
	David Howells <dhowells@...hat.com>,
	Kees Cook <keescook@...omium.org>,
	linux-security-module@...r.kernel.org,
	Konstantin Meskhidze <konstantin.meskhidze@...wei.com>,
	Jann Horn <jannh@...gle.com>, linux-kernel@...r.kernel.org,
	Peter Newman <peternewman@...gle.com>
Subject: Re: [RFC 1/2] landlock: Multithreading support for
 landlock_restrict_self()

Hello!

On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:
> > On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > > Hello!
> > > > 
> > > > Thanks for the review!
> > > > 
> > > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > > help us and suggest an appropriate way to update the struct cred across
> > > > multiple threads?
> > 
> > Paul and Serge, since you are volunteering to take ownership of
> > credentials, maybe you can advise on what is the best approach here?
> > 
> > To summarize the approaches that I have been discussing with Mickaël:
> > 
> > Approach 1: Use the creds API thread-by-thread (implemented here)
> > 
> >   * Each task calls prepare_creds() and commit_creds() on its own, in
> >     line with the way the API is designed to be used (from a single
> >     task).
> >   * Task work gets scheduled with a pseudo-signal and the task that
> >     invoked the syscall is waiting for all of them to return.
> >   * Task work can fail at the beginning due to prepare_creds(), in
> >     which case all tasks have to abort_creds(). Additional
> >     synchronization is needed for that.
> > 
> >   Drawback: We need to grab the system-global task lock to prevent new
> >   thread creation and also grab the per-process signal lock to prevent
> >   races with other creds accesses, for the entire time as we wait for
> >   each task to do the task work.
> 
> In other words, this approach blocks all threads from the same process.

It does, but that is still an improvement over the current
libpsx-based implementation in userspace.  That existing
implementation does not block, but it runs the risk that
prepare_creds() might fail on one of the threads (e.g. allocation
failure), which would leave the process's threads in an inconsistent
state.

Another upside of the in-kernel implementation is that it is hidden
behind an API, so if we eventually find a better approach, we can
migrate to it without changing userspace.  It gives us flexibility.

I guess a possible variant (approach 1B) would be to do the equivalent
of what userspace does today, and not make all threads wait for a
possible prepare_creds() error on the other threads.
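
To make the difference between approach 1 and approach 1B concrete,
here is a minimal userspace sketch.  It is not kernel code: threads
are modeled as array entries that are processed sequentially, and
struct fake_task, fake_prepare(), fake_commit() and fake_abort() are
hypothetical stand-ins for the per-task state and for prepare_creds(),
commit_creds() and abort_creds().  The task_work scheduling and the
locking are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one thread's credential state. */
struct fake_task {
	int domain;        /* currently enforced domain id */
	int pending;       /* prepared, not yet committed domain id */
	bool has_pending;  /* true between prepare and commit/abort */
	bool fail_prepare; /* simulate an allocation failure in prepare */
};

/* Phase 1: models prepare_creds(), which may fail with -ENOMEM. */
static int fake_prepare(struct fake_task *t, int new_domain)
{
	if (t->fail_prepare)
		return -ENOMEM;
	t->pending = new_domain;
	t->has_pending = true;
	return 0;
}

/* Models abort_creds(): drop the prepared state, keep the old domain. */
static void fake_abort(struct fake_task *t)
{
	t->has_pending = false;
}

/* Phase 2: models commit_creds(), which cannot fail. */
static void fake_commit(struct fake_task *t)
{
	assert(t->has_pending);
	t->domain = t->pending;
	t->has_pending = false;
}

/* Approach 1: either all tasks switch to new_domain, or none do. */
static int sync_all_or_nothing(struct fake_task *tasks, size_t n,
			       int new_domain)
{
	size_t i, prepared = 0;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = fake_prepare(&tasks[i], new_domain);
		if (err)
			break;
		prepared++;
	}
	if (err) {
		/* One prepare failed: everyone who prepared must abort. */
		for (i = 0; i < prepared; i++)
			fake_abort(&tasks[i]);
		return err;
	}
	for (i = 0; i < n; i++)
		fake_commit(&tasks[i]);
	return 0;
}

/* Approach 1B: each task prepares and commits on its own. */
static int sync_best_effort(struct fake_task *tasks, size_t n,
			    int new_domain)
{
	size_t i;
	int err, first_err = 0;

	for (i = 0; i < n; i++) {
		err = fake_prepare(&tasks[i], new_domain);
		if (err) {
			if (!first_err)
				first_err = err;
			continue; /* this task keeps its old domain */
		}
		fake_commit(&tasks[i]);
	}
	return first_err;
}
```

With one failing task out of three, sync_all_or_nothing() leaves all
three tasks in their old domain, whereas sync_best_effort() leaves the
failing task behind in the old domain while the others have moved on.
That second outcome is the inconsistent state mentioned above.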


> > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > 
> >   * Would use an API similar to what keyctl uses for the
> >     parent-process update.
> >   * This side-steps the credentials update API as it is documented in
> >     Documentation, using the cred_alloc_blank() helper and replicating
> >     some prepare_creds() logic.
> > 
> >   Drawback: This would introduce another use of the cred_alloc_blank()
> >   API (and the cred_transfer LSM hook), which would otherwise be
> >   reasonable to delete if we can remove the keyctl use case.
> >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> 
> cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> a required property for this Landlock TSYNC feature (i.e. atomic and
> consistent synchronization).

As a side remark, I suspect that the error handling in nptl(7) does
not guarantee that either, including for setuid(2) and friends.


> I think it would make sense to replace most of the
> key_change_session_keyring() code with a new cred_transfer() helper that
> will memcpy the old cred to the new, increment the appropriate ref
> counters, and call security_transfer_creds().  We could then use this
> helper in Landlock too.
> 
> To properly handle race conditions with a thread changing its own
> credentials, we would need a new LSM hook called by commit_creds().
> For the Landlock implementation, this hook would check if the process is
> being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> The newly created task_work would then be free to update each thread's
> credentials while only blocking the calling thread (which is also a
> required feature).
> 
> Alternatively, instead of a new LSM hook, commit_creds() could check
> itself a new group leader's flag set if all the credentials from the
> calling process are being updated, and return -ERESTARTNOINTR in this
> case.

commit_creds() is explicitly documented to never return errors.
It returns a 0 integer so that it lends itself to tail calls,
and some of those usages might also rely on it always working.
There are ~15 existing calls where the return value is discarded.
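
For reference, the calling pattern I have in mind looks roughly like
the following.  This is a userspace mock and not kernel code: struct
fake_cred and the fake_*() helpers are hypothetical stand-ins that
only mirror the shape of prepare_creds(), abort_creds() and
commit_creds().  The point is the last line of fake_set_uid(): errors
are only possible up to the abort path, and the commit is a tail call
whose result is assumed to be 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified credential object. */
struct fake_cred {
	int uid;
};

static struct fake_cred initial_cred = { .uid = 1000 };
static struct fake_cred *current_cred = &initial_cred;

/* Models prepare_creds(): copy the current creds, may fail. */
static struct fake_cred *fake_prepare_creds(void)
{
	struct fake_cred *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	*new = *current_cred;
	return new;
}

/* Models abort_creds(): discard a prepared but uncommitted copy. */
static void fake_abort_creds(struct fake_cred *new)
{
	free(new);
}

/* Models commit_creds(): install the new creds, always returns 0. */
static int fake_commit_creds(struct fake_cred *new)
{
	if (current_cred != &initial_cred)
		free(current_cred);
	current_cred = new;
	return 0;
}

static int fake_set_uid(int uid)
{
	struct fake_cred *new = fake_prepare_creds();

	if (!new)
		return -ENOMEM;
	if (uid < 0) {
		/* All failures happen before the commit. */
		fake_abort_creds(new);
		return -EINVAL;
	}
	new->uid = uid;
	/* Tail call: callers rely on this never being an error. */
	return fake_commit_creds(new);
}
```

If the commit step could start returning -ERESTARTNOINTR, every caller
written in this style would need to be audited, including the ones
that drop the return value.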

If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
that the task_work would retry the prepare-and-commit when
encountering that?

We would have to store the fact that the process is being
Landlock+TSYNC'd in a central place (e.g. group leader flag set).
When that is done, don't we need more synchronization mechanisms to
access that (which RCU was meant to avoid)?

I am having a hard time wrapping my head around these synchronization
schemes.  I feel this is getting too complicated for what it is trying
to do, and it might become difficult to maintain if we implemented it.

> > Approach 3: Store Landlock domains outside of credentials altogether
> > 
> >   * We could also store a task's Landlock domain as a pointer in the
> >     per-task security blob, and refcount these.  We would need to make
> >     sure that they get newly referenced and updated in the same
> >     scenarios as they do within struct cred today.
> >   * We could then guard accesses to a task's Landlock domain with a
> >     more classic locking mechanism.  This would make it possible to
> >     update the Landlock domain of all tasks in a process without
> >     having to go through pseudo-signals.
> > 
> >   Drawbacks:
> >   * Would have to make sure that the Landlock domain in the task's LSM
> >     blob behaves exactly the same as before in the struct cred.
> >   * Potentially slower to access Landlock domains that are guarded by
> >     a mutex.
> 
> This would not work because the kernel (including LSM hooks) uses
> credentials to check access.

It's unclear to me what you mean by that.

Do you mean that it is hard to replicate for Landlock the cases where
the pointer would have to be copied, because the LSM hooks are not
suited for it?


Here is another possible approach which a colleague suggested in a
discussion:

Approach 4: Freeze and re-enforce the Landlock ruleset

Another option would be to have a different user space API for this,
with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
domain.

On first usage of landlock_restrict_self() with the flag, the enforced
ruleset would be frozen and linked to the Landlock domain that ends up
being enforced as a result.

Subsequent attempts to add rules to the ruleset would fail when the
ruleset is frozen.  The ruleset FD is now representing the created
domain including all its nesting.

Subsequent usages of landlock_restrict_self() on a frozen ruleset would:

(a) check that the ruleset's domain is a narrower (nested) domain of
    the current thread's domain (so that we retain the property of
    only locking in a task further than it was before).

(b) set the task's domain to the domain attached to the ruleset
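
As a rough illustration of these semantics, here is a small userspace
model.  It is only a sketch under assumptions: the fake_*() names, the
struct layouts and the error codes (-EBUSY, -EPERM) are made up for
this example and do not correspond to existing Landlock code.  I also
assume that entering the domain a task is already in is allowed, and
that a task without a domain (NULL) may enter any domain.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical domain: only the nesting hierarchy is modeled. */
struct fake_domain {
	const struct fake_domain *parent;
};

/* Hypothetical ruleset that owns the domain created from it. */
struct fake_ruleset {
	bool frozen;
	int num_rules;
	struct fake_domain domain;
};

struct fake_task {
	const struct fake_domain *domain; /* NULL: not restricted */
};

/* True if child is ancestor itself or nested somewhere below it. */
static bool fake_domain_is_nested_in(const struct fake_domain *child,
				     const struct fake_domain *ancestor)
{
	if (!ancestor)
		return true;
	for (; child; child = child->parent)
		if (child == ancestor)
			return true;
	return false;
}

/* Adding rules is refused once the ruleset is frozen. */
static int fake_add_rule(struct fake_ruleset *rs)
{
	if (rs->frozen)
		return -EBUSY;
	rs->num_rules++;
	return 0;
}

/* Models landlock_restrict_self() with the proposed ENTER flag. */
static int fake_restrict_self_enter(struct fake_task *task,
				    struct fake_ruleset *rs)
{
	if (!rs->frozen) {
		/* First use: nest under the caller's domain and freeze. */
		rs->domain.parent = task->domain;
		rs->frozen = true;
	} else if (!fake_domain_is_nested_in(&rs->domain, task->domain)) {
		/* (a) only allow moving into a narrower domain */
		return -EPERM;
	}
	/* (b) the task now runs in the ruleset's domain */
	task->domain = &rs->domain;
	return 0;
}
```

In this model a second thread of the same process can enter the
domain through the same ruleset, while a task that already sits in an
unrelated domain is rejected by check (a).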

This way, we would keep a per-thread userspace API, avoiding the
issues discussed before.  It would become possible to use ruleset file
descriptors as handles for entering Landlock domains and pass them
around between processes.

The only drawback I can see is that it has the same issues as libpsx
and nptl(7) in that the syscall can fail on individual threads due to
ENOMEM.

If we cannot find a solution for "TSYNC", it seems that this might be
a viable alternative.  Multithreaded applications enforcing a Landlock
policy would then use libpsx to call landlock_restrict_self() with the
LANDLOCK_RESTRICT_SELF_ENTER flag on each thread.

Let me know what you think.

–Günther