Message-ID: <20250228.b3794e33d5c0@gnoack.org>
Date: Fri, 28 Feb 2025 18:33:55 +0100
From: Günther Noack <gnoack3000@...il.com>
To: Mickaël Salaün <mic@...ikod.net>
Cc: Kees Cook <keescook@...omium.org>, Paul Moore <paul@...l-moore.com>,
	linux-security-module@...r.kernel.org,
	Konstantin Meskhidze <konstantin.meskhidze@...wei.com>,
	Jann Horn <jannh@...gle.com>, linux-kernel@...r.kernel.org,
	Peter Newman <peternewman@...gle.com>,
	David Howells <dhowells@...hat.com>
Subject: Re: [RFC 1/2] landlock: Multithreading support for
 landlock_restrict_self()

Hello!

Thanks for the review!

I'm adding David Howells to this thread as well.  David, maybe you can
help us and suggest an appropriate way to update the struct cred
across multiple threads?

On Thu, Feb 27, 2025 at 09:53:11PM +0100, Mickaël Salaün wrote:
> On Fri, Feb 21, 2025 at 07:44:17PM +0100, Günther Noack wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> 
> I initially thought that we might pick another flag name, but I now
> think reusing the same name as for seccomp would be less confusing for
> users.

+1, under the assumption that we implement a "syncing" behaviour here,
I also think that the name is good.


> > diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> > index a9760d252fc2d..63792a6cde5ca 100644
> > --- a/security/landlock/syscalls.c
> > +++ b/security/landlock/syscalls.c

> > +static struct cred *
> > +landlock_prepare_creds(const struct landlock_ruleset *old_dom,
> > +		       struct landlock_ruleset *new_dom)
> > +{
> > +	struct cred *new_cred;
> > +	struct landlock_cred_security *new_llcred;
> > +
> > +	/*
> > +	 * Similar checks as for seccomp(2), except that an -EPERM may be
> > +	 * returned.
> > +	 */
> > +	if (!task_no_new_privs(current) &&
> > +	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > +		return ERR_PTR(-EPERM);
> 
> This should only be done for the calling thread, so just reusing the
> existing code from landlock_restrict_self().
> 
> If the no_new_privs property is set for the current thread, it would
> make sense to just set it for all other threads once we are sure we can
> sandbox all of them (see seccomp_sync_threads).

(See discussion below regarding NO_NEW_PRIVS)


> For this we need to have two for_each_thread loops: one to check that
> the current thread is legitimate to sandbox all the other threads and
> make sure no error could arise (e.g. memory allocation), and another
> loop to safely enforce the restrictions if the first one was OK.
> 
> The handling of the following prepare_cred()'s failure is the main
> issue, but see my proposal below.
> 
> > +
> > +	/* Prepares new credentials. */
> > +	new_cred = prepare_creds();
> > +	if (!new_cred)
> > +		return ERR_PTR(-ENOMEM);
> 
> Instead of creating one new credential per thread, we should use the
> newly created one (from landlock_restrict_self) and increment its
> references.  In most cases that would avoid duplicating information and
> wasting memory by creating one struct cred per thread.

struct cred contains all kinds of other credentials as well, besides
Landlock and NO_NEW_PRIVS. *In normal scenarios* I believe that the
struct cred is the same for all threads, but this is not guaranteed.

We can make it safe by requiring that all threads must point to the
same struct cred, as you are suggesting below.  But how are users
supposed to reason about whether all threads share the same struct
cred?  To my knowledge, it is an implementation detail of the kernel.
We can't expect callers to understand what properties belong to struct
cred, and that list of properties can potentially change in the future
as well.
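
For illustration, struct cred carries (among other things; abridged
from include/linux/cred.h, with most fields elided):

  struct cred {
          kuid_t          uid, euid, suid, fsuid;
          kgid_t          gid, egid, sgid, fsgid;
          kernel_cap_t    cap_effective;  /* ...and the other capability sets */
          struct key      *session_keyring;  /* ...and the other keyrings */
          struct user_namespace *user_ns;
          void            *security;  /* LSM blobs, incl. the Landlock domain */
          /* ... */
  };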

This is why in the proposal I sent, the only thing that matters about
the preexisting struct cred is the Landlock domain which is stored in
it.  This is a lower bar to meet and seems more appropriate for this
Landlock-specific syscall.


> > +	new_llcred = landlock_cred(new_cred);
> > +	if (new_llcred->domain != old_dom) {
> > +		abort_creds(new_cred);
> > +		return ERR_PTR(-ESRCH);
> 
> To avoid dealing with prepare_creds()'s error, I think we could just
> check that all threads have (initially) the same credentials, and either
> swap this same old credential with the new one (created by the caller),
> or swap the Landlock domain (and set no_new_privs) if the credentials
> changed between the initial check and now.

This would be working around the API documented in
Documentation/security/credentials.rst, where the affected task should
do its own prepare_creds()/abort_creds()/commit_creds() dance.

After seeing how it turned out in the keyctl case with an additional
LSM hook, I was hoping to use the suggested API and do things as
intended here.  prepare_creds() and commit_creds() do more than just
switch pointers and refcounts; it is not clear to me that
prepare_creds() on one task and commit_creds() on another would work
with the same struct cred.
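
For reference, the pattern documented there for a task changing its
*own* credentials is, as a minimal sketch (the error condition is a
hypothetical placeholder):

          struct cred *new = prepare_creds();  /* copy current's creds */
          if (!new)
                  return -ENOMEM;
          /* ... modify the copy, e.g. swap in the new Landlock domain ... */
          if (error) {
                  abort_creds(new);  /* discard the unused copy */
                  return error;
          }
          return commit_creds(new);  /* install on current; cannot fail */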


> Actually it might be up to
> three loops:
> 
> (lock cred_guard_mutex)

FWIW, Peter Newman (CC'd) also pointed me at the tasklist lock, which
is used in seccomp as well.  I believe we would have to grab it
*before* starting to iterate over the list of tasks and release it
only *after* committing the last credentials, so that no new tasks
which might still inherit the old credentials can be spawned in the
middle of the operation.  (The drawback is that it is a global lock.)
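
For the iteration itself, that would look roughly like this (a sketch
only; the per-thread context allocation is elided and would have to
move out from under the lock or use GFP_ATOMIC, since tasklist_lock
is a non-sleeping rwlock, whereas task_work_add() itself does not
sleep):

          read_lock(&tasklist_lock);  /* no threads spawn or exit meanwhile */
          for_each_thread(caller, thread)
                  task_work_add(thread, &ctx->work, TWA_SIGNAL);
          read_unlock(&tasklist_lock);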


> 1. Check that all threads use the same credential, or abort the whole
>    operation otherwise.
> 
> 2. If the caller has no_new_privs (and passed the Landlock checks), set
>    no_new_privs to all threads.

I am still conflicted about this.  It seems like too implicit a way to
set NO_NEW_PRIVS and might surprise callers.  After all, we do not set
it in the single-threaded landlock_restrict_self() case either.
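
(For reference, seccomp_sync_threads() does do this implicit
propagation for each thread:

          /* From kernel/seccomp.c, seccomp_sync_threads(): */
          if (task_no_new_privs(caller))
                  task_set_no_new_privs(thread);

so there is precedent, but Landlock would be adopting it silently.)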


> 3. Run a work_task on each thread that will either:
>    - if old thread's credential is still the same as the caller's
>      credentials, swap it with the new one with get_cred()/put_cred();
>    - otherwise, this means that the thread changed its credential after
>      the first loop but because we set no_new_privs, this alternative
>      credential should not give more privileges and we should be safe to
>      only swap the Landlock domain with the new one (if different).
>      Indeed, this alternative credential might be used by other threads
>      from the same process, but AFAIK not by threads from other
>      processes thanks to cred_guard_mutex.  This would require to bypass
>      the const pointer though.  Holding cred_guard_mutex will also make
>      sure that domains were not updated under our feet by another call
>      to landlock_restrict_self().

Hmm, in the second case we'd still have to use the
prepare_creds()/abort_creds()/commit_creds() API with all of its
potential failure modes.  Or we would have to swizzle out the contents
of struct cred directly, but that would again be side-stepping that
API.

I would prefer not to side-step that cred-updating API, given how it
turned out in the keyctl case.  If it is absolutely necessary to
change struct cred differently, I would rather introduce a new
"official" way to change struct cred than to work around it from the
outside without David Howells' knowledge.  But I'd prefer to do it
with the existing update API, if we can.


> (unlock cred_guard_mutex)
> 
> Another approach would be to leverage an LSM hook to deny credential
> changes for a thread while we are changing them for the whole process,
> but I'm not sure it would be better.
> 
> Kees, Paul, Jann, any better idea?
> 
> > +	}
> > +
> > +	/* Replaces the old (prepared) domain. */
> > +	landlock_put_ruleset(new_llcred->domain);
> > +	landlock_get_ruleset(new_dom);
> > +	new_llcred->domain = new_dom;
> > +
> > +	return new_cred;
> > +}
> > +
> > +/*
> > + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> > + * other threads in the same process.
> > + *
> > + * When this is run, the same function gets run in all other threads in the same
> > + * process.  The concurrently running function invocations coordinate to wait
> > + * until they are all done with the preparation step and have reported back
> > + * errors from the preparation step, if necessary.
> > + *
> > + * Afterwards, depending on the presence of an error, all threads either commit
> > + * or abort the prepared credentials.  The commit operation can not fail any more.
> > + */
> > +static void restrict_one_thread(struct landlock_tsync_shared_context *ctx)
> > +{
> > +	int res;
> > +	struct cred *cred = landlock_prepare_creds(ctx->old_dom, ctx->new_dom);
> > +
> > +	/* On error, set the error and continue. (Do not return early.) */
> > +	if (IS_ERR(cred))
> > +		atomic_set(&ctx->preparation_error, PTR_ERR(cred));
> > +
> > +	/*
> > +	 * Barrier: Wait until all threads are done preparing.
> > +	 * After this point, we can have no more failures.
> > +	 */
> > +	if (atomic_dec_return(&ctx->num_preparing) == 0)
> > +		complete_all(&ctx->all_prepared);
> > +	wait_for_completion(&ctx->all_prepared);
> 
> The logic is OK but the side effect would be to block all threads, which
> is an issue.  That's why we need at least two loops run by the calling
> thread, and the other threads should not have to wait.

If we do "prepare" and "commit" in two different loops, we'd still
need to start a task_work for the threads in the "commit" loop so
that they update their own struct cred.  And if a struct cred changed
between the two loops, what do we do?  If we want all-or-nothing
semantics, we'd still have to check for modifications in the task_work
and synchronize with a barrier between all tasks, right?  Doesn't that
have the same problems?  (Wait-time-wise, I would also expect the
"landlock_prepare_creds" part to be quicker than waiting for the
"all_prepared" barrier, so this doesn't seem to make a big
difference?)

> 
> > +
> > +	/*
> > +	 * Abort the commit if any of the threads had an error.
> > +	 * (It might also be this thread.)  Otherwise, commit.
> > +	 */
> > +	res = atomic_read(&ctx->preparation_error);
> > +	if (res) {
> > +		if (!IS_ERR(cred))
> > +			abort_creds(cred);
> > +	} else {
> > +		commit_creds(cred);
> > +	}
> > +
> > +	/* Notify the calling thread once all threads are done */
> > +	if (atomic_dec_return(&ctx->num_unfinished) == 0)
> > +		complete_all(&ctx->all_finished);
> > +}
> > +
> > +/*
> > + * restrict_one_thread_callback - task_work callback for restricting a thread
> > + *
> > + * Calls restrict_one_thread with the struct landlock_shared_tsync_context
> > + * and frees up the per-work_task landlock_tsync_context afterwards.
> > + */
> > +static void restrict_one_thread_callback(struct callback_head *work)
> > +{
> > +	struct landlock_tsync_context *ctx =
> > +		container_of(work, struct landlock_tsync_context, work);
> > +	struct landlock_tsync_shared_context *sctx = ctx->ctx;
> > +
> > +	restrict_one_thread(sctx);
> > +	kfree(ctx);
> > +}
> > +
> > +/*
> > + * restrict_all_threads - enables a Landlock policy for all threads
> > + */
> > +static int restrict_all_threads(const int ruleset_fd)
> > +{
> > +	int res;
> > +	struct task_struct *thread, *caller;
> > +	struct landlock_tsync_shared_context sctx;
> > +	struct landlock_ruleset *ruleset;
> > +
> > +	/*
> > +	 * XXX: Need to double check race conditions and deadlocks before
> > +	 * merging this upstream. We probably need additional locking.
> > +	 */
> > +
> > +	/* Starting with 1, as we're already counting current. */
> > +	atomic_set(&sctx.num_preparing, 1);
> > +	atomic_set(&sctx.num_unfinished, 1);
> > +	init_completion(&sctx.all_prepared);
> > +	init_completion(&sctx.all_finished);
> > +	atomic_set(&sctx.preparation_error, 0);
> > +	sctx.old_dom = landlock_get_current_domain();
> > +
> > +	/* Gets and checks the ruleset. */
> > +	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> > +	if (IS_ERR(ruleset))
> > +		return PTR_ERR(ruleset);
> > +
> > +	/*
> > +	 * We pre-merge the domain for all threads,
> > +	 * so that all threads end up with exactly the same domain.
> > +	 */
> > +	sctx.new_dom = landlock_merge_ruleset(sctx.old_dom, ruleset);
> > +	landlock_put_ruleset(ruleset);
> > +	if (IS_ERR(sctx.new_dom))
> > +		return PTR_ERR(sctx.new_dom);
> 
> What is specific to this function should only start here.  The flag
> checks (new ones are coming with the audit patch series), the ruleset
> FD, the prepare_creds(), and the landlock_merge_ruleset() should be the
> same.  We can initialize shared states from here.

I'm not sure I fully understand; is the suggestion that I reorder the
work at the top of restrict_all_threads() so that the parts which are
also done in the single-threaded case can be factored out and done in
the calling function?
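
I.e., hypothetically something like this in the syscall body, with
restrict_all_threads() changed to take the pre-merged domain instead
of the ruleset FD (a sketch only):

          /* Common prologue, shared by both paths: */
          ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
          if (IS_ERR(ruleset))
                  return PTR_ERR(ruleset);
          new_dom = landlock_merge_ruleset(landlock_get_current_domain(),
                                           ruleset);
          landlock_put_ruleset(ruleset);
          if (IS_ERR(new_dom))
                  return PTR_ERR(new_dom);

          if (flags & LANDLOCK_RESTRICT_SELF_TSYNC)
                  return restrict_all_threads(new_dom);
          /* ... existing single-threaded commit path, reusing new_dom ... */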

> 
> > +
> > +	landlock_get_ruleset(sctx.old_dom);
> > +
> > +	caller = current;
> > +	for_each_thread(caller, thread) {
> > +		/* Skip current, since it is initiating the sync. */
> > +		if (thread == caller)
> > +			continue;
> > +
> > +		/* Skip exited threads. */
> > +		if (thread->flags & PF_EXITING)
> > +			continue;
> > +
> > +		/* Deallocation is done by the task_work itself. */
> > +		struct landlock_tsync_context *ctx =
> > +			kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
> > +		if (!ctx) {
> > +			/*
> > +			 * On error, keep already started threads from
> > +			 * accidentally committing.  Do not start additional
> > +			 * threads.
> > +			 */
> > +			atomic_set(&sctx.preparation_error, -ENOMEM);
> > +			break;
> > +		}
> > +
> > +		ctx->ctx = &sctx;
> > +		atomic_inc(&sctx.num_preparing);
> > +		atomic_inc(&sctx.num_unfinished);
> > +		init_task_work(&ctx->work, restrict_one_thread_callback);
> > +		res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > +		if (res) {
> > +			atomic_set(&sctx.preparation_error, res);
> > +			atomic_dec(&sctx.num_preparing);
> > +			atomic_dec(&sctx.num_unfinished);
> > +		}
> > +	}
> > +
> > +	/* and do the same on the current thread */
> > +	restrict_one_thread(&sctx);
> > +
> > +	res = atomic_read(&sctx.preparation_error);
> > +	wait_for_completion(&sctx.all_finished);
> > +	landlock_put_ruleset(sctx.new_dom);
> > +	landlock_put_ruleset(sctx.old_dom);
> > +	return res;
> > +}
> > +
> >  /**
> >   * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
> >   *
> >   * @ruleset_fd: File descriptor tied to the ruleset to merge with the target.
> > - * @flags: Must be 0.
> > + * @flags: Flags to modify the behavior.
> >   *
> > - * This system call enables to enforce a Landlock ruleset on the current
> > - * thread.  Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> > + * This system call enforces a Landlock ruleset on the current thread.
> > + * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> >   * namespace or is running with no_new_privs.  This avoids scenarios where
> >   * unprivileged tasks can affect the behavior of privileged children.
> >   *
> > + * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, the ruleset will be
> > + * applied to all threads of the current process.  For this, all threads must be
> > + * in the same Landlock domain and fulfill the normal requirements for enforcing
> > + * a Landlock ruleset.
> > + *
> >   * Possible returned errors are:
> >   *
> >   * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> > @@ -447,6 +664,8 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
> >   *   %CAP_SYS_ADMIN in its namespace.
> >   * - %E2BIG: The maximum number of stacked rulesets is reached for the current
> >   *   thread.
> > + * - %ESRCH: When called with %LANDLOCK_RESTRICT_SELF_TSYNC, the process's
> > + *   threads were in different Landlock domains.
> >   */
> >  SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >  		flags)
> > @@ -467,10 +686,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >  	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  
> > -	/* No flag for now. */
> > -	if (flags)
> > +	/* TSYNC is the only supported flag. */
> > +	if (flags & ~LANDLOCK_RESTRICT_SELF_TSYNC)
> >  		return -EINVAL;
> >  
> > +	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC)
> > +		return restrict_all_threads(ruleset_fd);
> > +
> 
> As explained above, restrict_all_threads()'s logic should be called
> later.
> 
> Starting here, as seccomp does, we can lock siglock (whatever flags are
> used) to be sure there is not concurrent call to
> landlock_restrict_self() on other threads of the same process.
> 
> If LANDLOCK_RESTRICT_SELF_TSYNC is set, we also need to lock
> cred_guard_mutex to avoid race conditions with concurrent execve calls.

I was confused about this lock, but found the remark in the seccomp
logic again as well, which says:

  /*
   * Make sure we cannot change seccomp or nnp state via TSYNC
   * while another thread is in the middle of calling exec.
   */

In my understanding, cred_guard_mutex gets locked in the early stages
of an execve(), while the "bprm" is being prepared.  If I understand
this correctly, this is the scenario where another thread has started
running execve() (and is about to replace the entire process), but at
the same time still looks like a normal thread that belongs to the
process.  Is that right?
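
If so, I suppose we would mirror seccomp here, roughly like this
(sketch; error value for the killed case chosen arbitrarily):

          if ((flags & LANDLOCK_RESTRICT_SELF_TSYNC) &&
              mutex_lock_killable(&current->signal->cred_guard_mutex))
                  return -EINTR;  /* only interruptible by fatal signals */
          /* ... sandbox all threads ... */
          mutex_unlock(&current->signal->cred_guard_mutex);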

> >  	/* Gets and checks the ruleset. */
> >  	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> >  	if (IS_ERR(ruleset))
> > -- 
> > 2.48.1
> > 
> > 

Thanks for the review and for keeping the discussion at the more
conceptual level for now!

–Günther
