Message-ID: <CAG48ez0pWg3OTABfCKRk5sWrURM-HdJhQMcWedEppc_z1rrVJw@mail.gmail.com>
Date: Tue, 27 May 2025 16:26:40 +0200
From: Jann Horn <jannh@...gle.com>
To: Günther Noack <gnoack3000@...il.com>
Cc: Paul Moore <paul@...l-moore.com>, Mickaël Salaün <mic@...ikod.net>, 
	sergeh@...nel.org, David Howells <dhowells@...hat.com>, Kees Cook <keescook@...omium.org>, 
	linux-security-module@...r.kernel.org, 
	Konstantin Meskhidze <konstantin.meskhidze@...wei.com>, linux-kernel@...r.kernel.org, 
	Peter Newman <peternewman@...gle.com>
Subject: Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()

On Mon, Mar 10, 2025 at 2:04 PM Günther Noack <gnoack3000@...il.com> wrote:
> Hello Paul and Serge!
>
> On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > Hello!
> > >
> > > Thanks for the review!
> > >
> > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > help us and suggest an appropriate way to update the struct cred across
> > > multiple threads?
>
> Paul and Serge, since you are volunteering to take ownership of
> credentials, maybe you can advise on what is the best approach here?
>
> To summarize the approaches that I have been discussing with Mickaël:
>
> Approach 1: Use the creds API thread-by-thread (implemented here)
>
>   * Each task calls prepare_creds() and commit_creds() on its own, in
>     line with the way the API is designed to be used (from a single
>     task).
>   * Task work gets scheduled with a pseudo-signal, and the task that
>     invoked the syscall waits for all of them to return.
>   * Task work can fail at the beginning due to prepare_creds(), in
>     which case all tasks have to abort_creds(). Additional
>     synchronization is needed for that.
>
>   Drawback: We need to grab the system-global task lock to prevent new
>   thread creation and also grab the per-process signal lock to prevent
>   races with other creds accesses, for the entire time that we wait for
>   each task to do the task work.

The tasklist_lock and the siglock can't be held while we're waiting;
they're not sleepable locks. Also, copy_process() currently copies
creds to the new task with copy_creds() without holding any locks yet.

I think one way to implement this option without introducing new locks
in core kernel code would be to have the thread initiating the
credential change do this in three separate steps:

 - loop over the other threads under RCU (for searching for new
threads to send task work to)
 - allocate state for threads (outside RCU, in sleepable context)
 - wait until all threads are in task work


I guess that might look roughly like this:


struct landlock_sync_state {
  atomic_t num_threads_entered;
  atomic_t num_threads_done;
  bool abort;
  wait_queue_head_t wq_for_initiator;
  wait_queue_head_t wq_for_others;
};
struct landlock_tw_state {
  struct task_struct *task;
  struct callback_head work;
  struct landlock_sync_state *shared;
};

size_t num_threads_signalled = 0;
size_t num_threads_allocated = 0;
size_t num_threads_to_allocate;
struct landlock_tw_state **thread_works = NULL;
struct landlock_sync_state state; /* wait queues set up with init_waitqueue_head() */
struct task_struct *thread;
bool all_threads_signalled = false;
while (true) {
  /* scan for threads we haven't sent task work to yet */
  rcu_read_lock();
  num_threads_to_allocate = 0;
  for_each_thread(current, thread) {
    /*
     * this PF_EXITING check is a bit dodgy but not much worse than
     * what seccomp already does
     */
    if (thread == current || (READ_ONCE(thread->flags) & PF_EXITING))
      continue;

    for (int i=0; i<num_threads_signalled; i++) {
      if (thread_works[i]->task == thread)
        goto next_thread;
    }

    all_threads_signalled = false;
    if (num_threads_allocated > num_threads_signalled) {
      struct landlock_tw_state *ltws = thread_works[num_threads_signalled];

      ltws->task = thread;
      ltws->shared = &state;
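      /* ltws->work would also need init_task_work() with the per-thread
       * task-work callback before being queued here */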
      if (task_work_add(thread, &ltws->work, TWA_SIGNAL) == 0)
        num_threads_signalled++;
    } else {
      num_threads_to_allocate++;
    }

next_thread:;
  }
  rcu_read_unlock();

  if (all_threads_signalled)
    break; /* success */

  /*
   * If we need state for more threads, allocate it now and immediately retry.
   * Normally we only need to go through this path once, more if we race with
   * clone().
   */
  if (num_threads_to_allocate) {
    size_t new_count = num_threads_allocated + num_threads_to_allocate;
    struct landlock_tw_state **reallocd = kvrealloc(thread_works,
        size_mul(sizeof(struct landlock_tw_state *), new_count),
        GFP_KERNEL);
    if (!reallocd)
      goto bailout;
    thread_works = reallocd;
    /* allocate the per-thread state that the new slots point to */
    for (; num_threads_allocated < new_count; num_threads_allocated++) {
      thread_works[num_threads_allocated] =
          kzalloc(sizeof(struct landlock_tw_state), GFP_KERNEL);
      if (!thread_works[num_threads_allocated])
        goto bailout;
    }
    continue;
  }

  /* wait for all the threads we signalled */
  if (wait_event_interruptible(state.wq_for_initiator,
          atomic_read(&state.num_threads_entered) == num_threads_signalled))
    goto bailout;

  /*
   * Now loop at least once more to check that none of the threads called
   * clone() in the meantime.
   */
  all_threads_signalled = true;
}

/* success */

/* ... coordinate the cred prepare and commit across threads here ... */

bailout:
for (int i=0; i<num_threads_signalled; i++) {
  if (task_work_cancel(thread_works[i]->task, &thread_works[i]->work))
    atomic_inc(&state.num_threads_done);
}
WRITE_ONCE(state.abort, true);
wake_up(&state.wq_for_others);
wait_event(state.wq_for_initiator,
           atomic_read(&state.num_threads_done) == num_threads_signalled);
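
For completeness, the task work callback that runs on each of the other
threads might then look something like the sketch below - assuming an extra
"bool commit" flag in struct landlock_sync_state that the initiating thread
sets (before waking wq_for_others) once all threads have checked in and
nobody flagged an abort; the function name and the Landlock-specific cred
update are placeholders:

static void landlock_restrict_self_twork(struct callback_head *work)
{
  struct landlock_tw_state *ltws =
      container_of(work, struct landlock_tw_state, work);
  struct landlock_sync_state *state = ltws->shared;
  struct cred *new_cred = prepare_creds();

  if (!new_cred)
    WRITE_ONCE(state->abort, true);

  /* tell the initiator we've arrived, then wait for its verdict
   * (->commit is the assumed extra flag mentioned above) */
  atomic_inc(&state->num_threads_entered);
  wake_up(&state->wq_for_initiator);
  wait_event(state->wq_for_others,
             READ_ONCE(state->abort) || READ_ONCE(state->commit));

  if (new_cred && !READ_ONCE(state->abort)) {
    /* ... install the new Landlock domain on new_cred here ... */
    commit_creds(new_cred);
  } else if (new_cred) {
    abort_creds(new_cred);
  }

  atomic_inc(&state->num_threads_done);
  wake_up(&state->wq_for_initiator);
}

With that, the "coordinate" step above would boil down to checking
state.abort once num_threads_entered matches, then setting either commit or
abort and waking wq_for_others.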


This would be somewhat complex, but at least it would keep the
complexity isolated in the code for updating creds across tasks, so I
think this would be reasonable.
