Date:   Fri, 23 Jul 2021 15:06:09 -0400
From:   Thierry Delisle <tdelisle@...terloo.ca>
To:     Peter Oskolkov <posk@...gle.com>
CC:     Peter Oskolkov <posk@...k.io>, Andrei Vagin <avagin@...gle.com>,
        Ben Segall <bsegall@...gle.com>, Jann Horn <jannh@...gle.com>,
        Jim Newsome <jnewsome@...project.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        <linux-api@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Paul Turner <pjt@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Buhr <pabuhr@...terloo.ca>
Subject: Re: [RFC PATCH 4/4 v0.3] sched/umcg: RFC: implement UMCG syscalls

 > In my tests reclaimed nodes have their next pointers immediately set
 > to point to the list head. If the kernel gets a node with its @next
 > pointing to something else, then yes, things break down (the kernel
 > kills the process); this has happened occasionally when I had a bug in
 > the userspace code.

I believe that approach is fine for production, but in testing it may
mask some bugs. For example, it would not catch the race I detail
below.
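
To illustrate why: once a reclaimed node's @next is reset to point back
at the head, a stale reader sees a perfectly plausible pointer, so the
kernel's sanity check passes and the bug stays silent. A debug-only
poison value would make the same bug fail loudly. This is my sketch,
not code from the patch:

      #include <assert.h>
      #include <stdint.h>

      #define NEXT_POISON ((uint64_t)0xdeadbeefdeadbeefull)

      /* Production-style reclaim: @next points back at the head, so a
       * stale reader sees a plausible value and nothing trips. */
      static void reclaim_node(uint64_t *node, uint64_t *head)
      {
          *node = (uint64_t)head;
      }

      /* Test-only reclaim: poison @next so any stale dereference is
       * caught immediately by the check below. */
      static void reclaim_node_debug(uint64_t *node)
      {
          *node = NEXT_POISON;
      }

      static uint64_t load_next_checked(const uint64_t *node)
      {
          uint64_t next = *node;

          assert(next != NEXT_POISON); /* fires on use-after-reclaim */
          return next;
      }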


 > Could you, please, provide a bit more details on when/how the race can
 > happen? Servers add themselves to the list, so there can be no races
 > there (servers going idle: add-to-the-list; wait; gc (under a lock);
 > restore @next; do stuff).

I believe the race can happen if the kernel attempts to wake an idle
server concurrently with a call to gc.

Here's an example list: the head, denoted 'H', points to three nodes
denoted A, B and C. The second character on each node shows whether it
is marked for deletion: 'x' means marked, 'o' means unmarked.

         H -> Ax -> Bo -> Co

Now, atomic_stack_gc is expected to unlink node 'A' so it can be
reclaimed, leading to "H -> Bo -> Co". Once 'A' is unlinked, it can
either be freed or pushed back onto the list at a later time.
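
For concreteness, the unlink step has roughly this shape. This is an
illustrative sketch with invented names; the real atomic_stack_gc and
node layout may differ, and per the description above it runs under a
lock:

      /* Sketch: caller holds the gc lock; 'A' is the marked node right
       * after the head. Assumes a node's first word is its @next value
       * and the mark lives in the low bit, which may not match the
       * actual layout. */
      static void gc_unlink_first(uint64_t *head)
      {
          uint64_t *a = (uint64_t *)*head;  /* node A (marked)         */
          uint64_t  b = *a & ~1ull;         /* A's @next, mark cleared */

          *head = b;                        /* now "H -> Bo -> Co"     */
          /* A is unlinked, but a concurrent pop that read H earlier
           * may still hold a pointer to A -- the race detailed below. */
      }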

In the following snippet of the atomic_stack_pop function (trimmed for
space):

      static inline uint64_t* atomic_stack_pop(uint64_t *head)
      {
          uint64_t *curr = (uint64_t *)atomic_load_explicit(head,
                  memory_order_acquire);

          do {
              if (!curr) return NULL;

              // <--- Pause
              uint64_t next = atomic_load_explicit(curr,
                      memory_order_acquire);

At the location marked "<--- Pause", assume the code has the address of
node 'A' and is about to dereference it to get the address of node 'B'. If
the thread running this code pauses for any reason, a different CPU can
run atomic_stack_gc and delete node 'A'. At that point, the pop function
resumes and dereferences a node that no longer exists.
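
As an interleaving, the failure looks like this:

      pop (CPU 0)                        gc (CPU 1)
      -----------                        ----------
      curr = load(head)   // == &A
                                         unlink A  (H -> Bo -> Co)
                                         reclaim A
      next = load(curr)   // use-after-reclaim of A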

The thread pause can have many causes; in user-space, preemption is the
most obvious, but hardware interrupts or even last-level cache misses can
cause enough of a slowdown for this to happen.

The fundamental problem is that nothing prevents multiple threads from
manipulating the list AND concurrently attempting to reclaim nodes. As
far as I know, solving this requires either locking or a lock-free
memory-reclamation scheme, where a node is first unlinked and then
reclaimed only once consensus is established that no thread can still
reach it. Both approaches are doable, but both require extra
communication between the kernel and user-space.
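
As one example of such a scheme, a hazard-pointer-style protocol makes
the consensus step explicit. The sketch below is mine (a single reader
slot, a spinning reclaimer, invented names) and is not a proposal for
the actual kernel/user ABI; the hazard slot itself is exactly the extra
communication mentioned above:

      #include <stdatomic.h>
      #include <stdint.h>
      #include <stdlib.h>

      /* One shared hazard slot; a real scheme needs one per thread. */
      static _Atomic(uint64_t *) hazard;

      /* Reader side: publish the node, then re-validate that it is
       * still the list head before dereferencing it. */
      static uint64_t *protect(_Atomic(uint64_t *) *head)
      {
          uint64_t *curr;

          do {
              curr = atomic_load(head);
              atomic_store(&hazard, curr);      /* announce intent  */
          } while (curr != atomic_load(head));  /* still reachable? */

          return curr;  /* safe to dereference until slot is cleared */
      }

      /* Reclaimer side: free only once no reader has the node
       * published. A real implementation would defer the free to a
       * retirement list instead of spinning. */
      static void reclaim(uint64_t *node)
      {
          while (atomic_load(&hazard) == node)
              ;  /* wait for consensus that nobody can reach it */

          free(node);
      }

A popper would call protect() before the load at the "<--- Pause" point
above, and clear the slot (store NULL to it) once it is done with the
node.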

In general, the kernel needs to be robust against misbehaving users and
their concurrent operations. Hence, I would be wary of any loop in the
kernel built around an atomic operation, since such loops rely on
cooperation among threads to avoid starvation.
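
Concretely, if the kernel must retry a CAS on user memory, it can at
least bound the retries so that forward progress never depends on
userspace behaving. This is a sketch of mine, with an arbitrary bound
and error-code choice:

      #include <stdatomic.h>
      #include <stdint.h>

      /* Sketch: give up after a fixed number of failed CAS attempts
       * rather than trusting userspace to let the loop terminate. */
      static int pop_bounded(_Atomic uint64_t *head, uint64_t *out)
      {
          for (int tries = 0; tries < 128; tries++) {
              uint64_t curr = atomic_load(head);

              if (!curr)
                  return -1;  /* list is empty */

              /* Note: this dereference still has the reclamation
               * problem discussed above; bounding the loop only
               * addresses starvation, not use-after-reclaim. */
              uint64_t next = *(uint64_t *)(uintptr_t)curr;

              if (atomic_compare_exchange_weak(head, &curr, next)) {
                  *out = curr;
                  return 0;
              }
          }

          return -2;  /* suspicious contention; treat as misbehavior */
      }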


 > Workers are trickier, as they can be woken by signals and then block
 > again, but stray signals are so bad here that I'm thinking of actually
 > not letting sleeping workers wake on signals. Other than signals
 > waking queued/unqueued idle workers, are there any other potential
 > races here?

Timeouts on blocked threads are virtually the same as signals, I think.
I can see both leading to attempts to wake workers that are not
blocked.

I haven't looked closely at the state transitions yet, but I think the
guarantees offered by the two lists will result in potential races in
the rest of the code.

Thierry
