linux-kernel - Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170608201148.GA2553@templeofstupid.com>
Date:   Thu, 8 Jun 2017 13:11:48 -0700
From:   Krister Johansen <kjlx@...pleofstupid.com>
To:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:     linux-kernel@...r.kernel.org, mingo@...nel.org,
        jiangshanlai@...il.com, dipankar@...ibm.com,
        akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
        josh@...htriplett.org, tglx@...utronix.de, peterz@...radead.org,
        rostedt@...dmis.org, dhowells@...hat.com, edumazet@...gle.com,
        fweisbec@...il.com, oleg@...hat.com, bobby.prani@...il.com,
        stable@...r.kernel.org, gregkh@...uxfoundation.org
Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB
 leader wakeup

Hi Paul,

On Thu, May 25, 2017 at 02:59:18PM -0700, Paul E. McKenney wrote:
> Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> either locking or memory barriers are required.  This commit therefore
> adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> 
> Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> ---
>  kernel/rcu/tree_plugin.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b1042545116..573fbe9640a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
>  	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
>  		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
>  		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> +		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
>  		swake_up(&rdp_leader->nocb_wq);
>  	}
>  }
> @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  	 * nocb_gp_head, where they await a grace period.
>  	 */
>  	gotcbs = false;
> +	smp_mb(); /* wakeup before ->nocb_head reads. */
>  	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
>  		rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
>  		if (!rdp->nocb_gp_head)

May I impose upon you to CC this patch to stable, and tag it as fixing
abedf8e241?  I ran into this on a production 4.9 branch.  When I
debugged it, I discovered that it went all the way back to 4.6.  The
tl;dr is that at least for some environments, the missed wakeup
manifests itself as a series of hung-task warnings to console and if I'm
unlucky it can also generate a hang that can block interactive logins
via ssh.

In case it's useful, I'm attching the RCA that I sent out before I
circulated a similar patch for internal review.  You beat me to the
punch in terms of getting a patch out a public mailing list.

--- >8 cut here 8< ---

Production boxes go through fits where they generate spurts of hung task
warnings.  Sometimes, the slowdown is enough to ensnare a process
holding the rtnl_lock, or prevent systemd from letting ssh logins on to
the box.

Below is a splat from a relatively benign ocurrence:

   INFO: task uc-spawn:34443 blocked for more than 120 seconds.
         Not tainted 4.9.4-1.el7.x86_64 #1
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   uc-spawn        D    0 34443      1 0x00000082
    ffff883fec88e4c0 0000000000000000 ffff883ff1b4ac00 ffff883ffddd9540
    ffff883fee999600 ffffc9001c907c30 ffffffff8175e3f6 0000000000000202
    ffff883fee999600 7fffffffffffffff ffff883fee999600 7fffffffffffffff
   Call Trace:
    [<ffffffff8175e3f6>] ? __schedule+0x1a6/0x650
    [<ffffffff8175e8d6>] schedule+0x36/0x80
    [<ffffffff81761aec>] schedule_timeout+0x21c/0x3a0
    [<ffffffff8175f3b2>] wait_for_completion+0xf2/0x130
    [<ffffffff810b6710>] ? wake_up_q+0x80/0x80
    [<ffffffff810eec27>] __wait_rcu_gp+0xd7/0x100
    [<ffffffff810f264e>] synchronize_sched+0x5e/0x80
    [<ffffffff810f1da0>] ? __call_rcu+0x320/0x320
    [<ffffffff810eeb30>] ? trace_raw_output_rcu_utilization+0x60/0x60
    [<ffffffff8125673b>] kern_unmount+0x2b/0x40
    [<ffffffff812d6f85>] mq_put_mnt+0x15/0x20
    [<ffffffff812d76de>] put_ipc_ns+0x3e/0xa0
    [<ffffffff810ab768>] free_nsproxy+0x38/0x90
    [<ffffffff810ab8c9>] switch_task_namespaces+0x59/0x60
    [<ffffffff810ab8e0>] exit_task_namespaces+0x10/0x20
    [<ffffffff8108d787>] do_exit+0x2b7/0xac0
    [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
    [<ffffffff8108e00f>] do_group_exit+0x3f/0xb0
    [<ffffffff8108e094>] SyS_exit_group+0x14/0x20
    [<ffffffff81003a47>] do_syscall_64+0x67/0x180
    [<ffffffff8176306b>] entry_SYSCALL64_slow_path+0x25/0x25

In order to debug this further, a kdump was obtained.

crash> ps 34443
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  34443      1  33  ffff883fee999600  UN   0.0       0      0  uc-spawn

crash> bt 34443
PID: 34443  TASK: ffff883fee999600  CPU: 33  COMMAND: "uc-spawn"
 #0 [ffffc9001c907bb8] __schedule at ffffffff8175e3f6
 #1 [ffffc9001c907c38] schedule at ffffffff8175e8d6
 #2 [ffffc9001c907c50] schedule_timeout at ffffffff81761aec
 #3 [ffffc9001c907cf0] wait_for_completion at ffffffff8175f3b2
 #4 [ffffc9001c907d50] __wait_rcu_gp at ffffffff810eec27
 #5 [ffffc9001c907da0] synchronize_sched at ffffffff810f264e
 #6 [ffffc9001c907df0] kern_unmount at ffffffff8125673b
 #7 [ffffc9001c907e08] mq_put_mnt at ffffffff812d6f85
 #8 [ffffc9001c907e18] put_ipc_ns at ffffffff812d76de
 #9 [ffffc9001c907e30] free_nsproxy at ffffffff810ab768
#10 [ffffc9001c907e48] switch_task_namespaces at ffffffff810ab8c9
#11 [ffffc9001c907e70] exit_task_namespaces at ffffffff810ab8e0
#12 [ffffc9001c907e80] do_exit at ffffffff8108d787
#13 [ffffc9001c907ee8] do_group_exit at ffffffff8108e00f
#14 [ffffc9001c907f18] sys_exit_group at ffffffff8108e094
#15 [ffffc9001c907f28] do_syscall_64 at ffffffff81003a47
    RIP: 00007f6b6e1f4209  RSP: 00007ffc488028b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000001  RCX: 00007f6b6e1f4209
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000001
    RBP: 00007f6b6e4ea840   R8: 000000000000003c   R9: 00000000000000e7
    R10: fffffffffffffe90  R11: 0000000000000246  R12: 00007f6b6e4ea840
    R13: 00007f6b6e4efe80  R14: 00007ffc48802918  R15: 00007ffc48802998
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

This particular process is blocked in its exit path, waiting for the
synchronize_sched() call it invoked to complete.  This function queues a
call_rcu_sched() invocation with a callback that triggers a wakeup of
the invoking process.  In order for this process to become unblocked,
a grace period must elapse, and a the callback must be called.

Typically, bugs like this imply that some processor failed to yield, or
some thread is holding rcu_read_lock_sched(), and hence preventing a
grace period from occurring.  However, after dilligent inspection I can
find no evidence of such a thing occurring.  After dumping all stacks,
no threads are in call frames that hold this type of read-side critical
section.  All cpus appear to be mostly idle and responding to interrupt
events.  The rcu threads themselves are not blocked waiting for
completion of a grace period either.

A quick look at rcu_sched_state shows nothing obviously asmiss:

crash> p rcu_sched_state
rcu_sched_state = $1 = {
  node = {{
      lock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      },
      gpnum = 22469944,
      completed = 22469944,
      qsmask = 0,
      qsmaskinit = 3,
      qsmaskinitnext = 0,
      expmask = 0,
      expmaskinit = 3,
      expmaskinitnext = 0,
      grpmask = 0,
      grplo = 0,
      grphi = 71,
      grpnum = 0 '\000',
      level = 0 '\000',
      wait_blkd_tasks = false,
      parent = 0x0,
      blkd_tasks = {
        next = 0xffffffff81c70428 <rcu_sched_state+104>,
        prev = 0xffffffff81c70428 <rcu_sched_state+104>
      },
      gp_tasks = 0x0,
      exp_tasks = 0x0,
      boost_tasks = 0x0,
      <...>

Note that according to this, the most recent gp completed was numbered
22469944.

Now here's the rcu_data for CPU 33, where the process got stuck.  It's
at least 15 grace periods behind.

crash> struct rcu_data 0xffff883ffddda380
struct rcu_data {
  completed = 22469929,
  gpnum = 22469930,
  rcu_qs_ctr_snap = 18203094,
  cpu_no_qs = {
    b = {
      norm = 0 '\000',
      exp = 0 '\000'
    },
    s = 0
  },
  core_needs_qs = true,
  beenonline = true,
  gpwrap = false,
  mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
  grpmask = 8589934592,
  ticks_this_gp = 0,
  nxtlist = 0x0,
  nxttail = {0x0, 0x0, 0x0, 0x0},
  nxtcompleted = {0, 0, 0, 0},
  qlen_lazy = 0,
  qlen = 0,
  qlen_last_fqs_check = 0,
  n_cbs_invoked = 0,
  n_nocbs_invoked = 10626703,
  n_cbs_orphaned = 0,
  n_cbs_adopted = 0,
  n_force_qs_snap = 0,
  blimit = 10,
  dynticks = 0xffff883ffddcffd0,
  dynticks_snap = 28482762,
  dynticks_fqs = 19371,
  offline_fqs = 0,
  cond_resched_completed = 22421984,
  n_rcu_pending = 2265627,
  n_rp_core_needs_qs = 31314,
  n_rp_report_qs = 595411,
  n_rp_cb_ready = 0,
  n_rp_cpu_needs_gp = 757,
  n_rp_gp_completed = 367886,
  n_rp_gp_started = 38620,
  n_rp_nocb_defer_wakeup = 126,
  n_rp_need_nothing = 1262827,
  barrier_head = {
    next = 0x0,
    func = 0xffffffff810f0090 <rcu_barrier_callback>
  },
  exp_workdone0 = {
    counter = 21444
  },
  exp_workdone1 = {
    counter = 0
  },
  exp_workdone2 = {
    counter = 0
  },
  exp_workdone3 = {
    counter = 0
  },
  nocb_head = 0xffff881f8444a138,
  nocb_tail = 0xffff881f9f8ee8d0,
  nocb_q_count = {
    counter = 500
  },
  nocb_q_count_lazy = {
    counter = 18
  },
  nocb_follower_head = 0x0,
  nocb_follower_tail = 0xffff883ffddda508,
  nocb_wq = {
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    },
    task_list = {
      next = 0xffffc900193e3e80,
      prev = 0xffffc900193e3e80
    }
  },
  nocb_kthread = 0xffff883ff0861600,
  nocb_defer_wakeup = 0,
  nocb_gp_head = 0xffff883fe18c71a0,
  nocb_gp_tail = 0xffffc9001c7a3da0,
  nocb_leader_sleep = false,
  nocb_next_follower = 0xffff883ffde1a380,
  nocb_leader = 0xffff883ffdd9a380,
  softirq_snap = 1002642,
  cpu = 33,
  rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

Notice also that the nocb_q_count is 500.  With a bit of looking we can
see that the blocked process has its callback queued in both the
nocb_gp_head list and the nocb_head list.  I'll spare readers all of the
output, but the correct linkage was located by walking the list entries
and then grepping for the correct pointer.

crash> struct completion 0xffffc9001c907dc0
struct completion {
  done = 0,
  wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffffc9001c907d10,
      prev = 0xffffc9001c907d10
    }
  }
}

crash> struct rcu_synchronize 0xffffc9001c907db0
struct rcu_synchronize {
  head = {
    next = 0xffff883fe4d18398,
    func = 0xffffffff810eeb30 <wakeme_after_rcu>
  },
  completion = {
    done = 0,
    wait = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      },
      task_list = {
        next = 0xffffc9001c907d10,
        prev = 0xffffc9001c907d10
      }
    }
  }
}

This is the stack for CPU 33's nocb kthread.  It's sleeping.

crash> bt 0xffff883ff0861600
PID: 279    TASK: ffff883ff0861600  CPU: 39  COMMAND: "rcuos/33"
 #0 [ffffc900193e3db0] __schedule at ffffffff8175e3f6
 #1 [ffffc900193e3e30] schedule at ffffffff8175e8d6
 #2 [ffffc900193e3e48] rcu_nocb_kthread at ffffffff810f2a67
 #3 [ffffc900193e3ec8] kthread at ffffffff810a9e19
 #4 [ffffc900193e3f50] ret_from_fork at ffffffff81763215

Since CPU 33 is a follower in this configuration, it needs to be awoken
by a leader.  In this case CPU 32 is the leader assigned to wake CPU
33's nocb_kthread.  It's also sleeping.

crash> bt 0xffff883ff0805800
PID: 271    TASK: ffff883ff0805800  CPU: 38  COMMAND: "rcuos/32"
 #0 [ffffc900193a3db0] __schedule at ffffffff8175e3f6
 #1 [ffffc900193a3e30] schedule at ffffffff8175e8d6
 #2 [ffffc900193a3e48] rcu_nocb_kthread at ffffffff810f2ad5
 #3 [ffffc900193a3ec8] kthread at ffffffff810a9e19
 #4 [ffffc900193a3f50] ret_from_fork at ffffffff81763215

Here's a red flag: The leader is asleep despite nocb_leader_sleep being
false. The code in nocb_leader_wait() has an swait_event_interruptible()
that's supposed to only let this thread wait when nocb_leader_sleep is
true.

crash> struct rcu_data.nocb_leader_sleep 0xffff883ffdd9a380
  nocb_leader_sleep = false

The leader's wq contains active entries, so something is sleeping here:

crash> struct rcu_data.nocb_wq 0xffff883ffdd9a380
  nocb_wq = {
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    },
    task_list = {
      next = 0xffffc900193a3e80,
      prev = 0xffffc900193a3e80
    }
  }

crash> struct -o rcu_data.nocb_wq 0xffff883ffdd9a380
struct rcu_data {
  [ffff883ffdd9a518] struct swait_queue_head nocb_wq;
}

crash> struct list_head 0xffffc900193a3e80
struct list_head {
  next = 0xffff883ffdd9a520,
  prev = 0xffff883ffdd9a520
}

This is the waitq of the callback parked by the process that was
stranded on cpu 33.  It's also correctly configured.

crash> struct wait_queue_t 0xffffc9001c907cf8
struct wait_queue_t {
  flags = 1,
  private = 0xffff883fee999600,
  func = 0xffffffff810b6710 <default_wake_function>,
  task_list = {
    next = 0xffffc9001c907dd0,
    prev = 0xffffc9001c907dd0
  }
}
crash> ps -m 34443
[ 0 00:11:07.318] [UN]  PID: 34443  TASK: ffff883fee999600  CPU: 33  COMMAND: "uc-spawn"
crash> ps -m 0xffff883ff0861600
[ 0 00:11:44.271] [IN]  PID: 279    TASK: ffff883ff0861600  CPU: 39  COMMAND: "rcuos/33"
crash> ps -m ffff883ff0805800
[ 0 00:11:44.208] [IN]  PID: 271    TASK: ffff883ff0805800  CPU: 38  COMMAND: "rcuos/32"

Looking at the PS data, we can see that our processes have been stranded
for more than 11 minutes.  The rcuos threads for 37 seconds longer than
pid 34443.

The reason this appears to be a transient hang is due to the code in
__call_rcu_nocb_enqueue().  We seem to be missing the initial wakeup,
and once that happens, we'll only send another wakeup if nocb_q_count is
greater than qlen_last_fqs_check + qhimark.  In our case, we need to
accumulate 10,000 callbacks before we get another bite at this.  Since
these cpus are relatively idle, we've only accrued 500 on the follower
and ~2000 on the leader.  Leader's full state if below:

crash> struct rcu_data 0xffff883ffdd9a380
struct rcu_data {
  completed = 22469935,
  gpnum = 22469935,
  rcu_qs_ctr_snap = 17067139,
  cpu_no_qs = {
    b = {
      norm = 0 '\000',
      exp = 0 '\000'
    },
    s = 0
  },
  core_needs_qs = false,
  beenonline = true,
  gpwrap = false,
  mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
  grpmask = 4294967296,
  ticks_this_gp = 1,
  nxtlist = 0x0,
  nxttail = {0x0, 0x0, 0x0, 0x0},
  nxtcompleted = {18446744073709551526, 18446744073709551526, 18446744073709551526, 0},
  qlen_lazy = 0,
  qlen = 0,
  qlen_last_fqs_check = 0,
  n_cbs_invoked = 0,
  n_nocbs_invoked = 10730040,
  n_cbs_orphaned = 0,
  n_cbs_adopted = 0,
  n_force_qs_snap = 0,
  blimit = 10,
  dynticks = 0xffff883ffdd8ffd0,
  dynticks_snap = 29638122,
  dynticks_fqs = 19616,
  offline_fqs = 0,
  cond_resched_completed = 22467956,
  n_rcu_pending = 2314079,
  n_rp_core_needs_qs = 30976,
  n_rp_report_qs = 654184,
  n_rp_cb_ready = 0,
  n_rp_cpu_needs_gp = 813,
  n_rp_gp_completed = 369917,
  n_rp_gp_started = 37049,
  n_rp_nocb_defer_wakeup = 135,
  n_rp_need_nothing = 1251981,
  barrier_head = {
    next = 0x0,
    func = 0xffffffff810f0090 <rcu_barrier_callback>
  },
  exp_workdone0 = {
    counter = 20577
  },
  exp_workdone1 = {
    counter = 0
  },
  exp_workdone2 = {
    counter = 0
  },
  exp_workdone3 = {
    counter = 0
  },
  nocb_head = 0xffff883ff0f6f700,
  nocb_tail = 0xffff881f9f8ed2d0,
  nocb_q_count = {
    counter = 2029
  },
  nocb_q_count_lazy = {
    counter = 71
  },
  nocb_follower_head = 0x0,
  nocb_follower_tail = 0xffff883ffdd9a508,
  nocb_wq = {
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    },
    task_list = {
      next = 0xffffc900193a3e80,
      prev = 0xffffc900193a3e80
    }
  },
  nocb_kthread = 0xffff883ff0805800,
  nocb_defer_wakeup = 0,
  nocb_gp_head = 0x0,
  nocb_gp_tail = 0xffff883feb6c4d00,
  nocb_leader_sleep = false,
  nocb_next_follower = 0xffff883ffddda380,
  nocb_leader = 0xffff883ffdd9a380,
  softirq_snap = 1061978,
  cpu = 32,
  rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

The nocb path can miss a wakeup sent to the master. If the system is
fairly idle it can take a substantial amount of time in order for
qhimark to be reached so that we kick the leader again.  It might be
possible to tune qhimark, but this is intellectually unsatisfying,
especially since that tunable has use outside of the nocb path.

Instead, it looks like in the conversion to simple wait queues the
sematics of swake_up differ slightly from wake_up and the rcu_nocb code
failed to make a necessary adjustment.

Here's the wakeup path:

static void wake_nocb_leader(struct rcu_data *rdp, bool force)
{
        struct rcu_data *rdp_leader = rdp->nocb_leader;

        if (!READ_ONCE(rdp_leader->nocb_kthread))
                return;
        if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
                /* Prior smp_mb__after_atomic() orders against prior enqueue. */
                WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
                swake_up(&rdp_leader->nocb_wq);
        }
}

And here's swake_up iteslf:

void swake_up(struct swait_queue_head *q)
{
        unsigned long flags;

        if (!swait_active(q))
                return;

        raw_spin_lock_irqsave(&q->lock, flags);
        swake_up_locked(q);
        raw_spin_unlock_irqrestore(&q->lock, flags);
}

It's important to note that swait_active() isn't called with any locks
held so we're in a situation where the write to nocb_leader_sleep and,
potentially, a store to q->task_list are in flight on different CPUs at
the same time.  A comment in wake_nocb_leader notes that the
smp_mb__after_atomic() gives us ordering against the prior enqueue, but
that's a red herring.  In order to wake the leader, the waker's store to
nocb_leader_sleep must be visibile on the CPU where the awaking process
can observe it.

The documentation in memory-barriers.txt cautions us in this way:

   [!] Note that the memory barriers implied by the sleeper and the
   waker do _not_ order multiple stores before the wake-up with respect
   to loads of those stored values after the sleeper has called
   set_current_state().  For instance, if the sleeper does:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated)
                break;
        __set_current_state(TASK_RUNNING);
        do_something(my_data);

   and the waker does:

        my_data = value;
        event_indicated = 1;
        wake_up(&event_wait_queue);

   there's no guarantee that the change to event_indicated will be
   perceived by the sleeper as coming after the change to my_data.  In
   such a circumstance, the code on both sides must interpolate its own
   memory barriers between the separate data accesses.  Thus the above
   sleeper ought to do:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated) {
                smp_rmb();
                do_something(my_data);
        }

   and the waker should do:

        my_data = value;
        smp_wmb();
        event_indicated = 1;
        wake_up(&event_wait_queue);

  [https://www.kernel.org/doc/Documentation/memory-barriers.txt]

The RT-linux tree had a patch in its swait_head_has_waiters() function
to ensure that wakeups were not missed:

https://www.spinics.net/lists/linux-rt-users/msg10340.html

That barrier was dropped from the patch as part of getting it ready for
mainline:

http://www.spinics.net/lists/linux-rt-users/msg13691.html

The patch in the prior message seemed to have a comment, though it's
possible it was subsequently dropped.  Also note that waitqueue_active
in the current code base contains the following comment:

/**
 * waitqueue_active -- locklessly test for waiters on the queue
 * @q: the waitqueue to test for waiters
 *
 * returns true if the wait list is not empty
 *
 * NOTE: this function is lockless and requires care, incorrect usage _will_
 * lead to sporadic and non-obvious failure.
 *
 * Use either while holding wait_queue_head_t::lock or when used for wakeups
 * with an extra smp_mb() like:
 *
 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq))         if (@cond)
 *        wake_up(wq);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq, &wait);
 *
 * Because without the explicit smp_mb() it's possible for the
 * waitqueue_active() load to get hoisted over the @cond store such that we'll
 * observe an empty wait list while the waiter might not observe @cond.
 *
 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.
 */
[include/linux/wait.h]

That's exactly what we're seeing go wrong here.  The waiter didn't see
the condtion become true as it was going to sleep, and the waker didn't
see the waiter as being asleep.

Based upon all of this, it's pretty clear that we need a memory barrier
added to wake_nocb_leader().  The good news is that most other users of
swake_up(), are already using other atomic operations that cause updates
to their values to become visibile.

The fix should be something as simple as this:

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 56583e7..1322d0b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1765,8 +1765,8 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
        if (!READ_ONCE(rdp_leader->nocb_kthread))
                return;
        if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
-               /* Prior smp_mb__after_atomic() orders against prior enqueue. */
-               WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+               /* mb ensures nocb_leader_sleep is observed prior to wakeup */
+               smp_store_mb(rdp_leader->nocb_leader_sleep, false);
                swake_up(&rdp_leader->nocb_wq);
        }
 }

--- >8 cut here 8< ---

Thanks,

-K