linux-kernel - Re: rcu: WARNING in rcu_seq

Message-ID: <CACT4Y+bW13N50TuzCr68VbYRQxmzW0xhi5p0cezit1GQHOWAfQ@mail.gmail.com>
Date:   Tue, 7 Mar 2017 08:05:19 +0100
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     Paul McKenney <paulmck@...ux.vnet.ibm.com>
Cc:     josh@...htriplett.org, Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        jiangshanlai@...il.com, LKML <linux-kernel@...r.kernel.org>,
        syzkaller <syzkaller@...glegroups.com>
Subject: Re: rcu: WARNING in rcu_seq_end

On Tue, Mar 7, 2017 at 12:08 AM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
>> >> >> >> Hello,
>> >> >> >>
>> >> >> >> Paul, you wanted bugs in rcu.
>> >> >> >
>> >> >> > Well, whether I want them or not, I must deal with them.  ;-)
>> >> >> >
>> >> >> >> I've got this WARNING while running syzkaller fuzzer on
>> >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760:
>> >> >> >>
>> >> >> >> ------------[ cut here ]------------
>> >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
>> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
>> >> >> >> Kernel panic - not syncing: panic_on_warn set ...
>> >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
>> >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >> >> >> Workqueue: events wait_rcu_exp_gp
>> >> >> >> Call Trace:
>> >> >> >>  __dump_stack lib/dump_stack.c:15 [inline]
>> >> >> >>  dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
>> >> >> >>  panic+0x1fb/0x412 kernel/panic.c:179
>> >> >> >>  __warn+0x1c4/0x1e0 kernel/panic.c:540
>> >> >> >>  warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
>> >> >> >>  rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
>> >> >> >>  rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
>> >> >> >>  rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
>> >> >> >>  rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
>> >> >> >>  wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
>> >> >> >>  process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
>> >> >> >>  worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
>> >> >> >>  kthread+0x326/0x3f0 kernel/kthread.c:227
>> >> >> >>  ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
>> >> >> >> Dumping ftrace buffer:
>> >> >> >>    (ftrace buffer empty)
>> >> >> >> Kernel Offset: disabled
>> >> >> >> Rebooting in 86400 seconds..
>> >> >> >>
>> >> >> >>
>> >> >> >> Not reproducible. But looking at the code, shouldn't it be:
>> >> >> >>
>> >> >> >>  static void rcu_seq_end(unsigned long *sp)
>> >> >> >>  {
>> >> >> >>         smp_mb(); /* Ensure update-side operation before counter increment. */
>> >> >> >> +       WARN_ON_ONCE(!(*sp & 0x1));
>> >> >> >>         WRITE_ONCE(*sp, *sp + 1);
>> >> >> >> -       WARN_ON_ONCE(*sp & 0x1);
>> >> >> >>  }
>> >> >> >>
>> >> >> >> ?
>> >> >> >>
>> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon
>> >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this
>> >> >> >> consequently can allow start of next grace periods. Which in turn can
>> >> >> >> make the warning fire. Am I missing something?
>> >> >> >>
>> >> >> >> I don't see any other bad consequences of this. The rest of
>> >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has
>> >> >> >> returned and destroyed work on stack and next period has started and
>> >> >> >> ended, but it seems OK.
>> >> >> >
>> >> >> > I believe that this is a good change, but I don't see how it will
>> >> >> > help in this case.  BTW, may I have your Signed-off-by?
>> >> >> >
>> >> >> > The reason I don't believe that it will help is that the
>> >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that
>> >> >> > is invoked holding ->exp_mutex, and this mutex is not released until
>> >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that
>> >> >> > scheduled the workqueue handler.  So the ordering above should not matter
>> >> >> > (but I agree that your ordering is cleaner).
>> >> >> >
>> >> >> > That said, it looks like I am missing some memory barriers, please
>> >> >> > see the following patch.
>> >> >> >
>> >> >> > But what architecture did you see this on?
>> >> >>
>> >> >>
>> >> >> This is just x86.
>> >> >>
>> >> >> You seem to assume that wait_event() waits for the wakeup. It does not
>> >> >> work this way. It can return as soon as the condition becomes true
>> >> >> without ever waiting:
>> >> >>
>> >> >> #define wait_event(wq, condition)                                       \
>> >> >> do {                                                                    \
>> >> >>         might_sleep();                                                  \
>> >> >>         if (condition)                                                  \
>> >> >>                 break;                                                  \
>> >> >>         __wait_event(wq, condition);                                    \
>> >> >> } while (0)
>> >> >
>> >> > Agreed, hence my patch in the previous email.  I guess I knew that, but
>> >>
>> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done?
>> >
>> > No, there is a mutex release and acquisition that do the synchronization,
>> > but only if the wakeup has appropriate barriers.  The issue is that
>> > part of the mutex's critical section executes in a workqueue, possibly
>> > on some other CPU.
>>
>> What is that mutex? And what locks/unlocks provide synchronization? I
>> see that one uses exp_mutex and another -- exp_wake_mutex.
>
> Both of them.
>
> ->exp_mutex is acquired by the task requesting the grace period, and
> the counter's first increment is done by that task under that mutex.
> This task then schedules a workqueue, which drives forward the grace
> period.  Upon grace-period completion, the workqueue handler does the
> second increment (the one that your patch addressed).  The workqueue
> handler then acquires ->exp_wake_mutex and wakes the task that holds
> ->exp_mutex (along with all other tasks waiting for this grace period),
> and that task releases ->exp_mutex, which allows the next grace period to
> start (and the first increment for that next grace period to be carried
> out under that lock).  The workqueue handler releases ->exp_wake_mutex
> after finishing its wakeups.


Then we need the following for the case when the task requesting the
grace period does not block, right?

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d80c2587bed8..aa7ba83f6a56 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp)
 static void rcu_seq_end(unsigned long *sp)
 {
        smp_mb(); /* Ensure update-side operation before counter increment. */
-       WRITE_ONCE(*sp, *sp + 1);
+       smp_store_release(sp, *sp + 1);
        WARN_ON_ONCE(*sp & 0x1);
 }

@@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp)
  */
 static bool rcu_seq_done(unsigned long *sp, unsigned long s)
 {
-       return ULONG_CMP_GE(READ_ONCE(*sp), s);
+       return ULONG_CMP_GE(smp_load_acquire(sp), s);

 }
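
For readers who want to experiment outside the kernel, here is a minimal
userspace sketch of the same idea using C11 atomics. It is only a model
under stated assumptions, not the kernel code: the seq_* names and
ulong_cmp_ge() are hypothetical stand-ins, atomic_thread_fence() models
smp_mb(), the release store and acquire load model smp_store_release()
and smp_load_acquire(), and the bodies of seq_start() and seq_snap()
are assumed from memory rather than taken from this thread. The parity
check in seq_end() is placed before the store, as suggested earlier in
the thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a wrap-safe "a >= b" on unsigned long. */
static bool ulong_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* First increment: counter becomes odd, a grace period is in progress. */
static void seq_start(atomic_ulong *sp)
{
	unsigned long v = atomic_load_explicit(sp, memory_order_relaxed);

	assert(!(v & 0x1)); /* must be idle (even) before starting */
	atomic_store_explicit(sp, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* models smp_mb() */
}

/* Second increment: counter becomes even, the grace period is over. */
static void seq_end(atomic_ulong *sp)
{
	unsigned long v;

	atomic_thread_fence(memory_order_seq_cst); /* models smp_mb() */
	v = atomic_load_explicit(sp, memory_order_relaxed);
	assert(v & 0x1); /* check parity BEFORE publishing the new value */
	atomic_store_explicit(sp, v + 1, memory_order_release);
}

/* Snapshot: the counter value at which a full grace period has elapsed. */
static unsigned long seq_snap(atomic_ulong *sp)
{
	unsigned long s;

	s = (atomic_load_explicit(sp, memory_order_relaxed) + 3) & ~0x1UL;
	atomic_thread_fence(memory_order_seq_cst); /* models smp_mb() */
	return s;
}

/* Acquire load pairs with the release store in seq_end(). */
static bool seq_done(atomic_ulong *sp, unsigned long s)
{
	return ulong_cmp_ge(atomic_load_explicit(sp, memory_order_acquire), s);
}
```

A waiter that polls seq_done() without sleeping sees completion as soon
as the store in seq_end() is visible, so anything seq_end() does after
that store can race with the start of the next grace period. That is
why the parity check sits ahead of the store in this sketch.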