linux-kernel - Re: [PATCH] rcu/nocb: Fix misordered rcu


Message-ID: <20211018183604.GT880162@paulmck-ThinkPad-P17-Gen-1>
Date:   Mon, 18 Oct 2021 11:36:04 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Frederic Weisbecker <frederic@...nel.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Uladzislau Rezki <urezki@...il.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Neeraj Upadhyay <neeraju@...eaurora.org>,
        Josh Triplett <josh@...htriplett.org>,
        Joel Fernandes <joel@...lfernandes.org>, rcu@...r.kernel.org
Subject: Re: [PATCH] rcu/nocb: Fix misordered rcu_barrier() while
 (de-)offloading

On Mon, Oct 18, 2021 at 07:42:42PM +0200, Frederic Weisbecker wrote:
> On Mon, Oct 18, 2021 at 09:18:14AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 18, 2021 at 01:32:59PM +0200, Frederic Weisbecker wrote:
> > > When an rdp is in the process of (de-)offloading, rcu_core() and the
> > > nocb kthreads can process callbacks at the same time. This leaves many
> > > possible scenarios in which an rcu barrier callback executes before
> > > the callbacks queued ahead of it. Here is one such example:
> > > 
> > >             CPU 0                                  CPU 1
> > >        --------------                         ---------------
> > >      call_rcu(callbacks1)
> > >      call_rcu(callbacks2)
> > >      // move callbacks1 and callbacks2 on the done list
> > >      rcu_advance_callbacks()
> > >      call_rcu(callbacks3)
> > >      rcu_barrier_func()
> > >          rcu_segcblist_entrain(...)
> > >                                             nocb_cb_wait()
> > >                                                 rcu_do_batch()
> > >                                                     callbacks1()
> > >                                                     cond_resched_tasks_rcu_qs()
> > >      // move callbacks3 and rcu_barrier_callback()
> > >      // on the done list
> > >      rcu_advance_callbacks()
> > >      rcu_core()
> > >          rcu_do_batch()
> > >              callbacks3()
> > >              rcu_barrier_callback()
> > >                                                     //MISORDERING
> > >                                                     callbacks2()
> > > 
> > > Fix this by preventing two concurrent rcu_do_batch() calls on the same rdp
> > > as long as an rcu barrier callback is pending somewhere.
> > > 
> > > Reported-by: Paul E. McKenney <paulmck@...nel.org>
> > > Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
> > > Cc: Josh Triplett <josh@...htriplett.org>
> > > Cc: Joel Fernandes <joel@...lfernandes.org>
> > > Cc: Boqun Feng <boqun.feng@...il.com>
> > > Cc: Neeraj Upadhyay <neeraju@...eaurora.org>
> > > Cc: Uladzislau Rezki <urezki@...il.com>
> > 
> > Yow!
> > 
> > But how does the (de-)offloading procedure's acquisition of
> > rcu_state.barrier_mutex play into this?  In theory, that mutex was
> > supposed to prevent these sorts of scenarios.  In practice, it sounds
> > like the shortcomings in this theory should be fully explained so that
> > we don't get similar bugs in the future.  ;-)
> 
> I think you're right. The real issue is something I wanted to
> fix next: RCU_SEGCBLIST_RCU_CORE isn't cleared when nocb is enabled on
> boot so rcu_core() always runs concurrently with nocb kthreads in TREE04,
> without holding rcu_barrier mutex of course (I mean with the latest patchset).

That would do it!

> Ok forget this patch, I'm testing again with simply clearing
> RCU_SEGCBLIST_RCU_CORE on boot.

Sounds good, looking forward to it!

							Thanx, Paul