Message-ID: <20211018174242.GA450204@lothringen>
Date: Mon, 18 Oct 2021 19:42:42 +0200
From: Frederic Weisbecker <frederic@...nel.org>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
Uladzislau Rezki <urezki@...il.com>,
Boqun Feng <boqun.feng@...il.com>,
Neeraj Upadhyay <neeraju@...eaurora.org>,
Josh Triplett <josh@...htriplett.org>,
Joel Fernandes <joel@...lfernandes.org>, rcu@...r.kernel.org
Subject: Re: [PATCH] rcu/nocb: Fix misordered rcu_barrier() while (de-)offloading

On Mon, Oct 18, 2021 at 09:18:14AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 18, 2021 at 01:32:59PM +0200, Frederic Weisbecker wrote:
> > When an rdp is in the process of (de-)offloading, rcu_core() and the
> > nocb kthreads can process callbacks at the same time. This leaves many
> > possible scenarios in which an rcu barrier callback executes before
> > the callbacks queued ahead of it. Here is one such example:
> >
> >             CPU 0                                  CPU 1
> >        --------------                         ---------------
> >        call_rcu(callbacks1)
> >        call_rcu(callbacks2)
> >        // move callbacks1 and callbacks2 on the done list
> >        rcu_advance_callbacks()
> >        call_rcu(callbacks3)
> >        rcu_barrier_func()
> >            rcu_segcblist_entrain(...)
> >                                               nocb_cb_wait()
> >                                                   rcu_do_batch()
> >                                                       callbacks1()
> >                                                       cond_resched_tasks_rcu_qs()
> >        // move callbacks3 and rcu_barrier_callback()
> >        // on the done list
> >        rcu_advance_callbacks()
> >        rcu_core()
> >            rcu_do_batch()
> >                callbacks3()
> >                rcu_barrier_callback()
> >                                                       //MISORDERING
> >                                                       callbacks2()
> >
> > Fix this by preventing two concurrent rcu_do_batch() invocations on
> > the same rdp as long as an rcu barrier callback is pending somewhere.
> >
> > Reported-by: Paul E. McKenney <paulmck@...nel.org>
> > Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
> > Cc: Josh Triplett <josh@...htriplett.org>
> > Cc: Joel Fernandes <joel@...lfernandes.org>
> > Cc: Boqun Feng <boqun.feng@...il.com>
> > Cc: Neeraj Upadhyay <neeraju@...eaurora.org>
> > Cc: Uladzislau Rezki <urezki@...il.com>
>
> Yow!
>
> But how does the (de-)offloading procedure's acquisition of
> rcu_state.barrier_mutex play into this? In theory, that mutex was
> supposed to prevent these sorts of scenarios. In practice, it sounds
> like the shortcomings in this theory should be fully explained so that
> we don't get similar bugs in the future. ;-)

I think you're right. The real issue is something I wanted to
fix next: RCU_SEGCBLIST_RCU_CORE isn't cleared when nocb is enabled on
boot, so rcu_core() always runs concurrently with the nocb kthreads in
TREE04, without holding the rcu_barrier mutex of course (I mean with the
latest patchset).

OK, forget this patch. I'm testing again with simply clearing
RCU_SEGCBLIST_RCU_CORE on boot.
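
As a rough illustration of that idea, here is a minimal userspace model,
not the actual kernel change. Every name below (the MODEL_* flags,
struct model_cblist and the model_*() helpers) is hypothetical, and the
bit values are arbitrary assumptions. It only shows the intended
invariant: a callback list that starts out offloaded at boot should not
also advertise that rcu_core() may process it.

```c
/* Userspace model only. All names and bit values here are assumptions
 * made for illustration; none of this is kernel code. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_RCU_CORE   (1u << 0)  /* rcu_core() may process callbacks */
#define MODEL_OFFLOADED  (1u << 1)  /* nocb kthreads process callbacks */

struct model_cblist {
	unsigned int flags;
};

/* True if the rcu_core() path is allowed to run callbacks. */
static bool model_rcu_core_may_run(const struct model_cblist *cl)
{
	return (cl->flags & MODEL_RCU_CORE) != 0;
}

/* True if the nocb kthread path is allowed to run callbacks. */
static bool model_nocb_may_run(const struct model_cblist *cl)
{
	return (cl->flags & MODEL_OFFLOADED) != 0;
}

/* Hypothetical boot-time setup: the rcu_core() flag is set by default
 * and cleared again when the list starts out offloaded, so that only
 * one of the two paths is enabled once boot is done. */
static void model_boot_init(struct model_cblist *cl, bool offloaded_at_boot)
{
	cl->flags = MODEL_RCU_CORE;
	if (offloaded_at_boot) {
		cl->flags |= MODEL_OFFLOADED;
		cl->flags &= ~MODEL_RCU_CORE;
	}
	/* Outside of a (de-)offloading transition, exactly one path runs. */
	assert(model_rcu_core_may_run(cl) != model_nocb_may_run(cl));
}
```

In this model, calling model_boot_init() with offloaded_at_boot set leaves
only the nocb path enabled, which is the state the boot-time clearing is
meant to restore.
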
Thanks.