Message-ID: <20120627053307.GA14913@gmail.com>
Date: Wed, 27 Jun 2012 07:33:07 +0200
From: Ingo Molnar <mingo@...nel.org>
To: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc: mingo@...e.hu, linux-kernel@...r.kernel.org, josh@...htriplett.org,
tglx@...utronix.de, sbw@....edu
Subject: Re: [GIT PULL rcu/urgent] Fix for RCU-related hang
* Paul E. McKenney <paulmck@...ux.vnet.ibm.com> wrote:
> Hello, Ingo,
>
> This series has a single patch that fixes a system hang that can occur
> in perhaps unusual but very real circumstances. This hang occurs
> because of a very stupid bug of mine introduced in commit b1420f1c
> (Make rcu_barrier() less disruptive) that can cause CPUs to miscount
> RCU callbacks. The sequence of events leading to the hang is as follows:
>
> 1.  A CPU miscounts its callbacks.
> 2.  That CPU invokes all of its callbacks, so that its callback
>     list is empty, but the callback count is nonzero.
> 3.  That CPU goes offline.  Because its callback list is empty,
>     RCU's CPU-hotplug CPU_DEAD notifiers leave both the list and
>     the count alone.  (In contrast, had the list been non-empty,
>     RCU's CPU_DEAD notifiers would have emptied the list and
>     zeroed the count.)
> 4.  One of the remaining CPUs executes one of the rcu_barrier()
>     family of primitives.  The rcu_barrier() primitive notes
>     that the offline CPU has a non-zero count of callbacks, and
>     therefore hangs waiting for this count to reach zero.  The
>     theory behind the indefinite wait is that the only reason that
>     an offline CPU can have a non-zero number of RCU callbacks is
>     that the CPU's CPU_DEAD notifiers have not yet executed.
>     But they already have executed, so the offlined CPU's callback
>     count will remain non-zero until it is brought back online,
>     in other words, perhaps never.
>
> However, this bug is likely to pass a combined rcutorture/CPU-hotplug
> stress test because offlined CPUs tend to be brought back online
> reasonably quickly. For the rcutorture tests to fail, the system must be
> in the state indicated by step #3 above at the time the "rmmod rcutorture"
> executes.
>
> The fix is simply to prevent the miscounting.
>
> This change is available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/urgent
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Paul E. McKenney (1):
> rcu: Stop rcu_do_batch() from multiplexing the "count" variable
>
> kernel/rcutree.c | 14 +++++++-------
> 1 files changed, 7 insertions(+), 7 deletions(-)

Pulled, thanks Paul!

	Ingo
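
For readers following the four-step sequence quoted above, here is a
minimal user-space sketch in C of how a stale callback count on an
offline CPU could stall an rcu_barrier()-style wait. It is a toy model
under stated assumptions, not kernel code: struct toy_cpu,
toy_invoke_all(), toy_cpu_dead(), toy_barrier_would_hang() and
toy_scenario() are all hypothetical names, and the "lost" parameter
merely stands in for the miscount, whose actual mechanism is not
described in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-CPU state (hypothetical; not the kernel's data structure). */
struct toy_cpu {
	int list_len;	/* callbacks actually queued on this CPU */
	int count;	/* what the CPU believes is queued */
	bool online;
};

/*
 * Steps 1 and 2: invoke every queued callback.  "lost" models the
 * miscount: that many invocations are never subtracted from count.
 */
static void toy_invoke_all(struct toy_cpu *c, int lost)
{
	int invoked = c->list_len;

	assert(lost >= 0 && lost <= invoked);
	c->list_len = 0;
	c->count -= invoked - lost;
}

/*
 * Step 3: the CPU goes offline.  As described in the quoted text, the
 * CPU_DEAD-style cleanup touches list and count only if the list is
 * non-empty.  (This model simply drops any remaining callbacks.)
 */
static void toy_cpu_dead(struct toy_cpu *c)
{
	c->online = false;
	if (c->list_len != 0) {
		c->list_len = 0;
		c->count = 0;
	}
}

/*
 * Step 4: a barrier that waits for every offline CPU's count to reach
 * zero would wait indefinitely if any such count is stale.
 */
static bool toy_barrier_would_hang(const struct toy_cpu *cpus, int n)
{
	for (int i = 0; i < n; i++)
		if (!cpus[i].online && cpus[i].count != 0)
			return true;
	return false;
}

/* Run steps 1-4 on a two-CPU system; CPU 0 starts with 3 callbacks. */
static bool toy_scenario(int lost)
{
	struct toy_cpu cpus[2] = {
		{ .list_len = 3, .count = 3, .online = true },
		{ .list_len = 0, .count = 0, .online = true },
	};

	toy_invoke_all(&cpus[0], lost);
	toy_cpu_dead(&cpus[0]);
	return toy_barrier_would_hang(cpus, 2);
}
```

In this model, toy_scenario(1) reports a hang because CPU 0 goes
offline with an empty list but a count of 1, while toy_scenario(0)
(no miscount) does not.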