Message-ID: <20140511203617.17152.21133.stgit@srivatsabhat.in.ibm.com>
Date:	Mon, 12 May 2014 02:06:39 +0530
From:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
To:	peterz@...radead.org, tglx@...utronix.de, mingo@...nel.org,
	tj@...nel.org, rusty@...tcorp.com.au, akpm@...ux-foundation.org,
	fweisbec@...il.com, hch@...radead.org
Cc:	mgorman@...e.de, riel@...hat.com, bp@...e.de, rostedt@...dmis.org,
	mgalbraith@...e.de, ego@...ux.vnet.ibm.com,
	paulmck@...ux.vnet.ibm.com, oleg@...hat.com, rjw@...ysocki.net,
	linux-kernel@...r.kernel.org, srivatsa.bhat@...ux.vnet.ibm.com
Subject: [PATCH v3 0/2] CPU hotplug: Fix the long-standing "IPI to offline
 CPU" issue


Hi,

There is a long-standing problem related to CPU hotplug which causes IPIs to
be delivered to offline CPUs, and the smp-call-function IPI handler code
prints out a warning whenever this is detected. Every once in a while this
(usually harmless) warning gets reported on LKML, but so far it has not been
completely fixed. Usually the solution involves finding out the IPI sender
and fixing it by adding appropriate synchronization with CPU hotplug.

However, while going through one such internal bug report, I found that
there is a significant bug in the receiver side itself (more specifically,
in stop-machine) that can lead to this problem even when the sender code
is perfectly fine. This patchset fixes that synchronization problem in the
CPU hotplug stop-machine code.

Patch 1 adds some additional debug code to the smp-call-function framework,
to help debug such issues easily.

Patch 2 modifies the stop-machine code to ensure that any IPIs sent while
the target CPU was online are noticed and handled by that CPU before it
goes offline. This avoids scenarios where IPIs are received on offline CPUs
(as long as the sender uses proper hotplug synchronization).


In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq() function.
But I was not able to find anything wrong with the block layer code. That's
when I started looking at the stop-machine code and realized that there is
a race-window which makes the IPI _receiver_ the culprit, not the sender.
Patch 2 fixes that race and hence this should put an end to most of the
hard-to-debug IPI-to-offline-CPU issues.


Changes in v3:

Rewrote patch 2 and split the MULTI_STOP_DISABLE_IRQ state into two:
MULTI_STOP_DISABLE_IRQ_INACTIVE and MULTI_STOP_DISABLE_IRQ_ACTIVE, and
used this framework to ensure that the CPU going offline always disables
its interrupts last. Suggested by Tejun Heo.
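
To illustrate the ordering that the two-state split is meant to enforce,
here is a minimal userspace sketch. It is not the code from patch 2: only
the INACTIVE/ACTIVE state names follow the description above, and every
sketch_* identifier, the 4-CPU count, and the lockstep loop are assumptions
made for illustration. It models each CPU walking through the states in
lockstep and checks that the "active" CPU (the one going offline) turns off
its interrupts strictly after all the others.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical model of the multi-stop states. Only the two
 * DISABLE_IRQ names mirror the cover letter; the rest is illustrative. */
enum sketch_state {
	SKETCH_PREPARE,
	SKETCH_DISABLE_IRQ_INACTIVE,	/* non-offlining CPUs disable IRQs */
	SKETCH_DISABLE_IRQ_ACTIVE,	/* the offlining CPU disables IRQs */
	SKETCH_RUN,
	SKETCH_NSTATES
};

#define SKETCH_NCPUS 4	/* arbitrary CPU count for the model */

struct sketch_cpu {
	bool is_active;		/* true only for the CPU going offline */
	bool irqs_enabled;
	int irq_off_step;	/* state at which IRQs were disabled, -1 if never */
};

/* Apply one lockstep state transition to one modelled CPU. */
static void sketch_step(struct sketch_cpu *cpu, enum sketch_state state)
{
	bool disable =
		(state == SKETCH_DISABLE_IRQ_INACTIVE && !cpu->is_active) ||
		(state == SKETCH_DISABLE_IRQ_ACTIVE && cpu->is_active);

	if (disable && cpu->irqs_enabled) {
		cpu->irqs_enabled = false;
		cpu->irq_off_step = (int)state;
	}
}

/* Walk all CPUs through every state and report whether the active CPU
 * disabled its interrupts strictly after every other CPU did. */
static bool sketch_active_disables_last(int active_cpu)
{
	struct sketch_cpu cpus[SKETCH_NCPUS];
	int i, s;

	assert(active_cpu >= 0 && active_cpu < SKETCH_NCPUS);

	for (i = 0; i < SKETCH_NCPUS; i++) {
		cpus[i].is_active = (i == active_cpu);
		cpus[i].irqs_enabled = true;
		cpus[i].irq_off_step = -1;
	}

	for (s = SKETCH_PREPARE; s < SKETCH_NSTATES; s++)
		for (i = 0; i < SKETCH_NCPUS; i++)
			sketch_step(&cpus[i], (enum sketch_state)s);

	for (i = 0; i < SKETCH_NCPUS; i++) {
		if (cpus[i].irqs_enabled)
			return false;	/* someone never disabled IRQs */
		if (i != active_cpu &&
		    cpus[i].irq_off_step >= cpus[active_cpu].irq_off_step)
			return false;	/* active CPU was not last */
	}
	return true;
}
```

In this model, sketch_active_disables_last() holds for any choice of the
offlining CPU, because the inactive CPUs act in the earlier state and the
active CPU acts only in the later one.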

v1 and v2:
https://lkml.org/lkml/2014/5/6/474


 Srivatsa S. Bhat (2):
      smp: Print more useful debug info upon receiving IPI on an offline CPU
      CPU hotplug, stop-machine: Plug race-window that leads to "IPI-to-offline-CPU"


 kernel/smp.c          |   18 ++++++++++++++----
 kernel/stop_machine.c |   25 ++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 7 deletions(-)


Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center
