lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140526110743.16203.18186.stgit@srivatsabhat.in.ibm.com>
Date:	Mon, 26 May 2014 16:38:06 +0530
From:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
To:	peterz@...radead.org, tglx@...utronix.de, mingo@...nel.org,
	tj@...nel.org, rusty@...tcorp.com.au, akpm@...ux-foundation.org,
	fweisbec@...il.com, hch@...radead.org
Cc:	mgorman@...e.de, riel@...hat.com, bp@...e.de, rostedt@...dmis.org,
	mgalbraith@...e.de, ego@...ux.vnet.ibm.com,
	paulmck@...ux.vnet.ibm.com, oleg@...hat.com, rjw@...ysocki.net,
	linux-kernel@...r.kernel.org, srivatsa.bhat@...ux.vnet.ibm.com
Subject: [PATCH v7 0/2] CPU hotplug: Fix the long-standing "IPI to offline
 CPU" issue


Hi,

There is a long-standing problem related to CPU hotplug which causes IPIs to
be delivered to offline CPUs, and the smp-call-function IPI handler code
prints out a warning whenever this is detected. Every once in a while this
(usually harmless) warning gets reported on LKML, but so far it has not been
completely fixed. Usually the solution involves finding out the IPI sender
and fixing it by adding appropriate synchronization with CPU hotplug.

However, while going through one such internal bug reports, I found that
there is a significant bug in the receiver side itself (more specifically,
in stop-machine) that can lead to this problem even when the sender code
is perfectly fine. This patchset handles that scenario to ensure that a
CPU doesn't go offline with callbacks still pending.

Patch 1 adds some additional debug code to the smp-call-function framework,
to help debug such issues easily.

Patch 2 adds a mechanism to flush any pending smp-call-function callbacks
queued on the CPU going offline (including those callbacks for which the
IPIs from the source CPUs might not have arrived in time at the outgoing CPU).
This ensures that a CPU never goes offline with work still pending. Also,
the warning condition in smp-call-function IPI handler code is modified to
trigger only if an IPI is received on an offline CPU *and* it still has
pending callbacks to execute, since that's the only remaining buggy scenario
after applying this patch.


In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq() function.
But I was not able to find anything wrong with the block layer code. That's
when I started looking at the stop-machine code and realized that there is
a race-window which makes the IPI _receiver_ the culprit, not the sender.
Patch 2 handles this scenario and hence this should put an end to most of
the hard-to-debug IPI-to-offline-CPU issues.



Changes in v7:
* Modified the warning condition in smp-call-function IPI handler code, such
  that it triggers only if an offline CPU got an IPI *and* it still had pending
  callbacks to execute.
* Completely dropped the patch that modified the stop-machine code to
  introduce additional states to order the disabling of interrupts on various
  CPUs. This strict ordering is not necessary any more after the first change.
  Thanks to Frederic Weisbecker for suggesting this enhancement.

Changes in v6:
Modified Patch 3 to flush the pending callbacks from CPU_DYING notifier
instead of stop-machine directly, so that only the CPU hotplug path will
run this code, instead of everybody who uses stop-machine. Suggested by
Peter Zijlstra.

Changes in v5:
Added Patch 3 to flush out any pending smp-call-function callbacks on the
outgoing CPU, as suggested by Frederic Weisbecker.

Changes in v4:
Rewrote a comment in Patch 2 and reorganized the code for better readability.

Changes in v3:
Rewrote patch 2 and split the MULTI_STOP_DISABLE_IRQ state into two:
MULTI_STOP_DISABLE_IRQ_INACTIVE and MULTI_STOP_DISABLE_IRQ_ACTIVE, and
used this framework to ensure that the CPU going offline always disables
its interrupts last. Suggested by Tejun Heo.

v1 and v2:
https://lkml.org/lkml/2014/5/6/474


 Srivatsa S. Bhat (2):
      smp: Print more useful debug info upon receiving IPI on an offline CPU
      CPU hotplug, smp: Flush any pending IPI callbacks before CPU offline


 kernel/smp.c |   68 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 8 deletions(-)


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ