linux-kernel - Re: smp_call_function

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+55aFyspvwLbkqktHHib7LB7pWW9a1CS-rc4oLJoz_Z9kQSRw@mail.gmail.com>
Date:	Fri, 20 Feb 2015 08:49:36 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	Christopher Arges <chris.j.arges@...onical.com>,
	"the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: smp_call_function_single lockups

On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@...nel.org> wrote:
>
> So if my memory serves me right, I think it was for local
> APICs, and even there mostly it was a performance issue: if
> an IO-APIC sent more than 2 IRQs per 'level' to a local
> APIC then the IO-APIC might be forced to resend those IRQs,
> leading to excessive message traffic on the relevant
> hardware bus.

Hmm. I have a distinct memory of interrupts actually being lost, but I
really can't find anything to support that memory, so it's probably
some drug-induced confusion of mine. I don't find *anything* about
interrupt "levels" any more in modern Intel documentation on the APIC,
but maybe I missed something. But it might all have been an IO-APIC
thing.

> I think you got it right.
>
> So the principle of EOI acknowledgement from the OS to the
> local APIC is specific to the IRQ that raised the interrupt
> and caused the vector to be executed, so it's not possible
> to ack the 'wrong' IRQ.

I agree - but only for cases where we actualyl should ACK in the first
place. The LAPIC knows what its own priorities were, so it always just
clears the highest-priority ISR bit.

> So I _think_ it's not possible to accidentally acknowledge
> a pending IRQ that has not been issued to the CPU yet
> (unless we have hardirqs enabled), just by writing stray
> EOIs to the local APIC. So in that sense the ExtInt irq0
> case should be mostly harmless.

It must clearly be mostly harmless even if I'm right, since we've
always done the ACK for it, as far as I can tell. So you're probably
right, and it doesn't matter.

In particular, it would depend on exactly when the ISR bit actually
gets set in the APIC. If it gets set only after the CPU has actually
*accepted* the interrupt, then it should not be possible for a
spurious EOI write to screw things up, because it's still happening in
the context of the previously executing interrupt, so a new ISR bit
hasn't been set yet, and any old ISR bits must be lower-priority than
the currently executing interrupt routine (that were interrupted by a
higher-priority one).

That sounds like the likely scenario. It just worries me. And my test
patch did actually trigger, even if it only seemed to trigger during
device probing (when the probing itself probably caused and then
cleared an interrupt).

> I also fully share your frustration about the level of
> obfuscation the various APIC drivers display today.
>
> The lack of a simple single-IPI implementation is annoying
> as well - when that injury was first inflicted with
> clustered APICs I tried to resist, but AFAICR there were
> some good hardware arguments why it cannot be kept and I
> gave up.
>
> If you agree then I can declare a feature stop for new
> hardware support (that isn't a stop-ship issue for users)
> until it's all cleaned up for real, and Thomas started some
> of that work already.

Well, the attached patch for that seems pretty trivial. And seems to
work for me (my machine also defaults to x2apic clustered mode), and
allows the APIC code to start doing a "send to specific cpu" thing one
by one, since it falls back to the send_IPI_mask() function if no
individual CPU IPI function exists.

NOTE! There's a few cases in arch/x86/kernel/apic/vector.c that also
do that "apic->send_IPI_mask(cpumask_of(i), .." thing, but they aren't
that important, so I didn't bother with them.

NOTE2! I've tested this, and it seems to work, but maybe there is
something seriously wrong. I skipped the "disable interrupts" part
when doing the "send_IPI", for example, because I think it's entirely
unnecessary for that case. But this has certainly *not* gotten any
real stress-testing.

But /proc/interrupts shows thousands of rescheduling IPI's, so this
should all have triggered.

                                Linus

View attachment "patch.diff" of type "text/plain" (2793 bytes)