Message-ID: <20150220093000.GA22661@gmail.com>
Date:	Fri, 20 Feb 2015 10:30:00 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Rafael David Tinoco <inaddy@...ntu.com>,
	Peter Anvin <hpa@...or.com>,
	Jiang Liu <jiang.liu@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Gema Gomez <gema.gomez-solano@...onical.com>,
	Christopher Arges <chris.j.arges@...onical.com>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: Re: smp_call_function_single lockups


* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
> > <torvalds@...ux-foundation.org> wrote:
> >>
> >> Are there known errata for the x2apic?
> >
> > .. and in particular, do we still have to worry about 
> > the traditional local apic "if there are more than two 
> > pending interrupts per priority level, things get lost" 
> > problem?
> >
> > I forget the exact details. Hopefully somebody 
> > remembers.
> 
> I can't find it in the docs. I find the "two-entries per 
> vector", but not anything that is per priority level 
> (group of 16 vectors). Maybe that was the IO-APIC, in 
> which case it's immaterial for IPI's.

So if my memory serves me right, I think it was for local 
APICs, and even there mostly it was a performance issue: if 
an IO-APIC sent more than 2 IRQs per 'level' to a local 
APIC then the IO-APIC might be forced to resend those IRQs, 
leading to excessive message traffic on the relevant 
hardware bus.

( I think the 'resend' was automatic in this case, i.e. a 
  hardware fallback for a CPU-side resource shortage, and 
  it could not result in actually lost IRQs. I never saw 
  this documented properly, so people inside Intel or AMD 
  would be in a better position to comment on this ... I 
  might be misremembering this or confusing different 
  bugs. )

> However, having now mostly re-acquainted myself with the 
> APIC details, it strikes me that we do have some oddities 
> here.
> 
> In particular, a few interrupt types are very special: 
> NMI, SMI, INIT, ExtINT, or SIPI are handled early in the 
> interrupt acceptance logic, and are sent directly to the 
> CPU core, without going through the usual intermediate 
> IRR/ISR dance.
> 
> And why might this matter? It's important because it 
> means that those kinds of interrupts must *not* do the 
> apic EOI that ack_APIC_irq() does.
> 
> And we correctly don't do ack_APIC_irq() for NMI etc, but 
> it strikes me that ExtINT is odd and special.
> 
> I think we still use ExtINT for some odd cases. We used 
> to have some magic with the legacy timer interrupt, for 
> example. And I think they all go through the normal 
> "do_IRQ()" logic regardless of whether they are ExtINT or 
> not.
> 
> Now, what happens if we send an EOI for an ExtINT 
> interrupt? It basically ends up being a spurious IPI. And 
> I *think* that what normally happens is absolutely 
> nothing at all. But if in addition to the ExtINT, there 
> was a pending IPI (or other pending ISR bit set), maybe 
> we lose interrupts..

1)

I think you got it right.

So in principle the EOI acknowledgement from the OS to the 
local APIC applies to the IRQ that raised the interrupt 
and caused the vector to be executed, so it's not possible 
to ack the 'wrong' IRQ.

But technically the EOI is stateless, i.e. (as you know) 
we write a constant value to a local APIC register without 
indicating which vector or external IRQ we meant. The OS 
wants to ack 'the IRQ we are currently executing', but 
this leaves the situation a bit murky in cases where, for 
example, an IRQ handler enables IRQs and another IRQ comes 
in and stays unacked.

So I _think_ it's not possible to accidentally acknowledge 
a pending IRQ that has not been issued to the CPU yet 
(unless we have hardirqs enabled) just by writing stray 
EOIs to the local APIC. So in that sense the ExtINT irq0 
case should be mostly harmless.

But I could be wrong :-/
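
For reference, this is roughly what the stateless EOI 
write boils down to - a minimal sketch modeled on what 
ack_APIC_irq() ends up doing for the xAPIC (register 
offset per the SDM; the 'apic_base' mapping here is a 
made-up stand-in for the kernel's fixmap of the local 
APIC page):

	#include <stdint.h>

	#define APIC_EOI	0xB0	/* EOI register offset in the xAPIC MMIO page */
	#define APIC_EOI_ACK	0x0	/* constant write - no vector is named */

	/* Hypothetical pointer to the mapped local APIC page: */
	static volatile uint32_t *apic_base;

	static inline void apic_eoi(void)
	{
		/*
		 * Stateless by design: the APIC itself clears the
		 * highest-priority ISR bit that is currently set;
		 * the CPU has no way to say which vector it meant
		 * to acknowledge.
		 */
		apic_base[APIC_EOI / sizeof(uint32_t)] = APIC_EOI_ACK;
	}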

2)

So my suggestion for this bug would be:

The 'does a stray EOI matter' question could also be tested 
by deliberately writing two EOIs instead of just one - does 
this trigger the bug faster?
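
Something like this (a hypothetical debug hack on top of 
the apic_eoi() sketch above, purely as a bisection aid, 
not a fix):

	/*
	 * Debug experiment: ack every IRQ twice. If a stray EOI
	 * can eat an unrelated pending ISR bit, doubling every
	 * ack should make the lockup reproduce much faster.
	 */
	static inline void ack_APIC_irq_twice(void)
	{
		apic_eoi();
		apic_eoi();		/* deliberate stray EOI */
	}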

Then perhaps try to make sure that no hardirqs ever get 
enabled in an irq handler, and figure out whether any of 
the IRQs in question are edge-triggered - but AFAICS it 
could be 'any' IRQ handler or flow causing the problem, 
right?
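
The hardirqs side could be instrumented with an assertion 
along these lines (a sketch only - 'real_handler' is a 
stand-in for whatever handler is being wrapped; the irq 
core has similar checks already):

	/* Warn once if a handler returns with hardirqs enabled: */
	static irqreturn_t checked_handler(int irq, void *dev_id)
	{
		irqreturn_t ret = real_handler(irq, dev_id);

		WARN_ONCE(!irqs_disabled(),
			  "irq %d: handler enabled hardirqs\n", irq);
		return ret;
	}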

3)

I also fully share your frustration about the level of 
obfuscation the various APIC drivers display today.

The lack of a simple single-IPI implementation is annoying 
as well - when that injury was first inflicted with 
clustered APICs I tried to resist, but AFAICR there were 
some good hardware arguments why it could not be kept and 
I gave up.

If you agree then I can declare a feature stop for new 
hardware support (that isn't a stop-ship issue for users) 
until it's all cleaned up for real - Thomas has already 
started some of that work.

> .. and it's entirely possible that I'm just completely 
> full of shit. Who is the poor bastard who has worked most 
> with things like ExtINT, and can educate me? I'm adding 
> Ingo, hpa and Jiang Liu as primary contacts..

So the buck stops at my desk, but any help is welcome!

Thanks,

	Ingo