Date:	Fri, 23 Feb 2007 03:51:04 -0700
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	linux-kernel@...r.kernel.org
Cc:	Zwane Mwaikambo <zwane@...radead.org>,
	Ashok Raj <ashok.raj@...el.com>, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...l.org>,
	"Lu, Yinghai" <yinghai.lu@....com>,
	Natalie Protasevich <protasnb@...il.com>,
	Andi Kleen <ak@...e.de>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Conclusions from my investigation about ioapic programming


Ok. This is just an email to summarize my findings after investigating
the ioapic programming.

The ioapics on the E75xx chipset do have issues if you attempt to
reprogram them outside of the irq handler.  I have on several
occasions caused the state machine to get stuck such that an
individual ioapic entry was no longer capable of delivering
interrupts.  I suspect the remote IRR bit was stuck on, such that
switching the irq to edge triggered and back to level triggered would
not clear it, but I did not confirm this.  I just know that I was
switching the irq between level and edge triggered with the irq
masked and the irq did not fire.


The ioapics on the AMD 8xxx chipset do have issues if you attempt
to reprogram them outside of the irq handler.  I wound up with
remote IRR set and never clearing.  But by temporarily switching
the irq to edge triggered while it was masked I could clear
this condition.

I could not hit verifiable bugs in the ioapics on the Nforce4
chipset.  Amazingly, it is the one part of that chipset that I can't
find issues with.



I did find an algorithm that will work successfully for migrating
irqs in process context if you have an ioapic that follows pci
ordering rules.  In particular, the properties the algorithm depends
on are that reads guarantee outstanding writes are flushed, and that
in this context irqs in flight are considered writes.  I have
assumed that, to devices outside of the cpu asic, the cpu and the
local apic appear as the same device.

The algorithm was:
- Be running with interrupts enabled in process context.
- Mask the ioapic.
- Read the ioapic to flush outstanding writes (irqs in flight) to the
  local apic.
- Read the local apic to flush outstanding irqs to be sent to the cpu.

- Now that all of the irqs have been delivered and the irq is masked,
  that irq is finally quiescent.

- With the irq quiescent it is safe to reprogram the interrupt
  controller and the irq reception data structures.

There were a lot more details but that was the essence.
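
To make the sequence concrete, here is a rough sketch of it in code.
This is illustrative only: the helper names approximate the x86
io_apic helpers of this era, and locking and the per pin details are
omitted.

static void quiesce_level_irq(unsigned int irq, int apic, int pin)
{
	/* Step 1: we must be in process context with irqs enabled. */
	BUG_ON(irqs_disabled());

	/* Step 2: mask the pin so no new irqs are started. */
	mask_IO_APIC_irq(irq);

	/* Step 3: read the ioapic.  On a part that honors pci
	 * ordering rules this flushes any irq message in flight
	 * (a write) out to the local apic. */
	io_apic_read(apic, 0x10 + 2 * pin);

	/* Step 4: read the local apic to flush anything it is
	 * still holding toward the cpu core. */
	apic_read(APIC_ID);

	/* On hardware that follows the ordering rules the irq is
	 * now quiescent and may be reprogrammed. */
}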

What I discovered was that, except on the nforce chipset, masking the
ioapic and then issuing a read did not behave as if the interrupts
were flushed to the local apic.

I did not look closely enough to tell if local apics suffered from
this issue.  With local apics, at least a read was necessary before
you could guarantee the local apic would deliver pending irqs.  A
workaround on the local apics is to simply issue a low priority
interrupt as an IPI and wait for it to be processed.  This guarantees
that all higher priority interrupts have been flushed from the apic,
and that the local apic has processed interrupts.
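
A sketch of that workaround, assuming a send_IPI_self() style
primitive; the spare low priority vector and the flag based handshake
here are made up for illustration:

static volatile int lapic_drained;

/* Handler wired to a spare, lowest priority vector. */
static void lapic_drain_interrupt(void)
{
	lapic_drained = 1;
	ack_APIC_irq();
}

static void drain_local_apic(void)
{
	lapic_drained = 0;
	send_IPI_self(LAPIC_DRAIN_VECTOR);	/* low priority */
	while (!lapic_drained)
		cpu_relax();
	/* Our low priority irq has run, so every higher priority
	 * irq pending in the local apic was delivered ahead of it. */
}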

For ioapics, because they cannot be stimulated from the cpu side to
send any irq, no similar workaround was possible.



** Conclusions.

* IRQs must be reprogrammed in interrupt context.

The result of this investigation is that I am convinced we need
to perform the irq migration activities in interrupt context, although
I am not convinced it is completely safe.  I suspect multiple irqs
firing closely enough to each other may hit the same issues as
migrating irqs from process context.  However, the odds are on our
side when we are in irq context.

The reasoning for this is simply that:
- Before a level triggered irq can be safely reprogrammed, its
  remote IRR bit must be cleared by the irq being acknowledged.

- There is no generally effective way, short of receiving an
  additional irq, to ensure that the irq handler has run.  Polling
  the ioapic's remote IRR bit does not work.
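
In code, this conclusion amounts to deferring the actual
reprogramming until the next irq arrives, roughly in the shape of the
generic pending irq migration logic.  The field and flag names below
are illustrative only:

/* Process context: just record the request. */
static void set_affinity_deferred(unsigned int irq, cpumask_t mask)
{
	irq_desc[irq].pending_mask = mask;
	irq_desc[irq].status |= IRQ_MOVE_PENDING;
}

/* Irq context: the irq has just been acknowledged, so the remote
 * IRR bit is known to be clear and reprogramming is safe. */
static void finish_pending_move(unsigned int irq)
{
	struct irq_desc *desc = irq_desc + irq;

	if (!(desc->status & IRQ_MOVE_PENDING))
		return;
	desc->chip->mask(irq);
	desc->chip->set_affinity(irq, desc->pending_mask);
	desc->status &= ~IRQ_MOVE_PENDING;
	desc->chip->unmask(irq);
}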


* CPU hotplug is currently very buggy.

Irq migration in the cpu hotplug case is a serious problem.  If we can
only safely migrate irqs from interrupt context and we cannot control
when those interrupts fire, then we cannot bound the amount of time it
will take to migrate the irqs away from a cpu.  The current cpu
hotplug code calls chip->set_affinity directly, which is wrong, as it
does not take the necessary locks, and it does not attempt to delay
execution until we are in irq context.

* Only an additional irq can signal the completion of an irq movement.

The attempt to rebuild the irq migration code from first principles
did bear some fruit.  I asked the question: "When is it safe to tear
down the data structures for irq movement?"  The only answer I have
is when I have received an irq provably generated after the irq was
reprogrammed.  This is because the only way I can reliably synchronize
with irq delivery from an apic is to receive an additional irq.
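
A sketch of what that implies for the tear down path: the data for
the old programming can only be freed from the irq path, once an irq
arrives that proves the new programming is in effect.  The structure
and the free_old_vector() helper are hypothetical:

struct irq_move {
	int move_in_progress;
	int old_vector;		/* stays reserved until proven idle */
	int new_vector;
};

/* Called early in the irq entry path for this irq. */
static void irq_complete_move(struct irq_move *m, int vector)
{
	if (!m->move_in_progress)
		return;
	/* An irq arriving on the new vector can only have been
	 * produced after the reprogramming, so nothing can still
	 * be in flight to the old vector. */
	if (vector == m->new_vector) {
		free_old_vector(m->old_vector);	/* hypothetical */
		m->move_in_progress = 0;
	}
}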

Currently this is a problem both for cpu hotplug on x86_64 and i386
and for general irq migration on x86_64.

Patches to follow shortly.

Eric