linux-kernel - Re: Device hang when offlining a CPU due to IRQ misrouting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20070619180003.GE7160@linux-os.sc.intel.com>
Date:	Tue, 19 Jun 2007 11:00:03 -0700
From:	"Siddha, Suresh B" <suresh.b.siddha@...el.com>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>
Cc:	"Darrick J. Wong" <djwong@...ibm.com>,
	"Siddha, Suresh B" <suresh.b.siddha@...el.com>,
	linux-kernel@...r.kernel.org
Subject: Re: Device hang when offlining a CPU due to IRQ misrouting

On Tue, Jun 19, 2007 at 11:54:45AM -0600, Eric W. Biederman wrote:
> "Darrick J. Wong" <djwong@...ibm.com> writes:
> 
> > On Mon, Jun 18, 2007 at 04:54:34PM -0700, Siddha, Suresh B wrote:
> >
> >> > <call to set_affinity>
> >> > [  256.298787] irq=4341 affinity=d
> >> > <ethernet on irq 4341 stops working>
> >> 
> >> And just to make sure, at this point, your MSI irq 4341 affinity
> >> (/proc/irq/4341/smp_affinity) still points to '2'?
> >
> > Actually, it's 0xD.  From the kernel's perspective the mask has been
> > updated (and I even stuck a printk into set_msi_irq_affinity to verify
> > that the writes are happening) but ... the hardware doesn't seem to
> > reflect this.  I also tried putting read_msi_msg right afterwards to
> > compare contents, though it complained about all the MSIs _except_ for
> > 4341.  (Of course, I could just be way off on the effectiveness of
> > that.)
> 
> The fact that MSI interrupts are having problems is odd.  It is possible
> that we still have a bug in there somewhere but msi interrupts should
> be safe to migrate outside of irq context (no known hardware bugs).
> As we can actually synchronize with the irq source and eliminate all
> of the migration races.
> 
> The non-msi case requires hitting a hardware race that is rare enough
> you should not normally have problems.

Yep. But Darrick's seems to say, problem happens consistently.

Anyhow, Darrick there is a general bug in this area, can you try this and
see if it helps?

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..a0e11c9 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -144,17 +144,35 @@ void fixup_irqs(cpumask_t map)
 
 	for (irq = 0; irq < NR_IRQS; irq++) {
 		cpumask_t mask;
+		int break_affinity = 0;
+		int set_affinity = 1;
+
 		if (irq == 2)
 			continue;
 
+		/* irq's are disabled at this point */
+		spin_lock(&irq_desc[irq].lock);
+
 		cpus_and(mask, irq_desc[irq].affinity, map);
 		if (any_online_cpu(mask) == NR_CPUS) {
-			printk("Breaking affinity for irq %i\n", irq);
+			break_affinity = 1;
 			mask = map;
 		}
+
+		irq_desc[irq].chip->mask(irq);
+
 		if (irq_desc[irq].chip->set_affinity)
 			irq_desc[irq].chip->set_affinity(irq, mask);
 		else if (irq_desc[irq].action && !(warned++))
+			set_affinity = 0;
+
+		irq_desc[irq].chip->unmask(irq);
+
+		spin_unlock(&irq_desc[irq].lock);
+
+		if (break_affinity && set_affinity)
+			printk("Broke affinity for irq %i\n", irq);
+		else if (!set_affinity)
 			printk("Cannot set affinity for irq %i\n", irq);
 	}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/