Date:	Mon, 9 Oct 2006 10:28:13 -0500
From:	"Protasevich, Natalie" <Natalie.Protasevich@...SYS.com>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>,
	"Arjan van de Ven" <arjan@...radead.org>
Cc:	"Linus Torvalds" <torvalds@...l.org>,
	"Muli Ben-Yehuda" <muli@...ibm.com>, "Ingo Molnar" <mingo@...e.hu>,
	"Thomas Gleixner" <tglx@...utronix.de>,
	"Benjamin Herrenschmidt" <benh@...nel.crashing.org>,
	"Rajesh Shah" <rajesh.shah@...el.com>, "Andi Kleen" <ak@....de>,
	"Luck, Tony" <tony.luck@...el.com>,
	"Andrew Morton" <akpm@...l.org>,
	"Linux-Kernel" <linux-kernel@...r.kernel.org>,
	"Badari Pulavarty" <pbadari@...il.com>,
	"Roland Dreier" <rdreier@...co.com>
Subject: RE: 2.6.19-rc1 genirq causes either boot hang or "do_IRQ: cannot handle IRQ -1"

 

> -----Original Message-----
> From: Eric W. Biederman [mailto:ebiederm@...ssion.com] 
> Sent: Monday, October 09, 2006 8:46 AM
> To: Arjan van de Ven
> Cc: Linus Torvalds; Muli Ben-Yehuda; Ingo Molnar; Thomas 
> Gleixner; Benjamin Herrenschmidt; Rajesh Shah; Andi Kleen; 
> Protasevich, Natalie; Luck, Tony; Andrew Morton; 
> Linux-Kernel; Badari Pulavarty; Roland Dreier
> Subject: Re: 2.6.19-rc1 genirq causes either boot hang or 
> "do_IRQ: cannot handle IRQ -1"
> 
> Arjan van de Ven <arjan@...radead.org> writes:
> 
> >> > So yes, having software say "We want to steer this particular 
> >> > interrupt to this L3 cache domain" sounds eminently sane.
> >> >
> >> > Having software specify which L1 cache domain it wants to pollute
> >> > is likely just crazy micro-management.
> >> 
> >> The current interrupt delivery abstraction in the kernel is a set of
> >> cpus an interrupt can be delivered to.  Which seems sufficient to the
> >> cause of aiming at a cache domain.  Frequently the lower levels of
> >> interrupt delivery map this to a single cpu because of hardware
> >> limitations but in certain cases we can honor a multiple cpu request.
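
For reference, the cpumask abstraction described above is what the kernel
exposes to user space through /proc/irq/<n>/smp_affinity.  Below is a
minimal sketch of steering one interrupt toward a pair of CPUs; the IRQ
number and the mask are made-up values, and the low-level delivery code
may still collapse the request to a single CPU where the hardware cannot
do better.

/*
 * Illustrative sketch only: write a hex cpumask to
 * /proc/irq/<n>/smp_affinity to ask for delivery to a set of CPUs.
 * The IRQ number (19) and mask (0x0c = CPUs 2 and 3) are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const unsigned int irq = 19;      /* hypothetical IRQ number */
	const unsigned long mask = 0x0c;  /* CPUs 2 and 3, e.g. one cache domain */
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return EXIT_FAILURE;
	}
	/* The kernel accepts a hex bitmask; low-level code may still pick a
	 * single CPU out of it if multi-CPU delivery is not supported. */
	fprintf(f, "%lx\n", mask);
	fclose(f);
	return EXIT_SUCCESS;
}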
> >> 
> >> I believe the scheduler has knowledge about different locality
> >> domains for NUMA and everything else.  So what is wanting on our side
> >> is some architecture? work to do the broad steering by default.
> >
> >
> > well normally this is the job of the userspace IRQ balancer to get
> > right; the thing is undergoing a redesign right now to be smarter and
> > deal better with dual/quad core, numa etc etc, but as a principle
> > thing this is best done in userspace (simply because there's higher
> > level information there, like "is this interrupt for a network
> > device", so that policy can take that into account)
> 
> So far I have seen all of that higher level information in 
> the kernel, and it has to export it to user space for the 
> user space daemon to do anything about it.
> 
> The only time I have seen user space control being useful is 
> when there isn't a proper default policy so at least you can 
> distribute things between the cache domains properly.  So far 
> I have been more tempted to turn it off as it will routinely 
> change which NUMA node an irq is pointing at, which is not at 
> all ideal for performance, and I haven't seen a way (short of 
> replacing it) to tell the user space irq balancer that it got 
> its default policy wrong.
> 
> It is quite possible I don't have the whole story, but so far 
> it just feels like we are making a questionable decision by 
> pushing things out to user space.  If nothing else it seems 
> to make changes more difficult in the irq handling 
> infrastructure because we have to maintain stable interfaces. 
>  Things like per cpu counters for every irq start to look 
> really questionable when you scale the system up in size.
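
The scaling worry about per-IRQ, per-CPU counters is easy to put rough
numbers on, since the table grows with the product of the two.  The sketch
below is purely back-of-the-envelope; the NR_IRQS value and counter size
are illustrative, not taken from any particular kernel config.

/*
 * Back-of-the-envelope sketch of how per-CPU, per-IRQ counters grow.
 * The constants are illustrative; real kernels size NR_IRQS per arch.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long nr_irqs = 256;                    /* illustrative */
	const unsigned long counter_bytes = sizeof(unsigned int);
	unsigned long cpus;

	for (cpus = 2; cpus <= 1024; cpus *= 2)
		printf("%4lu CPUs: %8lu KiB of IRQ counters\n",
		       cpus, nr_irqs * cpus * counter_bytes / 1024);
	return 0;
}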

I'd also like to question the current policies of the user space
irqbalanced. It seems to just go round-robin without much heuristic
involved. We are seeing loss of timer interrupts on our systems - and the
more processors, the more noticeable it is, but it starts even on 8x
partitions; on a 48x system I see about 50% loss, on both ia32 and x86_64
(haven't checked on ia64 yet). With, say, 16 threads it is unsettling to
see 70% overall idle time while still only 40-50% of interrupts go
through. System time is not affected, so the problem is on the back
burner for now :) It's not clear yet whether this is a software or
hardware fault, and how much damage it does (performance-wise etc.). I
will provide more information as it becomes available; I just wanted to
raise a red flag in case others also see such phenomena.
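
One way to quantify that kind of loss is sketched below: sample the
per-CPU timer counts in /proc/interrupts twice across a known wall-clock
interval and compare the delta against the expected number of ticks.
Which row represents the timer (IRQ 0 versus the local APIC "LOC" line)
and the tick rate both depend on the platform, so the "LOC:" match and
HZ=250 here are assumptions to adjust for the system under test.

/*
 * Sketch: sum the per-CPU counts on the "LOC:" line of /proc/interrupts,
 * wait a known interval, sum again, and compare the delta with the
 * expected tick count.  The timer line and HZ value are assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long timer_ticks(void)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[4096];
	unsigned long long sum = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		char *p = line;
		unsigned long long v;
		int n;

		while (*p == ' ')
			p++;
		if (strncmp(p, "LOC:", 4) != 0)   /* assumption: local APIC timer */
			continue;
		p += 4;
		while (sscanf(p, " %llu%n", &v, &n) == 1) {
			sum += v;
			p += n;
		}
		break;
	}
	fclose(f);
	return sum;
}

int main(void)
{
	const unsigned interval = 10;   /* seconds to sample over */
	const unsigned hz = 250;        /* assumed tick rate */
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	unsigned long long before, delta, expected;

	if (ncpus < 1)
		ncpus = 1;
	before = timer_ticks();
	sleep(interval);
	delta = timer_ticks() - before;
	expected = (unsigned long long)hz * interval * ncpus;

	printf("got %llu timer interrupts, expected about %llu (%.0f%%)\n",
	       delta, expected, 100.0 * delta / expected);
	return 0;
}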
Thanks,
--Natalie
