Message-ID: <alpine.LFD.2.00.0906170903370.2800@localhost.localdomain>
Date: Wed, 17 Jun 2009 10:45:38 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: LKML <linux-kernel@...r.kernel.org>
cc: rt-users <linux-rt-users@...r.kernel.org>,
Ingo Molnar <mingo@...e.hu>,
Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <peterz@...radead.org>,
Carsten Emde <ce@...g.ch>,
Clark Williams <williams@...hat.com>,
Frank Rowand <frank.rowand@...sony.com>,
Robin Gareus <robin@...eus.org>,
Gregory Haskins <ghaskins@...ell.com>,
Philippe Reynes <philippe.reynes@...smpp.fr>,
Fernando Lopez-Lezcano <nando@...ma.Stanford.EDU>,
Will Schmidt <will_schmidt@...t.ibm.com>,
Darren Hart <dvhltc@...ibm.com>, Jan Blunck <jblunck@...e.de>,
Sven-Thorsten Dietrich <sdietrich@...ell.com>,
Jon Masters <jcm@...hat.com>
Subject: [ANNOUNCE] 2.6.29.5-rt21
We are pleased to announce the next update to our new preempt-rt
series.
- update to 2.6.29.5 (2.6.29.5-rt20, which I uploaded yesterday but
did not announce due to the findings below)
- softirq: lower default priority below hardirq default priority
This fixes a long-standing default priority configuration problem of
the -rt series. On UP machines this can result in the net_tx softirq
running in an endless loop and starving the irq threads, the other
softirq threads and of course everything with lower priority. It might
also happen on an SMP machine when the hardirq thread affinities are
tweaked in the right way.
What happens is:

tx interrupt
   lock(card->tx_lock);
   dev_kfree_skb_any(skb);
      blocks on a contended lock

net_tx softirq runs
   unlocks contended lock but does not schedule away due to equal prio
   repeat:
   calls xmit
      try_lock(card->tx_lock) fails
         -> reschedule skb which keeps net_tx running
   goto repeat;

The scheduler does not schedule away net_tx, so this goes on forever.
The problem has always been there, but it seems to be easier to
trigger in the 2.6.29 -rt series, probably because of the slab cache
lock breaks we did.
The problem is restricted to a dozen wireless adapters and network
cards, of which e1000e is the most popular one. We could patch the
affected drivers for -rt, but we need to have a closer look at the
general assumptions of drivers vs. hardirq/softirq. Note, this is not
a mainline problem as the semantics are entirely correct there.
Lowering the priorities of the softirq threads below the hardirq
threads priorities is a safe workaround for now. It prevents the
runaway scenario under all circumstances as it resembles the mainline
semantics closely.
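
The runaway scenario and the effect of the workaround can be illustrated
with a toy model. This is only a sketch, not kernel code: it assumes a
single CPU and a strict-priority scheduler in which a woken thread
preempts the running one only if its priority is strictly higher. The
function name, the retry cap and the priority values 50 and 49 are
illustrative assumptions, not values taken from the patch.

```python
def simulate_net_tx(irq_prio, softirq_prio, max_retries=1000):
    """Toy model of the net_tx retry loop on one CPU.

    Returns the number of failed try_lock attempts before net_tx gets
    the tx lock, or None if the retry cap is hit (the runaway case).
    All names and numbers here are hypothetical.
    """
    # The irq thread holds card->tx_lock and was blocked on a second lock.
    tx_lock_held_by_irq = True
    # net_tx has just released that second lock, so the irq thread is
    # runnable again, but it is not running yet.
    irq_runnable = True

    retries = 0
    while retries < max_retries:
        # Strict priority scheduling: the woken irq thread preempts
        # net_tx only if its priority is strictly higher.
        if irq_runnable and irq_prio > softirq_prio:
            # The irq thread runs to completion and drops tx_lock.
            tx_lock_held_by_irq = False
            irq_runnable = False

        if not tx_lock_held_by_irq:
            return retries  # try_lock(card->tx_lock) succeeds

        # try_lock fails, the skb is rescheduled, net_tx keeps running.
        retries += 1

    return None


equal = simulate_net_tx(irq_prio=50, softirq_prio=50)
lowered = simulate_net_tx(irq_prio=50, softirq_prio=49)
print("equal prio:", equal)      # None: the retry cap was reached
print("lowered prio:", lowered)  # 0: the irq thread ran first
```

With equal priorities the model never lets the irq thread run, so the
loop only ends at the artificial cap. With the softirq priority one step
lower, the irq thread preempts immediately and releases the lock.
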
For all existing -rt systems the problem can be solved w/o patching
the kernel by adjusting the priority of the softirq threads from the
init scripts with chrt.
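
As a rough sketch of what such an init script step could look like, the
following builds the chrt command lines from `ps -eo pid,comm` output.
The "sirq-" thread name prefix, the sample ps listing and the target
priority 49 are assumptions for illustration. Check the thread names and
the hardirq thread priorities on your own system before using anything
like this.

```python
SOFTIRQ_PREFIX = "sirq-"  # assumption: name prefix of the softirq threads
TARGET_PRIO = 49          # assumption: one below the hardirq thread priority


def softirq_pids(ps_output, prefix=SOFTIRQ_PREFIX):
    """Extract the PIDs of softirq threads from `ps -eo pid,comm` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, comm = fields
        if comm.strip().startswith(prefix):
            pids.append(int(pid))
    return pids


def chrt_commands(pids, prio=TARGET_PRIO):
    """Build one `chrt -f -p <prio> <pid>` command per thread (SCHED_FIFO)."""
    return [["chrt", "-f", "-p", str(prio), str(pid)] for pid in pids]


# Hypothetical ps listing; real PIDs and thread names will differ.
SAMPLE_PS = """\
  PID COMMAND
    4 sirq-high/0
    6 sirq-net-tx/0
   42 IRQ-16
"""

for cmd in chrt_commands(softirq_pids(SAMPLE_PS)):
    print(" ".join(cmd))
```

The sketch only prints the commands. Running them requires root, and the
hardirq threads must keep a higher priority than the value chosen here.
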
It's extremely hard to trigger this and we never had a report of it
before. I want to say thanks to Bernd Oelker, who meticulously
worked on reproducing the problem and debugging it with all the evil
methods and patches I could come up with. And no, I'm not going to
tell you which nasty hacks made it possible to decode this :)
Download locations:
http://rt.et.redhat.com/download/
http://www.kernel.org/pub/linux/kernel/projects/rt/
Information on the RT patch can be found at:
http://rt.wiki.kernel.org/index.php/Main_Page
To build the 2.6.29.5-rt21 tree, the following patches should be
applied:
http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.29.5.tar.bz2
http://www.kernel.org/pub/linux/kernel/projects/rt/patch-2.6.29.5-rt21.bz2
The broken out patches are also available at the same download
locations.
Enjoy !
tglx