lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 08 Jul 2007 22:08:23 -0700
From:	Fernando Lopez-Lezcano <nando@...ma.Stanford.EDU>
To:	Gabriel C <nix.or.die@...glemail.com>
Cc:	nando@...ma.Stanford.EDU, Carsten Emde <carsten.emde@...dl.org>,
	"jcaceres@...ma.Stanford.EDU" <jcaceres@...ma.Stanford.EDU>,
	Steven Rostedt <rostedt@...dmis.org>,
	RT-Users <linux-rt-users@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Rui Nuno Capela <rncbc@...bc.org>,
	Ingo Molnar <mingo@...e.hu>
Subject: Re: v2.6.21.5-rt19 (sched_getaffinity?)

On Sun, 2007-07-08 at 20:53 -0700, Fernando Pablo Lopez-Lezcano wrote: 
> On Mon, 9 Jul 2007, Gabriel C wrote:
> > Fernando Lopez-Lezcano wrote:
> >> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> >>> * Fernando Lopez-Lezcano <nando@...ma.Stanford.EDU> wrote:
> >>>>>> Changes since 2.6.21.5-rt18:
> >>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
> >>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> >>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
> >>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
> >>>>>> and in consequence a stale wrong scheduler selection for the affected 
> >>>>>> idle
> >>>>>> task.
> >>>>>> 
> >>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
> >>>>>> problem and provided the traces, which allowed to identify the root 
> >>>>>> cause.
> >>>>>> 
> >>>>>> Problem solution: Prevent idle task boosting
> >>>>>> 
> >>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on 
> >>>>> my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> >>>>> 
> >>>>> Althought I'm still with my fingers crossed, I can tell the good news 
> >>>>> are that 2.6.21.5-rt19 (and -rt20) does behave far better now on the 
> >>>>> very same box.
> >>>>> 
> >>>> Yes, it works much better indeed...
> >>>> 
> >>>> Ingo: is there a place where I can read about the changes in different 
> >>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff 
> >>>> in a diff from rt19 to rt20 but I don't really know what it means).
> >>>> 
> >>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when CFS 
> >>> was merged.
> >>> 
> >>> i _think_ Rui might have seen two separate problems. Perhaps by the time 
> >>> we fixed the first problem (which Rui saw since -rt2) we introduced the 
> >>> other one via -rt11 - which then got fixed in -rt19.
> >> 
> >> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> >> really trying to provide a "stable" rt kernel for audio usage and
> >> including another subsystem into rt is - IMHO - not going to help.
> >> What's the chance of splitting things?
> >> 
> >>> btw., we'd love to get more feedback regarding CFS. CFS is a completely 
> >>> new scheduler for Linux. 
> >> 
> >> Then I'd rather have it separate from rt.
> >> 
> >>> It has a design centered around keeping application latencies down, so it 
> >>> is ultimately real-time friendly, and it should also make things work 
> >>> better for desktop-ish and audio-ish stuff as well. (even under 
> >>> SCHED_OTHER)
> >>> 
> >> 
> >> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> >> list):
> >> 
> >> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> >> 
> >>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> >>> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
> >>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> >>> problem.
> >>> 
> >> I managed to completely hang firefox (fc7) with flash 9 installed
> >> (unkillable even with -9).
> >
> > Firefox with flash 9 does not work good , there are a lot bugs reported 
> > about ( just google ) and it hangs on vanilla or whatever other kernels 
> > as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.
> >
> > The most time Firefox dies when you use flash 9 and close a window or a 
> > tab.
> 
> More tests...
> 
> The problem is the rt kernel AFAICT, this goes beyond Flash 9, way 
> beyond:
> 
> _OpenOffice_ hangs with 2.6.21.5-rt20, works fine with stock Fedora 7 
> kernel. Flash 9 hangs with 2.6.21.5-rt20, works fine with the stock Fedora 
> 7 kernel. Same machine booting different kernels, I'd say it is the 
> kernel.
> 
> The only way out for a hung app is a reboot.
> 
> Ingo: what would be a good way to trace this? It makes the rt kernels not 
> very usable at least on this hardware (more tests tomorrow in the CCRMA 
> machines).
> 
> Same on 2.6.21.5-rt18 with CONFIG_NO_HZ not set.

I forgot to include the output of strace... and of course now I can't
repeat the openoffice hang. 

I do get flash 9 (I know, not the best example) and tomboy to hang as
reported by one of my Planet CCRMA users - flash 9 tested working on
stock fedora 7 kernel - and both seem to hang in the same system call:

sched_getaffinity(3528, 32,  <unfinished ...>

Full output of strace attached for both cases. 

Hopefully this will make the bug immediately obvious to someone :-)
[running on a laptop with the 7700 Intel cpu]
-- Fernando


Download attachment "firefox.trace.gz" of type "application/x-gzip" (155285 bytes)

Download attachment "tomboy.trace.gz" of type "application/x-gzip" (2960 bytes)

Powered by blists - more mailing lists