Message-ID: <alpine.LNX.2.00.0908282240480.32738@cinke.fazekas.hu>
Date: Sat, 29 Aug 2009 16:15:14 +0200 (CEST)
From: Marton Balint <cus@...ekas.hu>
To: Ingo Molnar <mingo@...e.hu>
cc: Peter Zijlstra <peterz@...radead.org>,
Andreas Mohr <andi@...as.de>, linux-kernel@...r.kernel.org
Subject: Re: CPU scheduler weirdness?

On Thu, 20 Aug 2009, Marton Balint wrote:
>
>
> On Thu, 20 Aug 2009, Ingo Molnar wrote:
>
>>
>> * Marton Balint <cus@...ekas.hu> wrote:
>>
>>>
>>> On Wed, 19 Aug 2009, Peter Zijlstra wrote:
>>>
>>>> On Wed, 2009-08-19 at 14:34 +0200, Marton Balint wrote:
>>>>>
>>>>> On Wed, 19 Aug 2009, Peter Zijlstra wrote:
>>>>>
>>>>>> On Wed, 2009-08-19 at 14:01 +0200, Marton Balint wrote:
>>>>>>> On Wed, 19 Aug 2009, Peter Zijlstra wrote:
>>>>>>>> On Tue, 2009-08-18 at 21:49 +0200, Marton Balint wrote:
>>>>>>>>
>>>>>>>>> In the meantime, I was able to create a tiny C program which always
>>>>>>>>> successfully reproduces the bug. It's basically an endless loop which
>>>>>>>>> does not stop while the process is running on the last CPU core. The
>>>>>>>>> program creates multiple instances of itself, to be able to keep all
>>>>>>>>> of the CPU cores busy. After 1 second, the processes running on other
>>>>>>>>> than the last CPU core die, the processes running on the last CPU
>>>>>>>>> core remain stuck there...
>>>>>>>>>
>>>>>>>>> I tested it on my dual core system, if someone could test it on a
>>>>>>>>> quad core and report back that would probably be useful.
>>>>>>>>>
>>>>>>>>> Usage: ./schedtest <number of CPU cores>
>>>>>>>>>
>>>>>>>>> And don't forget to kill the stuck processes after using the
>>>>>>>>> program! :)
>>>>>>>>
>>>>>>>> So what's the bug? Sure one task will stay on the cpu, and because
>>>>>>>> there is no contention it doesn't get migrated, and therefore won't
>>>>>>>> quit, how's that a problem?
>>>>>>>
>>>>>>> Problem is that more than one process remains on that CPU core, and
>>>>>>> none of them gets migrated to other (idle) cores. I tested it with my
>>>>>>> E8400 processor and 2.6.31-rc5-git3 kernel.
>>>>>>
>>>>>> Only one remains here.. on a c2q running 2.6.31-rc6-tip
>>>>>>
>>>>>> Do you have a .config handy?
>>>>>>
>>>>>
>>>>> Yes it's in my original post:
>>>>>
>>>>> http://marc.info/?l=linux-kernel&m=125012584709800&w=2
>>>>
>>>> Right you are... so I built a kernel with the cgroup scheduler in and
>>>> tested it on a dual-core opteron machine, but I can't seem to reproduce
>>>> this.
>>>>
>>>> Are you using cgroups in any way, or do you simply have it enabled in
>>>> your config?
>>>
>>> No, it's just enabled. Actually the kernel is from the
>>> openSUSE build service:
>>>
>>> http://download.opensuse.org/repositories/Kernel:/HEAD/openSUSE_11.1/x86_64/
>>>
>>> But the problem is present for both the kernel-default
>>> kernel and the kernel-vanilla kernel which does not
>>> contain any suse-specific patches.
>>>
>>> This evening I had a bit more time to test, and I've
>>> made a surprising discovery: I can only reproduce the
>>> bug if the kernel module of my TV tuner card is loaded.
>>> I have a Leadtek Winfast 2000 XP Expert TV card, it
>>> uses the cx8800 kernel module. It seems that the
>>> problem is somehow related to the infrared sensor of
>>> the TV card, because I recompiled the module with the
>>> 'case CX88_BOARD_WINFAST2000XP_EXPERT:' line removed
>>> from cx88-input.c and I couldn't reproduce the bug with
>>> the new kernel module.
>>
>> Extremely weird. Are timers somehow busted?
>
> How can I check that?
>
> In the meantime, I updated my original C program and also created a kernel
> module (schedtest_mod.c) which causes the same scheduling problems as the
> kernel module of my TV card. The kernel module is a skeleton of the infrared
> sensor polling code in cx88-input.c. It uses schedule_delayed_work, which
> seems to cause the problem. The C program (schedtest.c) is also updated: it
> now detects the number of CPU cores itself, and the command line parameter
> is now the number of the CPU core on which the schedtest processes will not
> quit (previously this was always the last core).
>
> So to reproduce the bug on a dual core system, compile and insert the kernel
> module (schedtest_mod.c). Then check dmesg, which should show which CPU core
> the delayed_work is running on. Use the id of the _other_ CPU core as the
> command line parameter to the updated schedtest program.
>
> And by the way, thank you guys for the help so far, hopefully we'll get to
> the bottom of this :)

I reproduced the bug with the previously provided kernel module and
C program on a different computer (a laptop with a Core 2 Duo P8400
CPU), and also bisected the bug to this commit:

  sched: fine-tune SD_MC_INIT:
  14800984706bf6936bbec5187f736e928be5c218

If I add the removed SD_BALANCE_NEWIDLE back to the flags, everything
works as expected. So what would be the correct fix for this bug?
Revert the patch? Or just add SD_BALANCE_NEWIDLE to the flags?
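
For anyone reading without the attachments, the user-space side of the
reproducer described in the quoted text can be sketched roughly as
follows. This is a minimal sketch under stated assumptions, not the
attached schedtest.c: the helper names (seconds_since,
spin_until_off_cpu, online_cpus) and the max_sec safety bound are made
up for illustration, whereas the program described in the thread loops
forever while it stays on the chosen core.

```c
#define _GNU_SOURCE           /* needed for sched_getcpu() */
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed since *start on the monotonic clock. */
static double seconds_since(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - start->tv_sec)
         + (double)(now.tv_nsec - start->tv_nsec) / 1e9;
}

/*
 * Busy-loop for as long as the calling process is observed on sticky_cpu.
 * Once grace_sec has elapsed, return the id of the first other core the
 * process is seen on. max_sec is a safety bound added for this sketch
 * (an assumption, not part of the program described in the thread).
 * Returns -1 if the bound is reached or sched_getcpu() fails.
 */
static int spin_until_off_cpu(int sticky_cpu, double grace_sec, double max_sec)
{
    struct timespec start;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        double elapsed = seconds_since(&start);
        int cpu = sched_getcpu();

        if (cpu < 0 || elapsed >= max_sec)
            return -1;
        if (elapsed >= grace_sec && cpu != sticky_cpu)
            return cpu;
    }
}

/* Number of cores currently online, as reported by sysconf(). */
static long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

A driver in the spirit of the description would fork one child per
online core, have each child call spin_until_off_cpu() with a grace
period of 1 second, and then check with top or ps how many children
are still spinning on the chosen core.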
Regards,
Marton