linux-kernel - Re: current linux-2.6.git: cpusets completely broken


Message-ID: <4877BD66.30802@qualcomm.com>
Date:	Fri, 11 Jul 2008 13:07:02 -0700
From:	Max Krasnyansky <maxk@...lcomm.com>
To:	Vegard Nossum <vegard.nossum@...il.com>
CC:	Paul Menage <menage@...gle.com>,
	Dmitry Adamushko <dmitry.adamushko@...il.com>,
	Paul Jackson <pj@....com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>, miaox@...fujitsu.com,
	rostedt@...dmis.org, Thomas Gleixner <tglx@...utronix.de>,
	Linux Kernel <linux-kernel@...r.kernel.org>
Subject: Re: current linux-2.6.git: cpusets completely broken

Vegard Nossum wrote:
> On Fri, Jul 11, 2008 at 9:36 PM, Paul Menage <menage@...gle.com> wrote:
>> On Fri, Jul 11, 2008 at 12:07 PM, Vegard Nossum <vegard.nossum@...il.com> wrote:
>>> The result of having CPUSETS enabled as above is a 100% reproducible
>>> BUG on the very first cpu hot-unplug:
>>>
>>> ------------[ cut here ]------------
>>> kernel BUG at xxx/linux-2.6/kernel/sched.c:5859!
>> That doesn't quite match up with any BUG in 2.6.26-rc9 - what tree is
>> this last crash based on?
> 
> latest mainline. Commit e5a5816f7875207cb0a0a7032e39a4686c5e10a4.
> 
> Is this one:
> 
> /* called under rq->lock with disabled interrupts */
> static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
> {
>         struct rq *rq = cpu_rq(dead_cpu);
> 
>         /* Must be exiting, otherwise would be on tasklist. */
>         BUG_ON(!p->exit_state);
> 
>>> Also, this is on the latest linux-2.6.git! Since we're so close to
>>> release, maybe cpusets should simply be marked BROKEN for now? (Unless
>>> we can fix it, of course. The alternative is to apply Miao Xie's
>>> workaround patch temporarily.)
>> If we were going to mark anything as broken, wouldn't cpu-hotplug be
>> the more appropriate victim? I suspect that there are more systems
>> using cpusets in production environments than using cpu hotplug. But
>> as you say, fixing it sounds better.
> 
> I'm sorry for the harsh characterization and suggestion; please accept
> my apology. It was purely a result of my excitement at having made
> some progress in this case.
> 
> But I have more good news; reverting this:
> 
> commit f18f982abf183e91f435990d337164c7a43d1e6d
> Author: Max Krasnyansky <maxk@...lcomm.com>
> Date:   Thu May 29 11:17:01 2008 -0700
> 
>     sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
> 
>     First issue is not related to the cpusets. We're simply leaking doms_cur.
>     It's allocated in arch_init_sched_domains() which is called for every
>     hotplug event. So we just keep reallocation doms_cur without freeing it.
>     I introduced free_sched_domains() function that cleans things up.
> 
>     Second issue is that sched domains created by the cpusets are
>     completely destroyed by the CPU hotplug events. For all CPU hotplug
>     events scheduler attaches all CPUs to the NULL domain and then puts
>     them all into the single domain thereby destroying domains created
>     by the cpusets (partition_sched_domains).
>     The solution is simple, when cpusets are enabled scheduler should not
>     create default domain and instead let cpusets do that. Which is
>     exactly what the patch does.
> 
>     Signed-off-by: Max Krasnyansky <maxk@...lcomm.com>
>     Cc: pj@....com
>     Cc: menage@...gle.com
>     Cc: rostedt@...dmis.org
>     Acked-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
>     Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> 
> gets rid of the BUG! (Added people to Ccs.)
Really? Just from looking at the backtraces in your first email, it seems
unrelated.

> Might I instead suggest a revert of this? (Again, unless somebody else
> can spot the real error and fix it before 2.6.26 is out :-))
I'd actually be OK with reverting it. Paul and I were looking into some
circular locking issues triggered by the very same patch. Since we do
not have a solution yet, we could revert it for now and work on a fix
during the .27-rc series.

Max
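
The first issue described in the quoted commit message (doms_cur being
reallocated on every hotplug event without the old buffer being freed) can
be illustrated with a small user-space sketch. This is not the kernel code:
alloc_doms(), free_sched_domains_sketch(), hotplug_event_leaky(),
hotplug_event_fixed() and the live_allocs counter are hypothetical names
made up for illustration, and only doms_cur and the idea of a
free_sched_domains() cleanup helper come from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins, not the kernel's real types. */
static unsigned long *doms_cur;  /* current array of domain masks */
static int live_allocs;          /* allocations not yet freed (illustration only) */

/* Allocate n zeroed masks and count the allocation. */
static unsigned long *alloc_doms(int n)
{
	unsigned long *p = calloc((size_t)n, sizeof(*p));

	if (p)
		live_allocs++;
	return p;
}

/* Assumed shape of a cleanup helper: release doms_cur if it is set. */
static void free_sched_domains_sketch(void)
{
	if (doms_cur) {
		free(doms_cur);
		live_allocs--;
	}
	assert(live_allocs >= 0);
	doms_cur = NULL;
}

/* Leaky pattern: every event overwrites doms_cur and loses the old buffer. */
static int hotplug_event_leaky(void)
{
	doms_cur = alloc_doms(1);
	return doms_cur ? 0 : -1;
}

/* Fixed pattern: release the previous buffer before allocating a new one. */
static int hotplug_event_fixed(void)
{
	free_sched_domains_sketch();
	doms_cur = alloc_doms(1);
	return doms_cur ? 0 : -1;
}
```

Calling hotplug_event_fixed() repeatedly keeps at most one buffer alive,
while each call to hotplug_event_leaky() leaves one more buffer unreachable.
The sketch says nothing about the BUG_ON in migrate_dead() or the circular
locking issues mentioned above.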

