linux-kernel - Scheduler grouping failure; division by zero in select_task_rq

Message-ID: <1290975266.3292.316.camel@localhost>
Date:	Sun, 28 Nov 2010 20:14:26 +0000
From:	Ben Hutchings <benh@...ian.org>
To:	Frede_Feuerstein@....net, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <peterz@...radead.org>
Cc:	603229@...s.debian.org, LKML <linux-kernel@...r.kernel.org>
Subject: Scheduler grouping failure; division by zero in select_task_rq_fair

On Sun, 2010-11-28 at 06:00 +0100, Frede Feuerstein wrote:
[...]
> > The division by zero appears to be a result of getting bad information
> > from the firmware about the groups of processors.
> 
> Well, technically a division error always is a result of bad data fed to
> that division. I rather meant, that this is the point to backtrace the
> error.
> Though the bios of the w2100z is known for some problems, the cpus are
> reported correctly by the bios and it is the latest version (R01-B5-S1).
> 
> >   I realise that this
> > same bad information did not previously result in a crash, but I (and
> > the upstream developers) need to know what that information is before we
> > can understand how this can be avoided.
> 
> Are there any means to gather more information ? Tell me and i shall do
> it. 

I think this is now enough information.

Ingo, Peter, the output from scheduler domain/group setup was:

[    0.536554] CPU0 attaching sched-domain:
[    0.540004]  domain 0: span 0-1 level MC
[    0.548002]   groups: 0 1
[    0.560003]   domain 1: span 0-3 level NODE
[    0.568002]    groups:
[    0.574179] ERROR: domain->cpu_power not set
[    0.576002]
[    0.580002] ERROR: groups don't span domain->span
[    0.584004] CPU1 attaching sched-domain:
[    0.588007]  domain 0: span 0-1 level MC
[    0.596002]   groups: 1 0 (cpu_power = 1023)
[    0.612002] ERROR: parent span is not a superset of domain->span
[    0.616003]   domain 1: span 1-3 level CPU
[    0.624002]    groups: 1 (cpu_power = 2048) 2-3 (cpu_power = 2048)
[    0.644003]    domain 2: span 0-3 level NODE
[    0.652004]     groups: 1-3 (cpu_power = 4096)
[    0.668002] ERROR: domain->cpu_power not set
[    0.672002]
[    0.676002] ERROR: groups don't span domain->span
[    0.680004] CPU2 attaching sched-domain:
[    0.684003]  domain 0: span 2-3 level MC
[    0.692003]   groups: 2 3
[    0.704003]   domain 1: span 1-3 level CPU
[    0.712003]    groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
[    0.736003]    domain 2: span 0-3 level NODE
[    0.744003]     groups: 1-3 (cpu_power = 4096)
[    0.760003] ERROR: domain->cpu_power not set
[    0.764003]
[    0.768003] ERROR: groups don't span domain->span
[    0.772004] CPU3 attaching sched-domain:
[    0.776003]  domain 0: span 2-3 level MC
[    0.784003]   groups: 3 2
[    0.794183]   domain 1: span 1-3 level CPU
[    0.800003]    groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
[    0.822183]    domain 2: span 0-3 level NODE
[    0.828003]     groups: 1-3 (cpu_power = 4096)
[    0.842180] ERROR: domain->cpu_power not set
[    0.844003]
[    0.848003] ERROR: groups don't span domain->span

and the oops is:

[    0.852154] divide error: 0000 [#1] SMP
[    0.856002] last sysfs file:
[    0.856002] CPU 1
[    0.856002] Modules linked in:
[    0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z
[    0.856002] RIP: 0010:[<ffffffff810416e9>]  [<ffffffff810416e9>] select_task_rq_fair+0x665/0x800
[    0.856002] RSP: 0018:ffff88003fdb7c90  EFLAGS: 00010046
[    0.856002] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    0.856002] RDX: 0000000000000000 RSI: 0000000000000200 RDI: 0000000000000200
[    0.856002] RBP: ffff88004120fd50 R08: 0000000000000000 R09: ffff88007f98f0b0
[    0.856002] R10: 0000000000000000 R11: 00000000000252d0 R12: ffff88007f98f060
[    0.856002] R13: ffff88007f98f070 R14: ffffffffffffffff R15: 0000000000015780
[    0.856002] FS:  0000000000000000(0000) GS:ffff880041200000(0000) knlGS:0000000000000000
[    0.856002] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    0.856002] CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
[    0.856002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.856002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.856002] Process kthreadd (pid: 2, threadinfo ffff88003fdb6000, task ffff88003fdc8710)
[    0.856002] Stack:
[    0.856002]  0000000000015780 0000000000015780 0000000000015780 0000000000015780
[    0.856002] <0> 0000000000015780 0000000000015788 0000000000015788 ffffffff8146c260
[    0.856002] <0> 0000000800000000 ffff88007f9b0000 ffff880041215780 0000000081317f88
[    0.856002] Call Trace:
[    0.856002]  [<ffffffff8104d2b2>] ? copy_process+0x1007/0x115f
[    0.856002]  [<ffffffff810475f4>] ? select_task_rq+0xb/0x3e
[    0.856002]  [<ffffffff8104b53b>] ? wake_up_new_task+0x35/0xf6
[    0.856002]  [<ffffffff8104d65e>] ? do_fork+0x254/0x31e
[    0.856002]  [<ffffffff81041aa9>] ? pick_next_task_fair+0xca/0xd6
[    0.856002]  [<ffffffff8104802b>] ? finish_task_switch+0x3a/0xaf
[    0.856002]  [<ffffffff81011b42>] ? kernel_thread+0x82/0xe0
[    0.856002]  [<ffffffff810648c8>] ? kthread+0x0/0x81
[    0.856002]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[    0.856002]  [<ffffffff8106488d>] ? kthreadd+0xb1/0xec
[    0.856002]  [<ffffffff814f3140>] ? early_idt_handler+0x0/0x71
[    0.856002]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[    0.856002]  [<ffffffff814f3140>] ? early_idt_handler+0x0/0x71
[    0.856002]  [<ffffffff810dfda5>] ? do_set_mempolicy+0x128/0x13a
[    0.856002]  [<ffffffff810647dc>] ? kthreadd+0x0/0xec
[    0.856002]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[    0.856002] Code: 00 02 00 00 4c 89 ef 48 63 d2 e8 0f c6 14 00 3b 05 ad 33 49 00 89 c2 0f 8c 6f ff ff ff 41 8b 4c 24 08 48 c1 e3 0a 31 d2 48 89 d8 <48> f7 f1 83 bc 24 a8 00 00 00 00 48 89 c1 75 22 4c 39 f0 73 15
[    0.856002] RIP  [<ffffffff810416e9>] select_task_rq_fair+0x665/0x800
[    0.856002]  RSP <ffff88003fdb7c90>
[    0.856002] ---[ end trace a22d306b065d4a66 ]---
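
Reading the Code: line by hand, the faulting bytes <48> f7 f1 decode as
"div %rcx", and RCX is zero in the register dump. The bytes just before
it load a 32-bit value from 8(%r12) into ECX and shift RBX left by 10.
That would be consistent with a load being scaled by 1024 and then
divided by a group's cpu_power that was never set, which matches the
"cpu_power not set" errors above. The following is only a minimal sketch
of that pattern with a guard added; the struct, the function and the
LOAD_SCALE name are hypothetical and are neither the kernel's code nor a
proposed fix.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOAD_SCALE 1024u  /* assumption: mirrors a 1 << 10 fixed-point scale */

/* Hypothetical stand-in for a scheduler group; not the kernel's struct. */
struct group_sketch {
    uint32_t cpu_power;   /* 0 means "never initialised" in this sketch */
};

/*
 * Scale a load by the group's power. Returns false and leaves *out
 * untouched when cpu_power is 0, instead of executing the division.
 */
static bool scaled_load(const struct group_sketch *g, uint64_t load,
                        uint64_t *out)
{
    if (g->cpu_power == 0)
        return false;
    *out = (load * LOAD_SCALE) / g->cpu_power;
    return true;
}
```

Without the check, a group whose cpu_power is 0 makes the division trap
regardless of the load value, which is what the zero RBX and RCX in the
dump suggest happened here.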

There's more information in the bug log at <http://bugs.debian.org/603229>.
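
For reference, the "groups don't span domain->span" errors in the
sched-domain output earlier in this message can be reproduced on paper.
Taking the printed group lists at face value (they may be cut short where
the cpu_power error interrupts them), the NODE domain on CPU1 to CPU3 has
span 0-3 but its only listed group is 1-3. A minimal sketch of such a
coverage check follows; both helpers are hypothetical and use plain
64-bit masks rather than the kernel's cpumask type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bitmask of CPUs lo..hi inclusive (lo <= hi < 64). */
static uint64_t cpu_range(unsigned lo, unsigned hi)
{
    uint64_t upto_hi = (hi >= 63) ? UINT64_MAX
                                  : ((UINT64_C(1) << (hi + 1)) - 1);
    uint64_t below_lo = (UINT64_C(1) << lo) - 1;
    return upto_hi & ~below_lo;
}

/* True when the union of the group masks equals the domain span exactly. */
static bool groups_span_domain(uint64_t span, const uint64_t *groups, int n)
{
    uint64_t covered = 0;
    for (int i = 0; i < n; i++)
        covered |= groups[i];
    return covered == span;
}
```

With span = cpu_range(0, 3) and a single group cpu_range(1, 3), the check
fails because CPU0 is not covered by any group.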

If you think this has been fixed since 2.6.32 (I didn't see any relevant
changes) then we have a package of 2.6.36 which Frede can test.

Ben.

-- 
Ben Hutchings, Debian Developer and kernel team member

