Message-ID: <1260825420.2217.40.camel@pasglop>
Date:	Tue, 15 Dec 2009 08:17:00 +1100
From:	Benjamin Herrenschmidt <benh@...nel.crashing.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Sachin Sant <sachinp@...ibm.com>,
	Linux/PPC Development <linuxppc-dev@...abs.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...e.hu>, linux-next@...r.kernel.org
Subject: Re: [Next] CPU Hotplug test failures on powerpc

On Mon, 2009-12-14 at 13:19 +0100, Peter Zijlstra wrote:

> > >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> > >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> > >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> > >>     sp: c00000000c933650
> > >>    msr: 8000000000089032
> > >>   current = 0xc00000000c173840
> > >>   paca    = 0xc000000000bc2600
> > >>     pid   = 2602, comm = hotplug06.top.s
> > >> enter ? for help
> > >> [link register   ] c000000000342f10 .cpumask_next_and+0x4c/0x94
> > >> [c00000000c933650] c0000000000e9f34 .cpuset_cpus_allowed_locked+0x38/0x74 (unreliable)
> > >> [c00000000c9336e0] c000000000090074 .move_task_off_dead_cpu+0xc4/0x1ac
> > >> [c00000000c9337a0] c0000000005e4e5c .migration_call+0x304/0x830
> > >> [c00000000c933880] c0000000005e0880 .notifier_call_chain+0x68/0xe0
> > >> [c00000000c933920] c00000000012a92c ._cpu_down+0x210/0x34c
> > >> [c00000000c933a90] c00000000012aad8 .cpu_down+0x70/0xa8
> > >> [c00000000c933b20] c000000000525940 .store_online+0x54/0x894
> > >> [c00000000c933bb0] c000000000463430 .sysdev_store+0x3c/0x50
> > >> [c00000000c933c20] c0000000001f8320 .sysfs_write_file+0x124/0x18c
> > >> [c00000000c933ce0] c00000000017edac .vfs_write+0xd4/0x1fc
> > >> [c00000000c933d80] c00000000017efdc .SyS_write+0x58/0xa0
> > >> [c00000000c933e30] c0000000000085b4 syscall_exit+0x0/0x40
> > >> --- Exception: c01 (System Call) at 00000fff9fa8a8f8
> > >> SP (fffe7aef200) is in userspace
> > >> 0:mon> e
> > >> cpu 0x0: Vector: 100 (System Reset) at [c00000000c9333d0]
> > >>     pc: c0000000003433d8: .find_next_bit+0x54/0xc4
> > >>     lr: c000000000342f10: .cpumask_next_and+0x4c/0x94
> > >>     sp: c00000000c933650
> > >>    msr: 8000000000089032
> > >>   current = 0xc00000000c173840
> > >>   paca    = 0xc000000000bc2600
> > >>     pid   = 2602, comm = hotplug06.top.s
> > >>
> 
> OK so how do I read that above thing? What's a System Reset? Is that
> like the x86 triple fault thing?

Nah, it's an NMI that throws you into xmon. Basically, the machine was
hung and Sachin interrupted it with an NMI to see what was going on. The
above is the backtrace. At the moment of the NMI it was inside
find_next_bit(), called from cpumask_next_and(), etc.
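
To make the top two frames easier to read, here is a rough userspace
model of how a cpumask_next_and()-style walk sits on top of a
find_next_bit()-style scan. This is only a sketch: the model_* names,
the single unsigned long mask and NR_CPU_IDS = 16 are assumptions for
illustration, not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>

#define NR_CPU_IDS 16u  /* hypothetical CPU count for this model */

/*
 * Simplified find_next_bit(): index of the first set bit at or after
 * `offset`, or `size` if there is none.
 */
static unsigned int model_find_next_bit(unsigned long mask, unsigned int size,
                                        unsigned int offset)
{
    /* The model keeps the whole mask in one word. */
    assert(size <= sizeof(unsigned long) * CHAR_BIT);

    for (unsigned int bit = offset; bit < size; bit++)
        if (mask & (1UL << bit))
            return bit;
    return size;
}

/*
 * Simplified cpumask_next_and(): next CPU after `n` that is set in both
 * masks, or NR_CPU_IDS if there is none. Pass n = -1 to start at CPU 0.
 */
static unsigned int model_cpumask_next_and(int n, unsigned long a,
                                           unsigned long b)
{
    unsigned int cpu = model_find_next_bit(a, NR_CPU_IDS,
                                           (unsigned int)(n + 1));

    /* Skip CPUs that are set in `a` but not in `b`. */
    while (cpu < NR_CPU_IDS && !(b & (1UL << cpu)))
        cpu = model_find_next_bit(a, NR_CPU_IDS, cpu + 1);
    return cpu;
}
```

In this model an empty intersection yields NR_CPU_IDS, which is the
"nothing found" value that the dest_cpu >= nr_cpu_ids test in the
quoted function checks for.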

> >From what I can make of it, its in move_task_off_dead_cpu(), right after
> having called cpuset_cpus_allowed_locked(), doing that cpumask_any_and()
> call.

Yes, it looks like it.

> static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> {
>         int dest_cpu;
>         const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(dead_cpu));
> 
> again:
>         /* Look for allowed, online CPU in same node. */
>         for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
>                 if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
>                         goto move;
> 
>         /* Any allowed, online CPU? */
>         dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
>         if (dest_cpu < nr_cpu_ids)
>                 goto move;
> 
>         /* No more Mr. Nice Guy. */
>         if (dest_cpu >= nr_cpu_ids) {
>                 cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> ====>           dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> 
>                 /*
>                  * Don't tell them about moving exiting tasks or
>                  * kernel threads (both mm NULL), since they never
>                  * leave kernel.
>                  */
>                 if (p->mm && printk_ratelimit()) {
>                         pr_info("process %d (%s) no longer affine to cpu%d\n",
>                                 task_pid_nr(p), p->comm, dead_cpu);
>                 }
>         }
> 
> move:
>         /* It can have affinity changed while we were choosing. */
>         if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
>                 goto again;
> }
> 
> Both masks, p->cpus_allowed and cpu_active_mask are stable in that p
> won't go away since we hold the tasklist_lock (in migrate_list_tasks),
> and cpu_active_mask is static storage, so WTH is it going funny on?

Sachin, this is 100% reproducible, right? You should be able to
sprinkle it with some xmon_printf() calls rather than printk (just add a
prototype, extern void xmon_printf(const char *fmt, ...);, somewhere).
This has the advantage of being fully synchronous, and it will print out
even if the printk sem is held.
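
As a rough idea of where the prints could go, here is a userspace model
of the destination-selection loop with a trace call at each decision
point. Everything in it is an assumption made for illustration:
model_xmon_printf() is a stderr stand-in for xmon_printf(), the masks
are single words, the node-local step is left out, MAX_RETRIES exists
only so the model terminates, and the migrate step is assumed to fail
exactly when no valid destination was found. It is not a diagnosis of
the hang.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define NR_CPU_IDS  8u  /* hypothetical CPU count */
#define MAX_RETRIES 4   /* bound so the model terminates; not in the kernel */

static int trace_calls; /* number of trace lines emitted so far */

/* Stand-in for xmon_printf(): same prototype shape, writes to stderr. */
static void model_xmon_printf(const char *fmt, ...)
{
    va_list ap;

    assert(fmt != NULL);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    trace_calls++;
}

/* First CPU set in both masks, or NR_CPU_IDS if the intersection is empty. */
static unsigned int any_and(unsigned long a, unsigned long b)
{
    unsigned long both = a & b;

    for (unsigned int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (both & (1UL << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/* Returns the chosen CPU, or NR_CPU_IDS after MAX_RETRIES failed rounds. */
static unsigned int model_pick_dest(unsigned long allowed,
                                    unsigned long active,
                                    unsigned long cpuset_allowed)
{
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        unsigned int dest = any_and(allowed, active);

        model_xmon_printf("attempt %d: allowed=%#lx active=%#lx dest=%u\n",
                          attempt, allowed, active, dest);

        if (dest >= NR_CPU_IDS) {
            /* Models cpuset_cpus_allowed_locked() widening the mask. */
            allowed = cpuset_allowed;
            dest = any_and(active, allowed);
            model_xmon_printf("  fallback: allowed=%#lx dest=%u\n",
                              allowed, dest);
        }

        if (dest < NR_CPU_IDS)
            return dest;    /* models a successful migrate */
        /* Otherwise fall through: models "goto again". */
    }
    return NR_CPU_IDS;
}
```

If the trace shows the same masks and dest on every round, that would
point at the retry path rather than at find_next_bit() itself.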

Cheers,
Ben.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

