Message-ID: <54F9F3D7.1030905@oracle.com>
Date: Fri, 06 Mar 2015 11:37:11 -0700
From: David Ahern <david.ahern@...cle.com>
To: Mike Galbraith <efault@....de>
CC: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: NMI watchdog triggering during load_balance
On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic. I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.
While responding earlier today I realized that the topology is all wrong,
as you were pointing out. There should be 16 NUMA domains (4 memory
controllers per socket and 4 sockets). There should be 8 sibling cores.
I will look into why that is not getting set up properly and what we can
do about fixing it.
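The counts above follow from simple arithmetic on the hardware description. A minimal C sketch, assuming one NUMA node per memory controller and cores spread evenly across the controllers; the struct and function names are made up for illustration and are not kernel interfaces:

```c
#include <assert.h>

/* Hypothetical description of the box discussed in this thread. */
struct box_topology {
	unsigned sockets;            /* physical cpus */
	unsigned cores_per_socket;
	unsigned threads_per_core;
	unsigned memctrl_per_socket; /* assumption: one NUMA node each */
};

/* Logical cpus the scheduler has to deal with. */
static unsigned total_cpus(const struct box_topology *t)
{
	return t->sockets * t->cores_per_socket * t->threads_per_core;
}

/* Expected NUMA node count under the one-node-per-controller assumption. */
static unsigned numa_nodes(const struct box_topology *t)
{
	return t->sockets * t->memctrl_per_socket;
}

/* Logical cpus per node, assuming an even spread of cores over nodes. */
static unsigned cpus_per_node(const struct box_topology *t)
{
	unsigned nodes = numa_nodes(t);

	assert(nodes != 0);
	return total_cpus(t) / nodes;
}
```

With 4 sockets, 32 cores per socket, 8 threads per core and 4 memory controllers per socket this gives 1024 cpus, 16 nodes and 64 cpus per node.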
--
But, I do not understand how the wrong topology is causing the NMI
watchdog to trigger. In the end there are still N domains, M groups per
domain and P cpus per group. Doesn't the balancing walk over all of them
irrespective of physical topology?
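To make the N/M/P question concrete, here is a toy count of how many cpus one balancing pass would look at if it scanned every cpu of every group at every level. This is not the kernel's load balancer, only a sketch; the struct, the function and any level layout fed into it are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* One level of a simplified domain hierarchy (toy model only). */
struct toy_level {
	const char *name;
	unsigned groups;         /* M: groups in one domain at this level */
	unsigned cpus_per_group; /* P: cpus in each of those groups */
};

/*
 * Sum of cpus examined when a single cpu balances once at each of
 * the n levels, assuming every group is scanned in full.
 */
static unsigned long cpus_scanned(const struct toy_level *lv, size_t n)
{
	unsigned long total = 0;
	size_t i;

	assert(lv != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += (unsigned long)lv[i].groups * lv[i].cpus_per_group;
	return total;
}
```

For a three-level layout of 8 groups of 1 cpu, 8 groups of 8 cpus and 16 groups of 64 cpus, the sum is 8 + 64 + 1024 = 1096. The model only counts cpus visited; it says nothing about how long the rq locks are held while doing so.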
Here's another data point that gelled this morning while explaining the
problem to someone: the NMI watchdog trips on a mass exit:
TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC: <cpumask_next_and+0x44/0x6c>
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7: <double_rq_lock+0x4c/0x68>
Call Trace:
[000000000045dc30] double_rq_lock+0x4c/0x68
[000000000046a23c] load_balance+0x278/0x740
[00000000008aa178] __schedule+0x378/0x8e4
[00000000008aab1c] schedule+0x68/0x78
[00000000004718ac] do_exit+0x798/0x7c0
[000000000047195c] do_group_exit+0x88/0xc0
[0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
[000000000042cbc0] do_signal+0x70/0x5e4
[000000000042d14c] do_notify_resume+0x18/0x50
[00000000004049c4] __handle_signal+0xc/0x2c
For example, the stream program has 1024 threads (1 for each CPU). If you
ctrl-c the program or wait for it to terminate, that's when it trips. Other
workloads that routinely trip it are 'make -j N' for some N (e.g.,
'make -j 128' on a 256 cpu system): 10 seconds later, oops, stop that build,
ctrl-c ... boom, with the above stack trace.
Code wise ... and this is still present in 3.18 and 3.20:
schedule()
- __schedule()
  + irqs disabled: raw_spin_lock_irq(&rq->lock);
      pick_next_task
      - idle_balance()
  + irqs enabled:
      different task: context_switch(rq, prev, next)
                      --> finish_lock_switch eventually
      same task: raw_spin_unlock_irq(&rq->lock) or
For 2.6.39 it's the invocation of idle_balance which is triggering load
balancing with IRQs disabled. That's when the NMI watchdog trips.
I'll pound on 3.18 and see if I can reproduce something similar there.
David