linux-kernel - Re: NMI watchdog triggering during load

Message-ID: <54F9F3D7.1030905@oracle.com>
Date:	Fri, 06 Mar 2015 11:37:11 -0700
From:	David Ahern <david.ahern@...cle.com>
To:	Mike Galbraith <efault@....de>
CC:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: NMI watchdog triggering during load_balance

On 3/6/15 11:11 AM, Mike Galbraith wrote:
> That was the question, _do_ you have any control, because that topology
> is toxic.  I guess your reply means 'nope'.
>
>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.
>
> Thank god I've never met one of these, looks like the box from hell :)
>
>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.
>
> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.

In responding earlier today I realized that the topology is all wrong, as 
you were pointing out. There should be 16 NUMA domains (4 memory 
controllers per socket and 4 sockets). There should be 8 sibling cores. 
I will look into why that is not getting set up properly and what we can 
do about fixing it.
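
As a quick sanity check of those numbers, here is a minimal Python sketch. It is not from the thread: the class and field names are made up for illustration, and the even spread of CPUs across memory controllers is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of the box; names are illustrative only."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    mem_controllers_per_socket: int

    @property
    def total_cpus(self) -> int:
        # Every hardware thread shows up as one logical CPU.
        return self.sockets * self.cores_per_socket * self.threads_per_core

    @property
    def numa_nodes(self) -> int:
        # One NUMA domain per memory controller, as stated in the message.
        return self.sockets * self.mem_controllers_per_socket

    @property
    def cpus_per_node(self) -> int:
        # Assumption: CPUs are spread evenly over the memory controllers.
        return self.total_cpus // self.numa_nodes


box = Topology(sockets=4, cores_per_socket=32, threads_per_core=8,
               mem_controllers_per_socket=4)
print(box.total_cpus, box.numa_nodes, box.cpus_per_node)
```

The totals line up with the 16 NUMA domains mentioned above and with the 1024-thread stream run (one thread per CPU) described later in the message.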

--

But I do not understand how the wrong topology is causing the NMI 
watchdog to trigger. In the end there are still N domains, M groups per 
domain and P cpus per group. Doesn't the balancing walk over all of them 
irrespective of physical topology?
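
One way to make the question concrete is a deliberately simplified counting model. This is not the kernel's load_balance(); it only assumes that a balancing pass at each domain level looks at every CPU of every group in that domain. The function name and the example hierarchy (in particular, 8 cores behind each memory controller) are assumptions for illustration.

```python
def cpus_examined(levels):
    """Count the CPUs one pass looks at when it walks every domain level.

    levels: list of (groups_per_domain, cpus_per_group) tuples,
            innermost level first.
    """
    return sum(groups * cpus for groups, cpus in levels)


# Hypothetical hierarchy for the box described earlier in the thread.
expected = [
    (8, 1),    # SMT level: 8 sibling threads of one core
    (8, 8),    # node level: 8 cores of 8 threads each (assumed split)
    (16, 64),  # NUMA level: 16 nodes of 64 CPUs each
]

# A single flat level spanning the whole machine, for comparison.
flat = [(1, 1024)]

print(cpus_examined(expected), cpus_examined(flat))
```

In this toy model the outermost level dominates the count either way, which is the intuition behind the question; it says nothing about lock hold times or how often each level is balanced.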

Here's another data point that jelled this morning while explaining the 
problem to someone: the NMI watchdog trips on a mass exit:

TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fffffffffffffff g1: 00000000000000ff g2: 0000000000070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 000000010036c000 g6: fffe803b84108000 g7: 0000000000000093
o0: 0000000000000fe0 o1: 0000000000000fe0 o2: ffffff0000000000 o3: 0000000000200200
o4: 0000000000a98080 o5: 0000000000000000 sp: fffe803b8410ada1 ret_pc: 00000000006800dc
RPC: <cpumask_next_and+0x44/0x6c>
l0: 0000000000e9b114 l1: 0000000000000001 l2: 0000000000000001 l3: 0000000000000005
l4: 0000000000002000 l5: fffe803b8410b990 l6: 0000000000000004 l7: 0000000000f267b0
i0: 0000000100b10700 i1: 00000000ffffffff i2: 0000000101324d80 i3: fffe803b8410b6c0
i4: 0000000000000038 i5: 0000000000000498 i6: fffe803b8410ae51 i7: 000000000045dc30
I7: <double_rq_lock+0x4c/0x68>
Call Trace:
  [000000000045dc30] double_rq_lock+0x4c/0x68
  [000000000046a23c] load_balance+0x278/0x740
  [00000000008aa178] __schedule+0x378/0x8e4
  [00000000008aab1c] schedule+0x68/0x78
  [00000000004718ac] do_exit+0x798/0x7c0
  [000000000047195c] do_group_exit+0x88/0xc0
  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
  [000000000042cbc0] do_signal+0x70/0x5e4
  [000000000042d14c] do_notify_resume+0x18/0x50
  [00000000004049c4] __handle_signal+0xc/0x2c


For example, the stream program has 1024 threads (1 for each CPU). If you 
ctrl-c the program or wait for it to terminate, that's when it trips. 
Another workload that routinely trips it is 'make -j N' for some N (e.g., 
'make -j 128' on a 256 cpu system): 10 seconds later, oops, stop that 
build, ctrl-c ... boom, with the above stack trace.

Code-wise ... and this is still present in 3.18 and 3.20:

schedule()
- __schedule()
   + irqs disabled: raw_spin_lock_irq(&rq->lock);

      pick_next_task
      - idle_balance()

   + irqs enabled:
     different task: context_switch(rq, prev, next)
                     --> finish_lock_switch eventually
     same task: raw_spin_unlock_irq(&rq->lock)


For 2.6.39 it's the invocation of idle_balance which is triggering load 
balancing with IRQs disabled. That's when the NMI watchdog trips.
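
The ordering in the outline above can be mimicked with a toy Python model. This is a sketch only, not kernel code: the class is hypothetical, it borrows the function names from the outline purely as labels, and it ignores the actual locking and the context_switch path. It just records whether IRQs are enabled at the moment each step runs.

```python
class ToyRunqueue:
    """Toy stand-in for a runqueue; tracks IRQ state and a call log only."""

    def __init__(self):
        self.irqs_enabled = True
        self.log = []  # (step name, irqs_enabled at that moment)

    def _record(self, name):
        self.log.append((name, self.irqs_enabled))

    def raw_spin_lock_irq(self):
        self.irqs_enabled = False
        self._record("raw_spin_lock_irq")

    def raw_spin_unlock_irq(self):
        self._record("raw_spin_unlock_irq")
        self.irqs_enabled = True

    def idle_balance(self):
        # In the real system this is where the long walk would happen.
        self._record("idle_balance")

    def pick_next_task(self, runnable):
        self._record("pick_next_task")
        if not runnable:
            # Nothing to run locally: try to pull work from elsewhere.
            self.idle_balance()

    def schedule(self, runnable=False):
        self.raw_spin_lock_irq()       # IRQs go off here
        self.pick_next_task(runnable)
        self.raw_spin_unlock_irq()     # "same task" path in the outline


rq = ToyRunqueue()
rq.schedule(runnable=False)
for name, irqs_enabled in rq.log:
    print(f"{name:22s} irqs_enabled={irqs_enabled}")
```

In the log, idle_balance shows up with irqs_enabled=False, which is the window described above: however long the balancing pass takes, it is spent with interrupts off.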

I'll pound on 3.18 and see if I can reproduce something similar there.

David