linux-kernel - Re: NMI watchdog triggering during load_balance


Message-ID: <20150307093647.GP23367@worktop.ger.corp.intel.com>
Date:	Sat, 7 Mar 2015 10:36:47 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	David Ahern <david.ahern@...cle.com>
Cc:	Mike Galbraith <efault@....de>, Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: NMI watchdog triggering during load_balance

On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting setup properly and what we can do about fixing it.

So we changed the numa topology setup a while back; see commit
cb83b629bae0 ("sched/numa: Rewrite the CONFIG_NUMA sched domain
support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; so for regular load balancing only the first CPU in the
domain will iterate up.

So if you have 4 'nodes' only 4 CPUs will iterate the entire machine,
not all 1024.
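
To make the arithmetic concrete, here is a toy model (not kernel code) that
counts how many CPUs would keep balancing above their own group if only the
first CPU of each group iterates up. The function names, the contiguous CPU
numbering and the equal-sized groups are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Toy model, not kernel code. Assumes CPUs are numbered 0..nr_cpus-1 and
 * split into nr_groups contiguous, equal-sized groups ('nodes').
 */
static int toy_first_cpu_of_group(int cpu, int cpus_per_group)
{
	return cpu - (cpu % cpus_per_group);
}

/* Count the CPUs that would continue balancing above their own group. */
static int toy_count_top_level_balancers(int nr_cpus, int nr_groups)
{
	int cpus_per_group, cpu, count = 0;

	assert(nr_groups > 0 && nr_cpus % nr_groups == 0);
	cpus_per_group = nr_cpus / nr_groups;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* Only the first CPU of each group iterates up. */
		if (cpu == toy_first_cpu_of_group(cpu, cpus_per_group))
			count++;
	}
	return count;
}
```

With 1024 CPUs in 4 groups, toy_count_top_level_balancers(1024, 4) returns 4;
with the expected 16 NUMA domains it would return 16.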



> Call Trace:
>  [000000000045dc30] double_rq_lock+0x4c/0x68
>  [000000000046a23c] load_balance+0x278/0x740
>  [00000000008aa178] __schedule+0x378/0x8e4
>  [00000000008aab1c] schedule+0x68/0x78
>  [00000000004718ac] do_exit+0x798/0x7c0
>  [000000000047195c] do_group_exit+0x88/0xc0
>  [0000000000481148] get_signal_to_deliver+0x3ec/0x4c8
>  [000000000042cbc0] do_signal+0x70/0x5e4
>  [000000000042d14c] do_notify_resume+0x18/0x50
>  [00000000004049c4] __handle_signal+0xc/0x2c
> 
> 
> For example the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'), 10 seconds later oops stop that build, ctrl-c
> ... boom with the above stack trace.
> 
> Code wise ... and this is still present in 3.18 and 3.20:
> 
> schedule()
> - __schedule()
>   + irqs disabled: raw_spin_lock_irq(&rq->lock);
> 
>      pick_next_task
>      - idle_balance()

> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE, only domains with that
set will get iterated.

I suppose you could try something like the below on 3.18, which will
disable SD_BALANCE_NEWIDLE on all 'distant' nodes; but first check how
your fixed NUMA topology looks and whether you trigger that case at all.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}
 

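As a rough illustration of what clearing SD_BALANCE_NEWIDLE on a level does to
the new-idle walk, here is a toy model (not kernel code). The struct, the flag
value and the three-level chain are all hypothetical stand-ins; the only thing
taken from the discussion above is that levels without the flag are skipped.

```c
#include <stddef.h>

/* Stand-in flag; deliberately not the kernel's definition or value. */
#define TOY_SD_BALANCE_NEWIDLE 0x1u

/* Hypothetical, minimal stand-in for a sched domain level. */
struct toy_domain {
	unsigned int flags;
	struct toy_domain *parent;
};

/* Walk up from the lowest level; count levels that would be balanced. */
static int toy_newidle_levels(const struct toy_domain *sd)
{
	int balanced = 0;

	for (; sd; sd = sd->parent) {
		if (sd->flags & TOY_SD_BALANCE_NEWIDLE)
			balanced++;
	}
	return balanced;
}

/* Build a core -> node -> distant-NUMA chain, optionally clearing the top. */
static int toy_example(int clear_distant)
{
	struct toy_domain numa = { TOY_SD_BALANCE_NEWIDLE, NULL };
	struct toy_domain node = { TOY_SD_BALANCE_NEWIDLE, &numa };
	struct toy_domain core = { TOY_SD_BALANCE_NEWIDLE, &node };

	if (clear_distant)
		numa.flags &= ~TOY_SD_BALANCE_NEWIDLE;

	return toy_newidle_levels(&core);
}
```

Here toy_example(0) visits all three levels, while toy_example(1) skips the
'distant' one, so a newly idle CPU no longer scans it with IRQs disabled.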