Date:	Mon, 7 Mar 2016 12:38:16 -0500
From:	Chris Metcalf <cmetcalf@...lanox.com>
To:	Peter Zijlstra <peterz@...radead.org>,
	Daniel Thompson <daniel.thompson@...aro.org>
CC:	Russell King <linux@....linux.org.uk>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, Andrew Morton <akpm@...l.org>,
	<linux-kernel@...r.kernel.org>, Aaron Tomlin <atomlin@...hat.com>,
	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	Daniel Lezcano <daniel.lezcano@...aro.org>
Subject: Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle
 cpus

On 03/07/2016 04:48 AM, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 11:01:42AM -0500, Chris Metcalf wrote:
>> +++ b/kernel/sched/idle.c
>> @@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused)
>>   __setup("hlt", cpu_idle_nopoll_setup);
>>   #endif
>> +static DEFINE_PER_CPU(bool, cpu_idling);
>> +
>> +/* Was the cpu in the low-level idle code when interrupted? */
>> +bool in_cpu_idle(void)
>> +{
>> +	return this_cpu_read(cpu_idling);
>> +}
>> +
>>   static inline int cpu_idle_poll(void)
>>   {
>>   	rcu_idle_enter();
>>   	trace_cpu_idle_rcuidle(0, smp_processor_id());
>>   	local_irq_enable();
>>   	stop_critical_timings();
>> +	this_cpu_write(cpu_idling, true);
>>   	while (!tif_need_resched() &&
>>   		(cpu_idle_force_poll || tick_check_broadcast_expired()))
>>   		cpu_relax();
>> +	this_cpu_write(cpu_idling, false);
>>   	start_critical_timings();
>>   	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>>   	rcu_idle_exit();
>> @@ -89,7 +99,9 @@ void default_idle_call(void)
>>   		local_irq_enable();
>>   	} else {
>>   		stop_critical_timings();
>> +		this_cpu_write(cpu_idling, true);
>>   		arch_cpu_idle();
>> +		this_cpu_write(cpu_idling, false);
>>   		start_critical_timings();
>>   	}
>>   }
> No, we're not going to add random crap here. This is actually considered
> a fast path for some workloads.
>
> There's already far too much fat in the whole going to idle and coming
> out of idle. We should be trimming this, not adding to it.

I'm a little skeptical that a single percpu write is going to add much
measurable overhead to this path.  However, we can certainly adopt
alternate approaches that stay away from the actual idle code.

One approach (diff appended) is simply to test whether the PC is
actually in the architecture-specific halt code.  There are two downsides:

1. It requires a small amount of per-architecture support.  I've provided
    the tile support as an example, since that's what I tested.  I expect
    x86 is a little more complicated since there are more idle paths and
    they don't currently run the idle instruction(s) at a fixed address, but
    it's unlikely to be too complicated on any platform.
    Still, adding anything per-architecture is certainly a downside.

2. As proposed, my new alternate solution only handles the non-polling
    case, so if you are in the polling loop, we won't benefit from having
    the NMI backtrace code skip over you.  However, my guess is that 99% of
    the time folks do choose to run the default non-polling mode, so this
    probably still achieves a pretty reasonable outcome.

A different approach that would handle downside #2 and probably make it
easier to implement the architecture-specific code for more complicated
platforms like x86 would be to use the SCHED_TEXT model and tag all the
low-level idling functions as CPUIDLE_TEXT.  Then the "are we idling"
test is just a range compare on the PC against __cpuidle_text_{start,end}.
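
Roughly, the range test could look like this (untested sketch; the
__cpuidle_text_{start,end} symbols are placeholders named by analogy
with __sched_text_{start,end}, since nothing emits them yet):

/*
 * Sketch only: assumes the linker script collects all CPUIDLE_TEXT
 * functions between these two symbols, the same way vmlinux.lds.h
 * already brackets SCHED_TEXT.
 */
extern char __cpuidle_text_start[], __cpuidle_text_end[];

static bool in_cpuidle_text(struct pt_regs *regs)
{
	unsigned long pc = instruction_pointer(regs);

	return pc >= (unsigned long)__cpuidle_text_start &&
	       pc < (unsigned long)__cpuidle_text_end;
}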

We'd have to decide whether to make cpu_idle_poll() non-inline and just
test for being in that function, or whether we could tag all of
cpu_idle_loop() as being CPUIDLE_TEXT and just omit any backtrace
whenever the PC is anywhere in that function.  Obviously if we have
called out to more complicated code (e.g. Daniel's concern about calling
out to power management code) the PC would no longer be in the CPUIDLE_TEXT
range at that point, so that might be OK too.
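
On the tagging side, I'd imagine something like this (strawman only;
the name and section are placeholders, modeled on the __sched
annotation in include/linux/sched.h):

/* Collect idle functions into their own text section. */
#define __cpuidle __attribute__((__section__(".cpuidle.text")))

/* Then e.g. in kernel/sched/idle.c: */
static int __cpuidle cpu_idle_poll(void)
{
	/* ... existing body unchanged ... */
	return 1;
}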

Let me know what you think is the right direction here.

Thanks!

diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index 4b7cef9e94e0..93ec51a4853b 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -92,6 +92,9 @@ extern void smp_nap(void);
  /* Enable interrupts racelessly and nap forever: helper for arch_cpu_idle(). */
  extern void _cpu_idle(void);
  
+/* The address of the actual nap instruction. */
+extern long _cpu_idle_nap[];
+
  #else /* __ASSEMBLY__ */
  
  /*
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index b5f30d376ce1..a83a426f1755 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -70,6 +70,11 @@ void arch_cpu_idle(void)
  	_cpu_idle();
  }
  
+bool arch_cpu_in_idle(struct pt_regs *regs)
+{
+	return regs->pc == (unsigned long)_cpu_idle_nap;
+}
+
  /*
   * Release a thread_info structure
   */
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d2ca8c38f9c4..24462927fa49 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -279,6 +279,7 @@ void arch_cpu_idle_prepare(void);
  void arch_cpu_idle_enter(void);
  void arch_cpu_idle_exit(void);
  void arch_cpu_idle_dead(void);
+bool arch_cpu_in_idle(struct pt_regs *);
  
  DECLARE_PER_CPU(bool, cpu_dead_idle);
  
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 544a7133cbd1..d9dbab6526a9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -77,6 +77,7 @@ void __weak arch_cpu_idle(void)
  	cpu_idle_force_poll = 1;
  	local_irq_enable();
  }
+bool __weak arch_cpu_in_idle(struct pt_regs *regs) { return false; }
  
  /**
   * default_idle_call - Default CPU idle routine.
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index db63ac75eba0..bcc4ecc828f2 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -17,6 +17,7 @@
  #include <linux/kprobes.h>
  #include <linux/nmi.h>
  #include <linux/seq_buf.h>
+#include <linux/cpu.h>
  
  #ifdef arch_trigger_cpumask_backtrace
  /* For reliability, we're prepared to waste bits here. */
@@ -151,11 +152,16 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
  
  		/* Replace printk to write into the NMI seq */
  		this_cpu_write(printk_func, nmi_vprintk);
-		pr_warn("NMI backtrace for cpu %d\n", cpu);
-		if (regs)
-			show_regs(regs);
-		else
-			dump_stack();
+		if (regs != NULL && arch_cpu_in_idle(regs)) {
+			pr_warn("NMI backtrace for cpu %d skipped: idle\n",
+				cpu);
+		} else {
+			pr_warn("NMI backtrace for cpu %d\n", cpu);
+			if (regs)
+				show_regs(regs);
+			else
+				dump_stack();
+		}
  		this_cpu_write(printk_func, printk_func_save);
  
  		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
