Date:	Sun, 14 Mar 2010 11:27:53 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Li Zefan <lizf@...fujitsu.com>,
	Lai Jiangshan <laijs@...fujitsu.com>, stable@...nel.org
Subject: Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI
	context

On Fri, Mar 12, 2010 at 09:57:00PM -0500, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@...hat.com>
> 
> A bug was found with Li Zefan's ftrace_stress_test that caused applications
> to segfault during the test.
> 
> Placing a tracing_off() in the segfault code and examining several
> traces, I found that the following was always the case: the lock tracer
> was enabled (lockdep being required) and userstacktrace was enabled.
> Testing this out, I enabled just those two, but that was not enough; I
> needed to run something else that could trigger it. Running a load like
> hackbench did not work, but executing a new program would. The following
> would trigger the segfault within seconds:
> 
>   # echo 1 > /debug/tracing/options/userstacktrace
>   # echo 1 > /debug/tracing/events/lock/enable
>   # while :; do ls > /dev/null ; done
> 
> Enabling the function graph tracer and looking at what was happening,
> I finally noticed that all crashes happened just after an NMI.
> 
>  1)               |    copy_user_handle_tail() {
>  1)               |      bad_area_nosemaphore() {
>  1)               |        __bad_area_nosemaphore() {
>  1)               |          no_context() {
>  1)               |            fixup_exception() {
>  1)   0.319 us    |              search_exception_tables();
>  1)   0.873 us    |            }
> [...]
>  1)   0.314 us    |  __rcu_read_unlock();
>  1)   0.325 us    |    native_apic_mem_write();
>  1)   0.943 us    |  }
>  1)   0.304 us    |  rcu_nmi_exit();
> [...]
>  1)   0.479 us    |  find_vma();
>  1)               |  bad_area() {
>  1)               |    __bad_area() {
> 
> After capturing several traces of failures, I saw that all of them
> happened just after an NMI. Curious about this, I added a trace_printk()
> to the NMI handler to print regs->ip and see where the NMI happened. It
> turned out to be here:
> 
> ffffffff8135b660 <page_fault>:
> ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
> ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>
> 
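(The probe was presumably something along the lines of

	trace_printk("NMI ip: %pS\n", (void *)regs->ip);

in the NMI handler; a hypothetical reconstruction, the patch below does
not include it.)
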
> What was happening was that the NMI fired right where a page fault had
> just occurred. The NMI handler called rcu_read_lock(), which was traced
> by the lock events, so the user stack trace would run. That in turn
> triggered a page fault inside the NMI. I do not see anywhere in the NMI
> handling where the CR2 register is saved and restored, which means the
> nested fault would corrupt the page fault handling that the NMI
> interrupted.
> 
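To spell out the race as I read it (numbering mine, illustrative):

   1. a user access faults and the CPU latches the fault address in CR2
   2. the NMI fires before do_page_fault() has read CR2
   3. the NMI path takes rcu_read_lock(), the lock event fires, and with
      userstacktrace enabled it tries to record the user stack
   4. the unwinder touches user memory and faults inside the NMI
   5. that nested fault overwrites CR2
   6. the NMI returns, the interrupted fault handler reads the wrong
      address, finds the wrong (or no) vma, and the application segfaults
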
> The reason the while loop of ls helped trigger the bug is that each
> execution of ls faults in lots of pages, increasing the chances of the
> race happening.
> 
> The simple solution is to not allow user stack traces in NMI context.
> After this patch, I ran the above "ls" test for a couple of hours
> without any issues. Without this patch, the bug would trigger in less
> than a minute.
> 
> Cc: stable@...nel.org
> Reported-by: Li Zefan <lizf@...fujitsu.com>
> Signed-off-by: Steven Rostedt <rostedt@...dmis.org>



Wow, that's a race :)

In perf this is dealt with by a special copy_from_user_nmi()
(see arch/x86/kernel/cpu/perf_event.c).

Maybe save_stack_trace_user() should use that instead of a
__copy_from_user_inatomic()-based thing, just to cover such
NMI corner-case races.
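
For reference, the idea there (a sketch from memory, the real code is
in the file above, so details may differ) is to never fault at all: pin
the user page with __get_user_pages_fast(), which fails rather than
faults when the page is not present, and copy through an atomic kmap:

	static unsigned long
	copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
	{
		unsigned long offset, addr = (unsigned long)from;
		unsigned long size, len = 0;
		struct page *page;
		void *map;

		do {
			/* gup_fast never faults, it just fails */
			if (!__get_user_pages_fast(addr, 1, 0, &page))
				break;

			offset = addr & (PAGE_SIZE - 1);
			size   = min(PAGE_SIZE - offset, n - len);

			map = kmap_atomic(page, KM_NMI);
			memcpy(to, map + offset, size);
			kunmap_atomic(map, KM_NMI);
			put_page(page);

			len  += size;
			to   += size;
			addr += size;
		} while (len < n);

		return len;	/* bytes actually copied */
	}

That would also keep the userstack option usable under NMI instead of
just bailing out, though bailing out is certainly the safer fix for
stable.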



> ---
>  kernel/trace/trace.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 484337d..e52683f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
>  	if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
>  		return;
>  
> +	/*
> +	 * NMIs cannot handle page faults, even with fixups.
> +	 * Saving the user stack can (and often does) fault.
> +	 */
> +	if (unlikely(in_nmi()))
> +		return;
> +
>  	event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
>  					  sizeof(*entry), flags, pc);
>  	if (!event)
> -- 
> 1.7.0
> 
> 
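For what it's worth, the in_nmi() test added above is cheap; it is just
a preempt_count() mask (include/linux/hardirq.h):

	#define in_nmi()	(preempt_count() & NMI_MASK)

so the tracing fast path only pays for a load and a branch.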
