Message-ID: <520f0cf11003141505l607f2f3bxa36a9cffd82cab50@mail.gmail.com>
Date: Sun, 14 Mar 2010 23:05:08 +0100
From: John Kacur <jkacur@...hat.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Andrew Morton <akpm@...ux-foundation.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Li Zefan <lizf@...fujitsu.com>,
Lai Jiangshan <laijs@...fujitsu.com>, stable@...nel.org
Subject: Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI
context
On Sat, Mar 13, 2010 at 3:57 AM, Steven Rostedt <rostedt@...dmis.org> wrote:
> From: Steven Rostedt <srostedt@...hat.com>
>
> A bug was found with Li Zefan's ftrace_stress_test that caused applications
> to segfault during the test.
>
> Placing a tracing_off() in the segfault code, and examining several
> traces, I found that the following was always the case. The lock tracer
> was enabled (lockdep being required) and userstack was enabled. Testing
> this out, I just enabled the two, but that was not good enough. I needed
> to run something else that could trigger it. Running a load like hackbench
> did not work, but executing a new program would. The following would
> trigger the segfault within seconds:
>
> # echo 1 > /debug/tracing/options/userstacktrace
> # echo 1 > /debug/tracing/events/lock/enable
> # while :; do ls > /dev/null ; done
>
> Enabling the function graph tracer and looking at what was happening,
> I finally noticed that all crashes happened just after an NMI.
>
> 1) | copy_user_handle_tail() {
> 1) | bad_area_nosemaphore() {
> 1) | __bad_area_nosemaphore() {
> 1) | no_context() {
> 1) | fixup_exception() {
> 1) 0.319 us | search_exception_tables();
> 1) 0.873 us | }
> [...]
> 1) 0.314 us | __rcu_read_unlock();
> 1) 0.325 us | native_apic_mem_write();
> 1) 0.943 us | }
> 1) 0.304 us | rcu_nmi_exit();
> [...]
> 1) 0.479 us | find_vma();
> 1) | bad_area() {
> 1) | __bad_area() {
>
> After capturing several traces of failures, I saw that all of them
> happened just after an NMI. Curious about this, I added a trace_printk()
> to the NMI handler to read regs->ip and see where the NMI happened,
> and found that it was here:
>
> ffffffff8135b660 <page_fault>:
> ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp
> ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800 <error_entry>
>
> What was happening was that the NMI would fire at the point where a page
> fault had just occurred. The NMI handler would call rcu_read_lock(),
> which was traced by the lock events, and the user stack trace would run.
> This would trigger a page fault inside the NMI. I do not see where the
> CR2 register is
> saved or restored in NMI handling. This means that it would corrupt
> the page fault handling that the NMI interrupted.
>
> The reason the while loop of ls helped trigger the bug was that each
> execution of ls faulted in lots of pages, increasing the chances of
> the race happening.
>
> The simple solution is to not allow user stack traces in NMI context.
> After this patch, I ran the above "ls" test for a couple of hours
> without any issues. Without this patch, the bug would trigger in less
> than a minute.
>
> Cc: stable@...nel.org
> Reported-by: Li Zefan <lizf@...fujitsu.com>
> Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> ---
> kernel/trace/trace.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 484337d..e52683f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
> if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
> return;
>
> + /*
> + * NMIs can not handle page faults, even with fixups.
> + * Saving the user stack can (and often does) fault.
> + */
> + if (unlikely(in_nmi()))
> + return;
> +
> event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
> sizeof(*entry), flags, pc);
> if (!event)
> --
> 1.7.0
>
That is some awfully cool detective work. Just one question: if it is
so easy to trigger, why did you wrap it in "unlikely"?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/