Date:   Tue, 13 Jun 2017 16:04:34 -0400
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Will Hawkins <whh8b@...ginia.edu>
Cc:     LKML <linux-kernel@...r.kernel.org>
Subject: Re: Ftrace vs perf user page fault statistics differences

On Tue, 13 Jun 2017 14:02:08 -0400
Will Hawkins <whh8b@...ginia.edu> wrote:

> Thank you for pointing this out. I had been using -F for exactly the
> reason that you mentioned. I failed to include it in the command that
> I sent along. Very sorry for the confusion. Here is an updated version
> of the command that I issued:
> 
> 
> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l handle_mm_fault -F ../one_page_play/page
> 
> and I generated output like
> 
> ./trace-cmd report --profile
> 
> and I see the following (among some other output):
> 
>   Event: page_fault_user:0x7f094f7dd104 (1)
>   Event: page_fault_user:0x4000e0 (1)
>   Event: page_fault_user:0x7f094f7eae4a (1)
>   Event: page_fault_user:0x7f094f860d40 (1)
>   Event: page_fault_user:0x7f094f7db560 (1)
>   Event: page_fault_user:0x4040cb (1)
>   Event: page_fault_user:0x401825 (1)
>   Event: page_fault_user:0x401473 (1)
>   Event: page_fault_user:0x7f094f7e64c4 (1)
>   Event: page_fault_user:0x7f094f7f1212 (1)
> 
> That output comes from under the task: page-<pid> heading, so it seems
> like those faults are being attributed to the page task.
> 
> This command seems to show something interesting:
> 
> sudo ./trace-cmd record -e exceptions:page_fault_user -p function_graph -g __do_fault -F ../one_page_play/page
> 
> and the relevant output from
> 
> ./trace-cmd report --profile
> 
> is
> 
> task: page-4032
>   Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max: 2398(ts:170150.060916) Min: 855(ts:170150.054713)
>   Event: page_fault_user:0x7ffad3143d40 (1)
>   Event: page_fault_user:0x4000e0 (1)
>   Event: page_fault_user:0x401473 (1)
>   Event: page_fault_user:0x7ffad30c94c4 (1)
> 
> This is closer to what I would expect. The first of the two 0x4...
> addresses is the entry point and the second is the target. Basically,
> that is exactly what I expect. The other two are the "suspicious"
> entries. Neither matches the copy_user_enhanced_fast_string symbol
> location, and neither is loaded in the binary (according to gdb).
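> 
> In case it helps: the page program is essentially just an entry point
> that calls a function placed on another page. A minimal sketch of that
> shape (not my exact source), built with something like
> gcc -nostdlib -static:
> 
> 	/* sketch: _start on one text page, target() forced onto another */
> 	__attribute__((aligned(4096), noinline))
> 	void target(void)
> 	{
> 	}
> 
> 	void _start(void)
> 	{
> 		target();
> 		asm("mov $60, %rax; xor %rdi, %rdi; syscall");	/* exit(0), x86-64 */
> 	}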

As you state below, there are faults recorded before the exec. That is
true with trace-cmd (I'm not sure about perf), since trace-cmd does some
work after tracing is started and before the exec is called.

> 
> It is odd to me that the output from the previous command includes
> information about the trace-cmd process since I specified the -F
> option.

Did the trace-cmd process have the same pid as the page program? The -F
option makes it follow the pid, which is still running trace-cmd code
until the exec.
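
To see why, here is a stripped-down sketch of the record-then-exec
pattern (paraphrased, not the actual trace-cmd source):

	#include <unistd.h>
	#include <sys/wait.h>

	int main(int argc, char **argv)
	{
		pid_t pid;

		if (argc < 2)
			return 1;

		pid = fork();
		if (pid == 0) {
			/* Child: still running the tracer's code here.
			 * Page faults taken between fork() and exec are
			 * charged to the very pid that -F follows. */
			execvp(argv[1], &argv[1]);
			_exit(127);
		}
		/* Parent: record events filtered on the child's pid */
		waitpid(pid, NULL, 0);
		return 0;
	}

Everything the child touches before the exec shows up under the traced
pid, even though it is not your program yet.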

Oh, and what versions of the kernel and trace-cmd are you running? It's
best to always use the latest trace-cmd; older versions did have some
bugs.

Also, function_graph can have some residual traces from a sched_switch.

That is, tracing of the task specified by -F is enabled and disabled
per cpu at sched_switch. The sched_switch tracepoint fires before the
actual context switch. When the next task is the one specified by -F,
tracing begins right at the tracepoint, which means the context switch
itself, still running on behalf of the previous task, gets traced too.
Here, in kernel/sched/core.c, in __schedule():

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);


The context_switch() function will be traced even when the previous
task isn't the task from -F.
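
To make that concrete, here is an illustrative sketch of the per-cpu
toggle (not the real kernel code; the actual probe lives in the tracing
core and its details vary between kernel versions):

	/* Illustrative only: the -F pid filter hooks sched_switch and
	 * flips a per-cpu "ignore" flag based on the incoming task. */
	static void sched_switch_probe(struct task_struct *prev,
				       struct task_struct *next)
	{
		/* trace this cpu iff the next task is the -F pid */
		this_cpu_write(ignore_pid, next->pid != filtered_pid);
	}

Because the tracepoint fires before context_switch(), the flag is
already cleared while the previous task finishes the switch, and that
tail end is what shows up as residual traces.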


> 
> But, back to exactly what you asked. Here is the result of running
> perf again with the -a option. I ran this command:
> 
> sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page
> 
>     50.41%     page  ld-2.19.so         [.] do_lookup_x
>                |
>                --- do_lookup_x
> 
>     44.21%     perf  [kernel.kallsyms]  [k] iov_iter_fault_in_readable
>                |
>                --- iov_iter_fault_in_readable
>                    generic_file_buffered_write
>                    __generic_file_aio_write
>                    generic_file_aio_write
>                    ext4_file_write
>                    do_sync_write
>                    vfs_write
>                    sys_write
>                    system_call_fastpath
>                    __write_nocancel
>                    0x4081a5
>                    0x407a40
>                    __libc_start_main
> 
>      4.13%     perf  perf               [.] 0x0000000000015b54
>                |
>                --- 0x415b54
>                    0x4081a5
>                    0x407a40
>                    __libc_start_main
> 
>      0.41%     page  page               [.] _start
>                |
>                --- _start
> 
>      0.41%     page  page               [.] target
>                |
>                --- target
> 
>      0.41%     page  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>                |
>                --- copy_user_enhanced_fast_string
>                    load_elf_binary
>                    search_binary_handler
>                    do_execve_common.isra.23
>                    sys_execve
>                    stub_execve
>                    __execve
> 
> 
> What is interesting is that the output differs based on whether I've
> dropped the kernel caches before I run the perf record. When I do
> that, there are no page faults attributed to the entry point or the
> target of the program. I would imagine that, after dropping caches,
> the readahead handler picks up those pages when the binary is loaded
> and negates the need for a page fault. That, indeed, seems to be the
> case. I can see that when I run perf with an additional
> block:block_rq_issue event. Immediately after dropping the caches,
> there is a block request event. On subsequent executions, there is no
> such event but there are the page faults that I expect.
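> 
> For reference, "dropping the caches" here means the usual drop_caches
> knob, and the run with the extra block event was along these lines
> (exact invocation from memory):
> 
>   echo 3 | sudo tee /proc/sys/vm/drop_caches
>   sudo perf record -e page-faults -e block:block_rq_issue -a ../one_page_play/page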
> 
> What I did notice is that some of the page faults come from the task
> before execve is called. From what I've seen, this looks like the
> kernel reclaiming pages from the spawning process before it is
> replaced with the new binary (during the call to execve). After the
> execve, there seem to be two page faults:
> 
> 
>             page-4613  [006] 171795.748310: funcgraph_entry:        0.151 us   |  mutex_unlock();
>             page-4613  [006] 171795.748313: funcgraph_entry:        0.166 us   |  __fsnotify_parent();
>             page-4613  [006] 171795.748313: funcgraph_entry:        0.321 us   |  fsnotify();
>             page-4613  [006] 171795.748314: funcgraph_entry:        0.090 us   |  __sb_end_write();
>             page-4613  [006] 171795.748317: funcgraph_entry:                   |  trace_do_page_fault() {
>             page-4613  [006] 171795.748317: page_fault_user:      address=__per_cpu_end ip=__per_cpu_end error_code=0x4
>             page-4613  [006] 171795.748319: funcgraph_exit:         2.254 us   |  }
>             page-4613  [006] 171795.748321: funcgraph_entry:                   |  trace_do_page_fault() {
>             page-4613  [006] 171795.748322: page_fault_user:      address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>             page-4613  [006] 171795.748323: funcgraph_exit:         1.144 us   |  }
> 
> NOTICE THIS:
>             page-4613  [006] 171795.748324: funcgraph_entry:                   |  sys_execve() {
>             page-4613  [007] 171795.748391: block_rq_issue:       8,0 R 0 () 764812912 + 16 [trace-cmd]
>             page-4613  [005] 171795.759476: funcgraph_exit:       # 11152.111 us |  }
>             page-4613  [005] 171795.759477: funcgraph_entry:        3.745 us   |  do_notify_resume();
>             page-4613  [005] 171795.759481: funcgraph_entry:                   |  trace_do_page_fault() {
>             page-4613  [005] 171795.759482: page_fault_user:      address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>             page-4613  [005] 171795.759487: funcgraph_exit:         5.833 us   |  }
>             page-4613  [005] 171795.759488: funcgraph_entry:                   |  trace_do_page_fault() {
>             page-4613  [005] 171795.759489: page_fault_user:      address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>             page-4613  [005] 171795.759490: funcgraph_exit:         2.003 us   |  }
>             page-4613  [005] 171795.759492: funcgraph_entry:                   |  sys_exit() {
> 
> I wish that I could get something "better" than "
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those page
> faults. That would really tell me more about whether this is the
> "correct" behavior.

Also, you might want to do a trace-cmd report -l, which lets you see if
the event happened in interrupt context.
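
If the symbol resolution is what's getting in the way (a user-space
address printed as the nearest kernel symbol, which is why you see
__per_cpu_end), a raw dump should show the numeric values of the
address and ip fields instead. If your version of trace-cmd has it:

  trace-cmd report -R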

> 
> As ever, thank you very much for your help! Using these tools has been
> an incredible learning experience. I still think that I am just
> missing something stupid, but I really appreciate your patience.

No no, you got me curious as well. Although the only page faults I see
here are the ones from before the exec and from the actual program.
Could you run with -F on your program, along with the other options you
have, and post a full:

  trace-cmd report -l

Then you can ask about whatever shows up there that you don't
understand.

-- Steve
