Date:   Tue, 13 Jun 2017 21:27:20 -0400
From:   Will Hawkins <whh8b@...ginia.edu>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     LKML <linux-kernel@...r.kernel.org>
Subject: Re: Ftrace vs perf user page fault statistics differences

On Tue, Jun 13, 2017 at 4:04 PM, Steven Rostedt <rostedt@...dmis.org> wrote:
> On Tue, 13 Jun 2017 14:02:08 -0400
> Will Hawkins <whh8b@...ginia.edu> wrote:
>
>> Thank you for pointing this out. I had been using -F for exactly the
>> reason that you mentioned. I failed to include it in the command that
>> I sent along. Very sorry for the confusion. Here is an updated version
>> of the command that I issued:
>>
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>> handle_mm_fault -F ../one_page_play/page
>>
>> and I generated output like
>>
>> ./trace-cmd report --profile
>>
>> and I see the following (among some other output):
>>
>>   Event: page_fault_user:0x7f094f7dd104 (1)
>>   Event: page_fault_user:0x4000e0 (1)
>>   Event: page_fault_user:0x7f094f7eae4a (1)
>>   Event: page_fault_user:0x7f094f860d40 (1)
>>   Event: page_fault_user:0x7f094f7db560 (1)
>>   Event: page_fault_user:0x4040cb (1)
>>   Event: page_fault_user:0x401825 (1)
>>   Event: page_fault_user:0x401473 (1)
>>   Event: page_fault_user:0x7f094f7e64c4 (1)
>>   Event: page_fault_user:0x7f094f7f1212 (1)
>>
>> That output comes from under the task: page-<pid> heading, so it seems
>> like those faults are being attributed to the page task.
>>
>> This command seems to show something interesting:
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -p
>> function_graph -g __do_fault -F ../one_page_play/page
>>
>> and the relevant output from
>>
>> ./trace-cmd report --profile
>>
>> is
>>
>> task: page-4032
>>   Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
>> 2398(ts:170150.060916) Min:855(ts:170150.054713)
>>   Event: page_fault_user:0x7ffad3143d40 (1)
>>   Event: page_fault_user:0x4000e0 (1)
>>   Event: page_fault_user:0x401473 (1)
>>   Event: page_fault_user:0x7ffad30c94c4 (1)
>>
>> This is closer to what I would expect. The first of the two 0x4...
>> addresses is the entry point and the second is the target. Basically,
>> that is exactly what I expect. The other two are the "suspicious"
>> entries. Neither matches the copy_user_enhanced_fast_string symbol
>> location, and neither falls within the binary itself (according to gdb).
>
> As you state below, there are faults recorded before the exec. That is
> true with trace-cmd (I'm not sure about perf), since trace-cmd does some
> work after tracing is started and before the exec is called.
>
>>
>> It is odd to me that the output from the previous command includes
>> information about the trace-cmd process since I specified the -F
>> option.
>
> Did the trace-cmd process have the same pid as the page program? The -F
> makes it follow the pid, which can be trace-cmd before the fork.
>
> Oh, and what versions of the kernel and of trace-cmd are you
> running? It's best to always get the latest trace-cmd. Older versions
> did have some bugs.

I am building from source at commit
1785022e12cf0079889c519ffc7157f2143326ea. I haven't pulled for a few
days, but it seems fairly recent. I hope that I didn't miss an update.
I can pull if you think that might be beneficial.

This is on a relatively old kernel from Ubuntu 14.04 LTS: 3.13.0-93-generic

>
> Also, function_graph can have some residual traces from a sched_switch.
>
> That is, the tracing of a task specified by -F is enabled and disabled
> per cpu at sched_switch. The sched_switch tracepoint is before the
> actual context switch. When the next task is the task specified by -F,
> tracing begins. But the context switch can start tracing the previous
> task right after the sched switch tracepoint. Here, in
> kernel/sched/core.c in __schedule():
>
>                 trace_sched_switch(preempt, prev, next);
>
>                 /* Also unlocks the rq: */
>                 rq = context_switch(rq, prev, next, &rf);
>
>
> The context_switch() function will be traced even when the previous
> task isn't the task from -F.
>
>

This makes very good sense and matches what was starting to emerge
from my investigation earlier today. I don't consider this a problem
at all, just not something that I would have naively expected. Now
that I understand the system better, it makes perfect sense.


>>
>> But, back to exactly what you asked. Here is the result of running
>> perf again with the -a option. I ran this command:
>>
>> sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page
>>
>>     50.41%     page  ld-2.19.so         [.] do_lookup_x
>>                |
>>                --- do_lookup_x
>>
>>     44.21%     perf  [kernel.kallsyms]  [k] iov_iter_fault_in_readable
>>                |
>>                --- iov_iter_fault_in_readable
>>                    generic_file_buffered_write
>>                    __generic_file_aio_write
>>                    generic_file_aio_write
>>                    ext4_file_write
>>                    do_sync_write
>>                    vfs_write
>>                    sys_write
>>                    system_call_fastpath
>>                    __write_nocancel
>>                    0x4081a5
>>                    0x407a40
>>                    __libc_start_main
>>
>>      4.13%     perf  perf               [.] 0x0000000000015b54
>>                |
>>                --- 0x415b54
>>                    0x4081a5
>>                    0x407a40
>>                    __libc_start_main
>>
>>      0.41%     page  page               [.] _start
>>                |
>>                --- _start
>>
>>      0.41%     page  page               [.] target
>>                |
>>                --- target
>>
>>      0.41%     page  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>>                |
>>                --- copy_user_enhanced_fast_string
>>                    load_elf_binary
>>                    search_binary_handler
>>                    do_execve_common.isra.23
>>                    sys_execve
>>                    stub_execve
>>                    __execve
>>
>>
>> What is interesting is that the output differs based on whether I've
>> dropped the kernel caches before I run the perf record. When I do
>> that, there are no page faults attributed to the entry point or the
>> target of the program. I would imagine that, after dropping caches,
>> the readahead handler picks up those pages when the binary is loaded
>> and negates the need for a page fault. That, indeed, seems to be the
>> case. I can see that when I run perf with an additional
>> block:block_rq_issue event. Immediately after dropping the caches,
>> there is a block request event. On subsequent executions, there is no
>> such event but there are the page faults that I expect.
>>
>> What I did notice is that some of the page faults come from the task
>> before execve is called. From what I've seen, this looks like the
>> kernel reclaiming pages from the spawning process before it is
>> replaced with the new binary (during the call to execve). After the
>> execve, there seem to be two page faults:
>>
>>
>>             page-4613  [006] 171795.748310: funcgraph_entry:
>> 0.151 us   |  mutex_unlock();
>>             page-4613  [006] 171795.748313: funcgraph_entry:
>> 0.166 us   |  __fsnotify_parent();
>>             page-4613  [006] 171795.748313: funcgraph_entry:
>> 0.321 us   |  fsnotify();
>>             page-4613  [006] 171795.748314: funcgraph_entry:
>> 0.090 us   |  __sb_end_write();
>>             page-4613  [006] 171795.748317: funcgraph_entry:
>>         |  trace_do_page_fault() {
>>             page-4613  [006] 171795.748317: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x4
>>             page-4613  [006] 171795.748319: funcgraph_exit:
>> 2.254 us   |  }
>>             page-4613  [006] 171795.748321: funcgraph_entry:
>>         |  trace_do_page_fault() {
>>             page-4613  [006] 171795.748322: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>>             page-4613  [006] 171795.748323: funcgraph_exit:
>> 1.144 us   |  }
>>
>> NOTICE THIS:
>>             page-4613  [006] 171795.748324: funcgraph_entry:
>>         |  sys_execve() {
>>             page-4613  [007] 171795.748391: block_rq_issue:       8,0
>> R 0 () 764812912 + 16 [trace-cmd]
>>             page-4613  [005] 171795.759476: funcgraph_exit:       #
>> 11152.111 us |  }
>>             page-4613  [005] 171795.759477: funcgraph_entry:
>> 3.745 us   |  do_notify_resume();
>>             page-4613  [005] 171795.759481: funcgraph_entry:
>>         |  trace_do_page_fault() {
>>             page-4613  [005] 171795.759482: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>>             page-4613  [005] 171795.759487: funcgraph_exit:
>> 5.833 us   |  }
>>             page-4613  [005] 171795.759488: funcgraph_entry:
>>         |  trace_do_page_fault() {
>>             page-4613  [005] 171795.759489: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>>             page-4613  [005] 171795.759490: funcgraph_exit:
>> 2.003 us   |  }
>>             page-4613  [005] 171795.759492: funcgraph_entry:
>>         |  sys_exit() {
>>
>> I wish that I could get something "better" than "
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those page
>> faults. That would really tell me more about whether this is the
>> "correct" behavior.
>
> Also, you might want to do a trace-cmd report -l, which lets you see if
> the event happened in interrupt context.
>
>>
>> As ever, thank you very much for your help! Using these tools has been
>> an incredible learning experience. I still think that I am just
>> missing something stupid, but I really appreciate your patience.
>
> No no, you got me curious as well. Although the only page faults I see
> are from before the exec and from the actual program. Could you run
> with -F on your program, with the other options you have, and report a full:
>
>   trace-cmd report -l
>
> And then you can ask about what you don't understand being there.
>
> -- Steve

I have a slightly refined record command that I think has convinced me
that I see what is going on. Here's the record command:

sudo ./trace-cmd record -p function_graph -T --func-stack -l
do_page_fault -l do_execve --profile ../one_page_play/page

and here's the report command:

./trace-cmd report -l --comm="page"

Here's the (snipped) output:

...
    page-7717    7d... 198178.664966: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664968: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664970: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664971: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664973: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664974: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664975: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
    page-7717    7d... 198178.664977: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
...
    page-7717    7dN.. 198178.664985: sched_wakeup:
comm=migration/7 pid=58 prio=0 success=1 target_cpu=007
    page-7717    7dN.. 198178.664986: kernel_stack:         <stack trace>
=> try_to_wake_up (ffffffff8109d2a2)
=> wake_up_process (ffffffff8109d3a5)
...
=> sched_exec (ffffffff8109bcd0)
=> do_execve_common.isra.23 (ffffffff811c88ce)
=> SyS_execve (ffffffff811c9176)
=> stub_execve (ffffffff8173afc9)
  <idle>-0       4d... 198178.664993: sched_switch:
prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==>
next_comm=trace-cmd next_pid=7717 next_prio=120
    page-7717    4.... 198178.665053: sched_process_exec:
filename=../one_page_play/page pid=7717 old_pid=7717
...
=> stub_execve (ffffffff8173afc9)
...
    page-7717    4d... 198178.665056: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
...
    page-7717    4d... 198178.665059: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
...

Here's how I read this:

There are a "bunch" of page faults that occur as the spawning process
quits. Then the execve() system call starts. After that, there are two
page faults.

The two page faults that occur after the exec() are exactly the number
that I would expect.
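
As a sanity check on those error codes, here is a minimal decoder for
the x86 page-fault error code bits (the PF_* names mirror the ones the
kernel itself uses in arch/x86/mm/fault.c; the little program is just
an illustration I put together, not anything from the trace tooling):

  #include <stdio.h>

  /* x86 page-fault error code bits (bit set => condition true) */
  #define PF_PROT  0x01  /* protection violation (page was present) */
  #define PF_WRITE 0x02  /* access was a write */
  #define PF_USER  0x04  /* fault happened in user mode */
  #define PF_RSVD  0x08  /* reserved bit set in a paging-structure entry */
  #define PF_INSTR 0x10  /* fault was an instruction fetch */

  static void decode(unsigned long ec)
  {
          printf("0x%lx: %s-mode %s, page %s\n", ec,
                 (ec & PF_USER)  ? "user" : "kernel",
                 (ec & PF_INSTR) ? "instruction fetch" :
                 (ec & PF_WRITE) ? "write" : "read",
                 (ec & PF_PROT)  ? "present" : "not present");
  }

  int main(void)
  {
          decode(0x04);   /* most of the faults before the exec */
          decode(0x14);   /* the two faults after the exec */
          return 0;
  }

So 0x4 is a user-mode read of a not-present page, and 0x14 is a
user-mode instruction fetch of a not-present page -- exactly what
faulting in the entry page and then the target page should look like.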

It seems like this is exactly the answer!

But, while I have your ear, I was wondering if you could direct me to
where the page_fault_user gathers its information (when tracing) and
where that information is formatted for printing (in trace-cmd). I'd
really like to investigate modifying that code so that it provides
"better" information than

address=__per_cpu_end ip=__per_cpu_end

Now that I'm into this, I want to really dig in. If you can give me a
pointer, that would really help me get started. After that I can
attempt to make some patches and see where it leads.

For the x86 architecture, the relevant files seem to be

arch/x86/mm/fault.c (with the trace_page_fault_entries() function)

and

arch/x86/include/asm/trace/exceptions.h (where the exception event class is built)

What seems odd there is that the tracepoint *should* be reporting
"better" information -- address is set to the value of the cr2
register, which should contain the faulting address, and ip is taken
from the user-space instruction pointer in the saved registers.
Perhaps I just don't understand what the symbol __per_cpu_end means.
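
For what it's worth, the event class in
arch/x86/include/asm/trace/exceptions.h looks roughly like this in my
tree (paraphrased from memory, so treat the details below as a sketch
rather than the exact source):

  DECLARE_EVENT_CLASS(x86_exceptions,

          TP_PROTO(unsigned long address, struct pt_regs *regs,
                   unsigned long error_code),

          TP_ARGS(address, regs, error_code),

          TP_STRUCT__entry(
                  __field(unsigned long, address)
                  __field(unsigned long, ip)
                  __field(unsigned long, error_code)
          ),

          TP_fast_assign(
                  __entry->address    = address;     /* cr2: faulting address */
                  __entry->ip         = regs->ip;    /* user-space instruction pointer */
                  __entry->error_code = error_code;
          ),

          /*
           * %pf asks for the value to be printed as a symbol. A user-space
           * address has no kernel symbol, so I suspect the lookup just falls
           * back to the nearest symbol below it, which happens to be
           * __per_cpu_end.
           */
          TP_printk("address=%pf ip=%pf error_code=0x%lx",
                    (void *)__entry->address, (void *)__entry->ip,
                    __entry->error_code)
  );

If that suspicion is right, a local patch that prints the raw values
(e.g. 0x%lx instead of %pf) should be enough to get the actual
user-space addresses back out of the report.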

Any information you can shed on this would be really great.

Thanks again for your patience with me! I think that I now understand
what is going on.

Will


