linux-kernel - Re: Help with trace-cmd/ftrace recording process ID information

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADx9qWj_QkVcU0NV9_5VSuWEYRN41juS6uBd6wOMD2Et7y3igw@mail.gmail.com>
Date:   Mon, 17 Jul 2017 16:08:10 -0400
From:   Will Hawkins <hawkinsw@...laugic.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Help with trace-cmd/ftrace recording process ID information

On Mon, Jul 17, 2017 at 4:06 PM, Will Hawkins <hawkinsw@...laugic.com> wrote:
> Thank you for your response!
>
> On Mon, Jul 17, 2017 at 3:53 PM, Steven Rostedt <rostedt@...dmis.org> wrote:
>>
>> On Mon, 17 Jul 2017 15:18:18 -0400
>> Will Hawkins <hawkinsw@...laugic.com> wrote:
>>
>> > Hello everyone, especially Mr. Rostedt,
>> >
>> > I have had great success with ftrace debugging performance issues on
>> > Linux systems. The combination of ftrace and trace-cmd are absolutely
>> > amazing tools for digging in to exactly what is going on in a system and
>> > where performance problems exist.
>> >
>> > I recently switched to a different host and attempted to run trace-cmd
>> > record to get a record of page faults:
>> >
>> > /path/to//trace-cmd/trace-cmd record -e page_fault_user /bin/ls
>> >
>> > When I "report" on that trace, I get entries like the following:
>> >
>> > <...>-41850 [010] 27484983.185200: page_fault_user:
>> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> >
>> > It's exactly what I want. However, it does not list the process that
>> > generated that fault. Instead, it uses <...>. I dug into the trace-cmd
>> > code and see where this is generated and why it is generated.
>> >
>> > What I don't understand is why on a different system, when I run the
>> > same record command, I get the following output:
>> >
>> >   ls-19887 [005] 2438162.263793: page_fault_user:
>> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> >
>> > Again, it's exactly what I want and it lists the process name that
>> > generated the fault.
>> >
>> >  From the code, I see that the <...> is printed instead of the name of
>> > the process when the pid is not in the pevent's command lines. What I
>> > can't seem to figure out is why the process would be in that list on one
>> > host and not on the other.
>>
>> Are you using the same kernel version and trace-cmd version on both
>> hosts?
>
> Roughly, yes.
>
> Both are running (old) kernels 3.13 but are running from HEAD of
> master from the main git repository of trace-cmd.
>
>>
>> >
>> > When I looked at the trace.dat file directly, I did notice that on the
>> > "good" host, there are a list of pids and names. On the "bad" host,
>> > there is no such list in the trace.dat file. I am sure that is the
>> > reason for the <...>s being printed, but I can't figure out why that
>> > list is not getting in the trace.dat file.
>> >
>> > I gave a quick look to try to find where that pid/comm list is generated
>> > and written to the trace.dat file, but couldn't find anything.
>> >
>> > I figured that I would send an email before I dug any further in case
>> > someone has seen this already. I am happy to pass along other pertinent
>> > information if it is helpful to debug the problem. I just don't want to
>> > spam the list with information that is irrelevant.
>> >
>> > Again, the combination of ftrace/trace-cmd is borderline magic. I
>> > appreciate all the work that has gone into it!
>> >
>> > Thanks in advance for helping me sort through this issue!
>>
>> The comm (the program name) is not saved at each event. Instead,
>> there's a "cache" of them. On a schedule switch, when tracing is
>> active, it will store a comm in a table. The trace file uses this list
>> too. When trace-cmd is finished tracing, it will read that table which
>> is located in the tracefs directory and the file is called
>> saved_cmdlines.
>
> This seems to be the problem:
>
> On the "good" system, that file is up-to-date with cached PIDs and
> comms. On the bad host, there are no cached entries from any of the
> traces that I've run.
>
> Because these are running old kernels, there is no saved_cmdlines_size
> knob to turn. Do you have any idea why the saved_cmdlines would not be
> getting updated appropriately on the "bad" host? I know this is not
> ideal, but I can try to reboot that host and see if something is
> simply wedged. The system has been online for almost a year, so it's
> possible that something has gone wrong.
>
> Any help you can offer would be great! Thank you, again, for your response!

I failed to mention that the "bad" host does include the (limited and
stale) information from saved_cmdlines in the trace.dat file. I just
verified that. So, now the only open question is why that file is not
being updated appropriately.

Thanks again!
Will

>
> Will
>
>>
>> By default, it saves 128 comms. If you want more or less, you can
>> change the size by echoing in the new size number into the file
>> saved_cmdlines_size.
>>
>> I'm not sure why trace-cmd didn't save that file, unless it was an
>> older version that did the recording.
>>
>>
>> -- Steve
>>