linux-kernel - Re: Help with trace-cmd/ftrace recording process ID information

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADx9qWj2+T0bd1GN1DOV+9_uKPc3FMqkqGJCmeTLL8y=FNMLCg@mail.gmail.com>
Date:   Mon, 17 Jul 2017 16:06:37 -0400
From:   Will Hawkins <hawkinsw@...laugic.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Help with trace-cmd/ftrace recording process ID information

Thank you for your response!

On Mon, Jul 17, 2017 at 3:53 PM, Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Mon, 17 Jul 2017 15:18:18 -0400
> Will Hawkins <hawkinsw@...laugic.com> wrote:
>
> > Hello everyone, especially Mr. Rostedt,
> >
> > I have had great success with ftrace debugging performance issues on
> > Linux systems. The combination of ftrace and trace-cmd are absolutely
> > amazing tools for digging in to exactly what is going on in a system and
> > where performance problems exist.
> >
> > I recently switched to a different host and attempted to run trace-cmd
> > record to get a record of page faults:
> >
> > /path/to//trace-cmd/trace-cmd record -e page_fault_user /bin/ls
> >
> > When I "report" on that trace, I get entries like the following:
> >
> > <...>-41850 [010] 27484983.185200: page_fault_user:
> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> >
> > It's exactly what I want. However, it does not list the process that
> > generated that fault. Instead, it uses <...>. I dug into the trace-cmd
> > code and see where this is generated and why it is generated.
> >
> > What I don't understand is why on a different system, when I run the
> > same record command, I get the following output:
> >
> >   ls-19887 [005] 2438162.263793: page_fault_user:
> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> >
> > Again, it's exactly what I want and it lists the process name that
> > generated the fault.
> >
> >  From the code, I see that the <...> is printed instead of the name of
> > the process when the pid is not in the pevent's command lines. What I
> > can't seem to figure out is why the process would be in that list on one
> > host and not on the other.
>
> Are you using the same kernel version and trace-cmd version on both
> hosts?

Roughly, yes.

Both are running (old) kernels 3.13 but are running from HEAD of
master from the main git repository of trace-cmd.

>
> >
> > When I looked at the trace.dat file directly, I did notice that on the
> > "good" host, there are a list of pids and names. On the "bad" host,
> > there is no such list in the trace.dat file. I am sure that is the
> > reason for the <...>s being printed, but I can't figure out why that
> > list is not getting in the trace.dat file.
> >
> > I gave a quick look to try to find where that pid/comm list is generated
> > and written to the trace.dat file, but couldn't find anything.
> >
> > I figured that I would send an email before I dug any further in case
> > someone has seen this already. I am happy to pass along other pertinent
> > information if it is helpful to debug the problem. I just don't want to
> > spam the list with information that is irrelevant.
> >
> > Again, the combination of ftrace/trace-cmd is borderline magic. I
> > appreciate all the work that has gone into it!
> >
> > Thanks in advance for helping me sort through this issue!
>
> The comm (the program name) is not saved at each event. Instead,
> there's a "cache" of them. On a schedule switch, when tracing is
> active, it will store a comm in a table. The trace file uses this list
> too. When trace-cmd is finished tracing, it will read that table which
> is located in the tracefs directory and the file is called
> saved_cmdlines.

This seems to be the problem:

On the "good" system, that file is up-to-date with cached PIDs and
comms. On the bad host, there are no cached entries from any of the
traces that I've run.

Because these are running old kernels, there is no saved_cmdlines_size
knob to turn. Do you have any idea why the saved_cmdlines would not be
getting updated appropriately on the "bad" host? I know this is not
ideal, but I can try to reboot that host and see if something is
simply wedged. The system has been online for almost a year, so it's
possible that something has gone wrong.

Any help you can offer would be great! Thank you, again, for your response!

Will

>
> By default, it saves 128 comms. If you want more or less, you can
> change the size by echoing in the new size number into the file
> saved_cmdlines_size.
>
> I'm not sure why trace-cmd didn't save that file, unless it was an
> older version that did the recording.
>
>
> -- Steve
>