Message-ID: <87a9igcj8d.fsf@sejong.aot.lge.com>
Date: Fri, 11 Oct 2013 17:19:46 +0900
From: Namhyung Kim <namhyung@...nel.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Frederic Weisbecker <fweisbec@...il.com>,
Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Namhyung Kim <namhyung.kim@....com>,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>
Subject: Re: [RFC/PATCH] ftrace: add set_graph_notrace filter
Hi Steve,
On Fri, 11 Oct 2013 00:17:17 -0400, Steven Rostedt wrote:
> Sorry for the very late reply, finally got some time to look at other
> people's code.
Thank you for taking the time to review this carefully. :)
>
> On Tue, 3 Sep 2013 14:05:08 +0900
> Namhyung Kim <namhyung@...nel.org> wrote:
>
>> From: Namhyung Kim <namhyung.kim@....com>
>>
>> The set_graph_notrace filter is analogous to set_ftrace_notrace and
>> can be used to eliminate uninteresting parts of the function graph
>> trace output. It also works nicely with set_graph_function.
>>
>> # cd /sys/kernel/debug/tracing/
>> # echo do_page_fault > set_graph_function
>> # perf ftrace live true
>>  2)               |  do_page_fault() {
>>  2)               |    __do_page_fault() {
>>  2)   0.381 us    |      down_read_trylock();
>>  2)   0.055 us    |      __might_sleep();
>>  2)   0.696 us    |      find_vma();
>>  2)               |      handle_mm_fault() {
>>  2)               |        handle_pte_fault() {
>>  2)               |          __do_fault() {
>>  2)               |            filemap_fault() {
>>  2)               |              find_get_page() {
>>  2)   0.033 us    |                __rcu_read_lock();
>>  2)   0.035 us    |                __rcu_read_unlock();
>>  2)   1.696 us    |              }
>>  2)   0.031 us    |              __might_sleep();
>>  2)   2.831 us    |            }
>>  2)               |            _raw_spin_lock() {
>>  2)   0.046 us    |              add_preempt_count();
>>  2)   0.841 us    |            }
>>  2)   0.033 us    |            page_add_file_rmap();
>>  2)               |            _raw_spin_unlock() {
>>  2)   0.057 us    |              sub_preempt_count();
>>  2)   0.568 us    |            }
>>  2)               |            unlock_page() {
>>  2)   0.084 us    |              page_waitqueue();
>>  2)   0.126 us    |              __wake_up_bit();
>>  2)   1.117 us    |            }
>>  2)   7.729 us    |          }
>>  2)   8.397 us    |        }
>>  2)   8.956 us    |      }
>>  2)   0.085 us    |      up_read();
>>  2) + 12.745 us   |    }
>>  2) + 13.401 us   |  }
>> ...
>>
>> # echo handle_mm_fault > set_graph_notrace
>> # perf ftrace live true
>>  1)               |  do_page_fault() {
>>  1)               |    __do_page_fault() {
>>  1)   0.205 us    |      down_read_trylock();
>>  1)   0.041 us    |      __might_sleep();
>>  1)   0.344 us    |      find_vma();
>>  1)   0.069 us    |      up_read();
>>  1)   4.692 us    |    }
>>  1)   5.311 us    |  }
>> ...
>
> Concept looks good.
hehe, thanks.
[SNIP]
>
>> @@ -3910,8 +3971,6 @@ out:
>>          if (fail)
>>                  return -EINVAL;
>>
>> -        ftrace_graph_filter_enabled = !!(*idx);
>> -
>>          return 0;
>>  }
>>
>> @@ -3946,6 +4005,50 @@ ftrace_graph_write(struct file *file, const char __user *ubuf,
>>
>>          ret = read;
>>
>> +        ftrace_graph_filter_enabled = ftrace_graph_count > 0;
>
> You modified the ftrace_graph_filter code without mentioning this in
> the change log. Actually, this should be a separate patch (clean up)
> and not just included with the ftrace_graph_notrace code, as it is a
> separate change.
Okay, I'll make it a separate patch. And I think we can get rid of
ftrace_graph_{filter,notrace}_enabled altogether and just use the
_count variables instead.
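Something like this, as a rough sketch (untested; the idea being that a
zero count already means "no filter set"):

        static inline int ftrace_graph_addr(unsigned long addr)
        {
                int i;

                /* A count of zero means no filter is set; trace everything. */
                if (!ftrace_graph_count)
                        return 1;

                for (i = 0; i < ftrace_graph_count; i++) {
                        if (addr == ftrace_graph_funcs[i])
                                return 1;
                }

                return 0;
        }

and the same test on the notrace side, so both _enabled flags and the
code that keeps them in sync can go away.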
>
>> +
>> +out_free:
>> +        trace_parser_put(&parser);
>> +out_unlock:
>> +        mutex_unlock(&graph_lock);
>> +
>> +        return ret;
>> +}
>> +
>> +static ssize_t
>> +ftrace_graph_notrace_write(struct file *file, const char __user *ubuf,
>> +                           size_t cnt, loff_t *ppos)
>> +{
>> +        struct trace_parser parser;
>> +        ssize_t read, ret;
>> +
>> +        if (!cnt)
>> +                return 0;
>> +
>> +        mutex_lock(&graph_lock);
>> +
>> +        if (trace_parser_get_init(&parser, FTRACE_BUFF_MAX)) {
>> +                ret = -ENOMEM;
>> +                goto out_unlock;
>> +        }
>> +
>> +        read = trace_get_user(&parser, ubuf, cnt, ppos);
>> +
>> +        if (read >= 0 && trace_parser_loaded((&parser))) {
>> +                parser.buffer[parser.idx] = 0;
>> +
>> +                /* we allow only one expression at a time */
>> +                ret = ftrace_set_func(ftrace_graph_notrace_funcs,
>> +                                      &ftrace_graph_notrace_count,
>> +                                      parser.buffer);
>> +                if (ret)
>> +                        goto out_free;
>> +        }
>> +
>> +        ret = read;
>> +
>> +        ftrace_graph_notrace_enabled = ftrace_graph_notrace_count > 0;
>> +
>
> ftrace_graph_notrace_write() and ftrace_graph_write() are basically
> identical. Please make a helper function that does the work and passes
> in the different variables. That is:
>
> static ssize_t
> ftrace_graph_notrace_write(struct file *file, const char __user *ubuf,
>                            size_t cnt, loff_t *ppos)
> {
>         return __ftrace_graph_write(file, ubuf, cnt, ppos,
>                                     ftrace_graph_notrace_funcs,
>                                     &ftrace_graph_notrace_count);
> }
>
> and do the same with ftrace_graph_write(). This way if we find a bug,
> or update the code, it gets done to both and not just one.
Agreed, will do that.
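Maybe something along these lines (just a sketch of the factoring, based
on the code above; each wrapper passes in its own array and count):

        static ssize_t
        __ftrace_graph_write(struct file *file, const char __user *ubuf,
                             size_t cnt, loff_t *ppos,
                             unsigned long *array, int *count)
        {
                struct trace_parser parser;
                ssize_t read, ret;

                if (!cnt)
                        return 0;

                mutex_lock(&graph_lock);

                if (trace_parser_get_init(&parser, FTRACE_BUFF_MAX)) {
                        ret = -ENOMEM;
                        goto out_unlock;
                }

                read = trace_get_user(&parser, ubuf, cnt, ppos);

                if (read >= 0 && trace_parser_loaded(&parser)) {
                        parser.buffer[parser.idx] = 0;

                        /* we allow only one expression at a time */
                        ret = ftrace_set_func(array, count, parser.buffer);
                        if (ret)
                                goto out_free;
                }

                ret = read;

        out_free:
                trace_parser_put(&parser);
        out_unlock:
                mutex_unlock(&graph_lock);

                return ret;
        }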
[SNIP]
>
>> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
>> index b5c09242683d..3ba240daa5e0 100644
>> --- a/kernel/trace/trace_functions_graph.c
>> +++ b/kernel/trace/trace_functions_graph.c
>> @@ -114,16 +114,22 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func, int *depth,
>>                  return -EBUSY;
>>          }
>>
>> +        /* The function was filtered out */
>> +        if (current->curr_ret_stack < -1)
>> +                return -EBUSY;
>
> I would be a bit more specific in your comment. Explain that
> curr_ret_stack is negative when we hit a function in the
> set_graph_notrace file.
How about this:
        /*
         * curr_ret_stack is initialized to -1 and gets increased in
         * this function.  So it can be less than -1 only if it was
         * filtered out via ftrace_graph_notrace_addr(), which is
         * controlled from the set_graph_notrace file in debugfs by
         * the user.
         */
>
>> +
>>          calltime = trace_clock_local();
>>
>>          index = ++current->curr_ret_stack;
>> +        if (ftrace_graph_notrace_addr(func))
>> +                current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH;
>
> I really hate this double call to ftrace_graph_notrace_addr(). The
> first one in trace_graph_entry(), and then here too.
>
> Isn't there a way we could pass the state? Hmm, I think we could use
> depth to do that, as depth is a pointer to trace.depth and isn't used
> before then. We could make it negative and then check that.
>
> /me looks at other archs.
>
> Darn it, s390 calls ftrace_push_return_trace() before
> ftrace_graph_entry(). They got fancy, as they must have noticed that
> trace.depth is set by ftrace_push_return_trace(), and figured they'd
> just let ftrace_push_return_trace() do the work.
Yes, I thought about it before, but as you can see the other archs go
the other way around, so I just ended up calling it twice.
>
> Hmm, we could add a config to do the check on one side or the other
> depending on how the arch handles it.
>
> It needs to be well commented though.
Hmm.. maybe. But it'd be better if this function call order were fixed
across archs, IMHO. Anyway, I'll add comments.
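For reference, I imagine the config approach would look roughly like the
below in ftrace_push_return_trace() (CONFIG_ARCH_GRAPH_PUSH_BEFORE_ENTRY
is a made-up name just for illustration; archs like s390 that push
before calling ftrace_graph_entry() would select it, and the check in
trace_graph_entry() would be compiled out for them):

        index = ++current->curr_ret_stack;
        #ifdef CONFIG_ARCH_GRAPH_PUSH_BEFORE_ENTRY
        /*
         * This arch calls ftrace_push_return_trace() before
         * ftrace_graph_entry(), so the notrace check has to be
         * done here rather than in trace_graph_entry().
         */
        if (ftrace_graph_notrace_addr(func))
                current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH;
        #endif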
>
>
>
>>          barrier();
>>          current->ret_stack[index].ret = ret;
>>          current->ret_stack[index].func = func;
>>          current->ret_stack[index].calltime = calltime;
>>          current->ret_stack[index].subtime = 0;
>>          current->ret_stack[index].fp = frame_pointer;
>> -        *depth = index;
>> +        *depth = current->curr_ret_stack;
>>
>>          return 0;
>>  }
>> @@ -137,6 +143,9 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
>>
>>          index = current->curr_ret_stack;
>>
>> +        if (index < 0)
>> +                index += FTRACE_NOTRACE_DEPTH;
>
> Speaking of comments: this needs to be commented, and not only that,
> perhaps this is where another ftrace_graph_notrace_addr() belongs.
>
>         bool check = false;
>
>         if (index < 0) {
>                 index += FTRACE_NOTRACE_DEPTH;
>                 check = true;
>         }
>
>         [...]
>
>         if (check && !ftrace_graph_notrace_addr(trace->func))
>                 [ do the same as in the index < 0 bug case ]
>
>
>
> Otherwise if the stack does get corrupted, this will just think we are
> in a notrace and fail someplace else, making it much harder to debug
> what went wrong.
>
> Hmm, this may be racy, as the user could change the notrace file and
> this could happen after that, causing a false positive.
>
> Well, we could do other tricks, like adding a flag in
> current->ret_stack[index].func (LSB set).
>
> But for now, just comment this better. We can come up with a proper
> check later. (/me is rambling as I've been on and off reviewing this
> patch all day, and it's now past my bedtime).
Sorry for interrupting your bedtime. :)
Hmm.. I never thought about the corruption. And yes, simply checking it
again here might cause trouble, I guess. I'll add a comment for this:
        /*
         * A negative index here means that it's just returned from a
         * notrace'd function.  Recover the index to get the original
         * return address.  See ftrace_push_return_trace().
         *
         * TODO: Need to check whether the stack gets corrupted.
         */
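For the record, spelling out your suggested check (untested, and as you
said it could race against updates to set_graph_notrace, so it would
only be a debugging aid):

        int index = current->curr_ret_stack;
        bool from_notrace = false;

        if (index < 0) {
                index += FTRACE_NOTRACE_DEPTH;
                from_notrace = true;
        }

        /*
         * If we came back from a notrace'd function but its address
         * is not in the notrace list, the ret_stack is likely
         * corrupted.  Treat it the same as an out-of-range index.
         */
        if (unlikely(index < 0 ||
                     (from_notrace &&
                      !ftrace_graph_notrace_addr(current->ret_stack[index].func)))) {
                ftrace_graph_stop();
                WARN_ON(1);
                /* Might as well panic, otherwise we have nowhere to go */
                *ret = (unsigned long)panic;
                return;
        }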
>
>
>> +
>>          if (unlikely(index < 0)) {
>>                  ftrace_graph_stop();
>>                  WARN_ON(1);
>> @@ -193,6 +202,10 @@ unsigned long ftrace_return_to_handler(unsigned long frame_pointer)
>>          trace.rettime = trace_clock_local();
>>          barrier();
>>          current->curr_ret_stack--;
>> +        if (current->curr_ret_stack < -1) {
>> +                current->curr_ret_stack += FTRACE_NOTRACE_DEPTH;
>> +                return ret;
>> +        }
>>
>>          /*
>>           * The trace should run after decrementing the ret counter
>> @@ -259,10 +272,14 @@ int trace_graph_entry(struct ftrace_graph_ent *trace)
>>
>>          /* trace it when it is-nested-in or is a function enabled. */
>>          if ((!(trace->depth || ftrace_graph_addr(trace->func)) ||
>> -             ftrace_graph_ignore_irqs()) ||
>> +             ftrace_graph_ignore_irqs()) || (trace->depth < 0) ||
>>              (max_depth && trace->depth >= max_depth))
>>                  return 0;
>>
>> +        /* need to reserve a ret_stack entry to recover the depth */
>> +        if (ftrace_graph_notrace_addr(trace->func))
>> +                return 1;
>> +
>
> Also, I understand what you are doing here, with making curr_ret_stack
> negative to denote we are in a state to not do tracing. But it's more
> of a hack (not a bad one), and really needs to be documented somewhere.
> That is, the design should be in the comments where it's used; otherwise,
> 5 years from now nobody will understand how the notrace works without
> spending days trying to figure it out. Let's be nice to that poor soul,
> and write up what is going on so they don't need to pray to the holy
> tuna hoping to get a fish of enlightenment on how turning
> curr_ret_stack negative magically makes the function graph tracing not
> trace down functions, and magically starts tracing again when it comes
> back up.
Fully agreed. How about this:

        /*
         * The curr_ret_stack is an index into the ftrace return stack of
         * the current task.  Its value should be in [0, FTRACE_RETFUNC_DEPTH)
         * when the function graph tracer is used.  To support filtering out
         * specific functions, it makes the index negative by subtracting a
         * huge value (FTRACE_NOTRACE_DEPTH), so that when ftrace sees a
         * negative index it will ignore the record.  The index is recovered
         * when returning from the filtered function by adding
         * FTRACE_NOTRACE_DEPTH back, and tracing then continues normally.
         */
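And maybe a concrete example would help too (assuming FTRACE_NOTRACE_DEPTH
is a large constant like 65536, far beyond any real stack depth). Say
curr_ret_stack is 1 when a notrace'd function is entered:

        push:    index = ++curr_ret_stack;                /* 2      */
                 curr_ret_stack -= FTRACE_NOTRACE_DEPTH;  /* -65534 */
        nested:  curr_ret_stack < -1, so new entries are rejected (-EBUSY)
        pop:     index = curr_ret_stack + FTRACE_NOTRACE_DEPTH;  /* 2 */
        return:  curr_ret_stack--;                        /* -65535 */
                 curr_ret_stack += FTRACE_NOTRACE_DEPTH;  /* back to 1 */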
Thanks,
Namhyung