[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100906174642.GG14891@linux.vnet.ibm.com>
Date: Mon, 6 Sep 2010 23:16:42 +0530
From: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...e.hu>, Steven Rostedt <rostedt@...dmis.org>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Christoph Hellwig <hch@...radead.org>,
Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
Oleg Nesterov <oleg@...hat.com>,
Mark Wielaard <mjw@...hat.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Naren A Devaiah <naren.devaiah@...ibm.com>,
Jim Keniston <jkenisto@...ux.vnet.ibm.com>,
Frederic Weisbecker <fweisbec@...il.com>,
"Frank Ch. Eigler" <fche@...hat.com>,
Ananth N Mavinakayanahalli <ananth@...ibm.com>,
LKML <linux-kernel@...r.kernel.org>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
Subject: Re: [PATCHv11 2.6.36-rc2-tip 5/15] 5: uprobes: Uprobes
(un)registration and exception handling.
* Peter Zijlstra <peterz@...radead.org> [2010-09-03 19:19:09]:
> >
> > However I would have an issue with making inode based probing the
> > default.
> > 1. Making all probing based on inode can be a performance hog.
>
> How so? You can add filters if you want.
The breakpoint exception and singlestep account for a substaintial time
of the uprobes probe handling. With increasing number of breakpoint hits and singlesteps, wouldnt the overall load increase?
>
> > 2. Since unlike kernel space, every process has a different space, so
> > why would we have to insert breakpoints in each of its process space if
> > we are not interested in them.
>
> You don't have to, but you can. The problem I have with this stuff is
> that it makes the pid thing a primary interface, whereas it should be
> one of many filter possibilities.
I think the otherway,
Why instrument a process and filter it out, if we are not interested in it.
While instrumenting kernel, we dont have this flexibility. So
having a pid based filter is the right thing to do for kernel
based tracing.
If we can get the per process based tracing right, we can build
higher lever stuff including the file based tracing easily.
All tools/debuggers in the past have all worked with process based
tracing.
Tools like gdb can actually use the displaced singlestepping
feature that uprobes provides. Some gdb developers have told on LKML
earlier that they would be willing to use displaced singlestepping if
the kernel provides an API that they can use.
Also about the security perspective when allowing normal users use
perf to trace their applications. Using this model, we dont have to
write extra filters to limit them. These filters might allow uprobe
handlers on only tasks belonging to that user. However it still
interrupts task of other users. And as I said earlier, breakpoint
exception and singlestepping actually make a very very substantial part
of the handling. The actual uprobe handler depending on what it does
might actually be very minimal part of the uprobe probe handling.
> > > > + */
> > > > + pid_t pid;
> > >
> > > Would fail for pid namespaces I guess...
> >
> > Can you explain a bit more?
>
> What pid namespace will you interpret that number in?
Same namespace as the requestor. i.e whichever name space
register_uprobe/unregister_uprobe is called.
> >
> > Unlike kernel probing, uprobes has a disadvantage.
> > Lets assume that the request for removing a probepoint when some of the
> > threads have actually hit the probe. Because the handlers in uprobes can
> > sleep, we cant remove the probepoint at the same time as the request for
> > the removing the probe. This is where refcount steps in and helps us to
> > decide when we can remove the probepoint. Even inoode based
> > tracing or file based tracing would need it.
>
> Stick the refcount in the actual struct uprobe instead?
What if the called does something like this when one or more
threads are processing the breakpoint.
unregister_uprobe(u);
kfree(u);
In the current implementation, the probepoint structure might be
released much later after the uprobe structure is released.
Unlike uprobe struct, probepoint structure is allocated by uprobes
sub-system and it knows how to release it cleanly. However we dont have
any idea about when and how uprobe struct can be released.
> >
> > state is needed so that we know at do_notify_resume time whether the
> > thread should singlestep or has already singlestep.
> > Do you see a way to determine in do_notify_resume if the thread is
> > singlestepping or not. I dont think the instruction pointer can
> > determine if we are about to singlestep or have already singlestepped.
>
> The to singlestep or not would be implied by the IP pointing to the
> start of a slot or not, but yes, I guess that as long as you do
> singlestep you need some state.. sucks though. Boosted probes are much
> nicer, they don't need that extra arch storage either, they can simply
> push/pop the register.
Yes, I agree, we may not need the state after boosted probes.
I am not sure at this time if we can do boosted probes for all
instructions. I think we can discuss about boosted probes later.
> > > > +
> > > > +/* Guards lookup, creation, and deletion of uproc. */
> > > > +static DEFINE_MUTEX(uprobe_mutex);
> > >
> > > uproc is per mm, why a global mutex?
> >
> > yes, uproc is per mm but gets created when the first probe on that
> > process gets registered.
> > Do you have anyother lock that you have in mind?
> > I could have used the mmap_sem, but I am not sure if we should be taking
> > a write lock on the mmap_sem.
>
> No particular other lock in mind, you could cmpxchg the pointer if
> that's all you need it for. The problem is that if you want inode based
> probes without a filter this will become a rather hot lock on fork().
>
I can certainly try cmpxchg().
> >
> > We dropped running the handler in interrupt context based on comments to
> > my previous postings. It not only makes it complex at the time of
> > handling the probe but we didnt see a difference when we compare the
> > time it took to run the handler in task context to running the handler
> > in interrupt context. I had even posted the timing info when
> > register_uprobes was exported.
>
> Is that because of the singlestep overhead? With boosted probes I would
> think it'd be much faster to take 1 trap, deal with it and continue
> execution, than to frob tons of kernel code in between.
>
The difference between running handlers in task context and running in
interrupt context is the extra do_notify_resume() that gets called from
task context. But we have more guarantees, flexibility and less
complexity with running the handler from do_notify_resume than
running the handler in interrupt context. Since do_notify_resume is the
only extra thing that gets called (on the way out of the
interrupt), the difference in the penalties between running the
handler in task context and running the handler in interrupt
context is negligible.
I would want to stick with running handlers in task context for now.
We could optimize this later on depending on further results and
uprobes maturing.
>
>
> A bit more about these filter thingies, add a method to struct uprobe,
> something like int uprobe::wants_probe(struct task_struct *p) and add a
> single bit to task_struct (there's a few bitfields with holes in there).
>
> The on clone()/mmap() call the relevant wants_probe() methods, if one is
> true, set the task_struct::has_uprobe flag and install the probes.
>
> If nothing in the process wants probing, you'll never install the probes
> and nothing ever triggers, of only one of many tasks in the process gets
> tagged, you'll have to look up the probe anyway to know where to
> continue, but you can avoid calling the handler.
>
Yes, we can avoid calling the handler. but we still have the other
penalties of breakpoint hit, searching the probe, and
singlestepping. Just to clarify, I am not looking at probing per task
but probing per process. So the pid field in uprobe refers to the
tgid.
--
Thanks and Regards
Srikar
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists