Message-ID: <4C5528A6.1060008@dcl.info.waseda.ac.jp>
Date: Sun, 01 Aug 2010 16:56:22 +0900
From: Hitoshi Mitake <mitake@....info.waseda.ac.jp>
To: Frederic Weisbecker <fweisbec@...il.com>,
Jens Axboe <jaxboe@...ionio.com>
CC: Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
h.mitake@...il.com, Ingo Molnar <mingo@...e.hu>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Paul Mackerras <paulus@...ba.org>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Jason Baron <jbaron@...hat.com>,
Xiao Guangrong <xiaoguangrong@...fujitsu.com>
Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime"
for contention analysis
On 08/01/10 14:36, Hitoshi Mitake wrote:
> On 07/29/10 01:39, Frederic Weisbecker wrote:
> > On Sun, Jul 25, 2010 at 01:55:29AM +0900, Hitoshi Mitake wrote:
> >> Recording every lock event produces too much overhead,
> >> both in CPU time and in memory space.
> >> This overhead is a big problem that should be solved,
> >> but locks are an important factor today.
> >>
> >> So I'd like to add a new lock event "lock_acquired_waittime"
> >> for contention analysis. lock_acquired() calculates the time spent
> >> waiting for a lock, using held_lock->waittime_stamp set by
> >> lock_contended().
> >> If this value is 0, no contention occurred. If it is not 0,
> >> contention did occur.
> >> So I inserted a branch on this value into lock_acquired().
> >> The new lock_acquired() calls trace_lock_acquired() when waittime is 0,
> >> and calls trace_lock_acquired_waittime(), the function for the new
> >> event, when waittime is not 0.
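
(For reference, the shape of this change is roughly the following sketch;
it is not the exact diff, and the tracepoint argument lists shown here are
only illustrative:)

	/* In __lock_acquired(), after waittime has been computed from
	 * hlock->waittime_stamp (which lock_contended() sets): */
	if (waittime)
		/* contention happened: rare, detailed event */
		trace_lock_acquired_waittime(lock, ip, waittime);
	else
		/* uncontended acquisition: the original high-frequency event */
		trace_lock_acquired(lock, ip);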
> >>
> >> The traditional event, lock_acquired, fires at a very high frequency.
> >> The new event, lock_acquired_waittime, does not fire nearly as often,
> >> so the overhead produced by tracing it is much smaller.
> >>
> >> This is an example:
> >> | mitake@...1i:~/linux/.../tools/perf% ./perf bench sched messaging
> >> | # Running sched/messaging benchmark...
> >> | # 20 sender and receiver processes per group
> >> | # 10 groups == 400 processes run
> >> |
> >> | Total time: 1.712 [sec]
> >> | mitake@...1i:~/linux/.../tools/perf% sudo ./perf record -R -f -m 1024 -c 1 -e lock:lock_acquired_waittime:r ./perf bench sched messaging
> >> | # Running sched/messaging benchmark...
> >> | # 20 sender and receiver processes per group
> >> | # 10 groups == 400 processes run
> >> |
> >> | Total time: 1.754 [sec]
> >> | [ perf record: Woken up 1 times to write data ]
> >> | [ perf record: Captured and wrote 0.958 MB perf.data (~41837 samples) ]
> >>
> >> "perf bench sched messaging" is a kernel bound program, but this result
> >> shows that overhead is, at least on my Core i3 laptop, small enough
> for discarding
> >> especially from CPU time perspective.
> >>
> >> This new event is not enough for analyzing complete lock behaviour,
> >> but it makes it possible to measure the amount of time consumed by
> >> lock contention.
> >>
> >> So I think this patch is useful for real-world lock analysis.
> >> I'd like to write a new analyzer in Python which uses this event.
> >>
> >> Cc: Ingo Molnar <mingo@...e.hu>
> >> Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> >> Cc: Paul Mackerras <paulus@...ba.org>
> >> Cc: Arnaldo Carvalho de Melo <acme@...hat.com>
> >> Cc: Jens Axboe <jens.axboe@...cle.com>
> >> Cc: Jason Baron <jbaron@...hat.com>
> >> Cc: Xiao Guangrong <xiaoguangrong@...fujitsu.com>
> >> Signed-off-by: Hitoshi Mitake <mitake@....info.waseda.ac.jp>
> >>
> >> ---
> >
> >
> >
> > But those who only want to trace the time waited can just trace
> > the lock_acquire and lock_acquired events. That alone seems enough
> > to reduce the lock tracing overhead a bit.
> >
> > Also, there is a problem with:
> >
> > if (cond)
> > trace_x()
> > else
> > trace_y()
> >
> > It adds a new condition, only used by tracing, that will always be
> > checked, whether tracing is on or off. So we lose a cool feature of
> > tracepoints here: all tracing code is hidden behind a boolean wall,
> > so that the tracing-off path is optimized down to a single branch check:
> >
> > static inline void trace_##name(proto) \
> > { \
> > if (unlikely(__tracepoint_##name.state)) \
> > __DO_TRACE(&__tracepoint_##name, \
> > TP_PROTO(data_proto), \
> > TP_ARGS(data_args)); \
> > } \
> >
> >
> > When tracing is off, i.e. __tracepoint_##name.state == 0, the only
> > extra code executed is the check of __tracepoint_##name.state.
> >
> > With your new tracepoint, you've added another branch that is
> > always checked (unless !CONFIG_TRACEPOINTS, due to compiler
> > optimizations).
>
> Sorry, this branch is not a good choice.
> But for my purpose, the tracepoint enable/disable mechanism is not enough.
> Because I wanted to suppress the frequency of lock events for contention
> analysis, a branch judging whether contention occurred had to be added.
>
> >
> > But well, that's not very important. Lockdep is already debugging
> > code, not enabled in production kernels (at least it's not supposed to be).
> >
>
> I want to make lockdep a production kernel feature with dynamic patching.
> The current lock event tracing and the in-kernel lock statistics
> (/proc/lock_stat) both depend on lockdep. The overhead of these features
> must be separated so that each can be enabled individually.
>
> # BTW, I had completely misunderstood the principle of lockdep;
> # I think I finally understand it now :( So I can work on it.
>
> > The more important point is: do we really need this? Have you observed
> > lost events with only lock_acquire + lock_acquired events?
> >
> > You can check for lost events with: perf trace -d
> > You can also tune the size of the buffers with the -m option of
> > perf record. The default buffer size of perf record is not enough
> > for lock events.
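
(For the record, the kind of invocation suggested here would look something
like the following; the -m value is only an example, not a recommendation:)

| % sudo ./perf record -m 4096 -e lock:lock_acquire -e lock:lock_acquired ./perf bench sched messaging
| % ./perf trace -d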
>
> The purpose of this patch is to suppress the frequency of events;
> lost events are outside the scope of this patch.
>
> >
> >
> > Optimizing the size of the records is definitely something we want. But
> > I'm not sure this is the right way.
> >
> > Something like tuning the fields we want in a record seems to me a more
> > generic and wide-reaching tracing optimization, a gain that would be
> > propagated to every tracepoint and not only to a specific and precise
> > tracing workload on specific tracepoints.
> >
> > Look at this example:
> >
> > fweisbec@...here:~$ cat /debug/tracing/events/sched/sched_switch/format
> > name: sched_switch
> > ID: 29
> > format:
> > field:unsigned short common_type; offset:0; size:2; signed:0;
> > field:unsigned char common_flags; offset:2; size:1; signed:0;
> > field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> > field:int common_pid; offset:4; size:4; signed:1;
> > field:int common_lock_depth; offset:8; size:4; signed:1;
> >
> > field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
> > field:pid_t prev_pid; offset:28; size:4; signed:1;
> > field:int prev_prio; offset:32; size:4; signed:1;
> > field:long prev_state; offset:40; size:8; signed:1;
> > field:char next_comm[TASK_COMM_LEN]; offset:48; size:16; signed:1;
> > field:pid_t next_pid; offset:64; size:4; signed:1;
> > field:int next_prio; offset:68; size:4; signed:1;
> >
> > print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> next_comm=%s next_pid=%d next_prio=%d",
> > REC->prev_comm, REC->prev_pid, REC->prev_prio,
> > REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" },
> > { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R",
> > REC->next_comm, REC->next_pid, REC->next_prio
> >
> >
> > In typical scheduler tracing workloads, we don't need:
> > - common_flags (irq flags)
> > - common_preempt_count (if preempt_count != 0 on sched switch, it's a bug)
> > - common_pid (prev_pid == common_pid; maybe prev_pid should be dropped in fact)
> > - common_lock_depth (not used yet in perf tools, and the BKL is dying)
> > - prio (not used yet in perf sched)
> > - comm (we already have some comm resolution in perf)
> >
> > I'm just talking about some tracing workloads; typically perf sched won't
> > use the fields above. But other kinds of instrumentation may need the
> > prio, for example.
> >
> > I let you imagine how much space we can save if we can bring some choice
> > on the fields we want, depending on what the user wants, and so on.
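
(A rough calculation from the format dump above: a sched_switch record is
about 72 bytes, and the fields listed as unneeded (common_flags 1 byte,
common_preempt_count 1, common_pid 4, common_lock_depth 4, prev_comm 16,
next_comm 16, prev_prio 4, next_prio 4) add up to roughly 50 bytes, so
dropping them could shrink the record by more than half, ignoring
alignment padding.)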
>
> Yeah, this will contribute to reducing the overhead of tracing.
>
> >
> > If someone is willing to work on that, it would definitely be welcome.
> > I think we could use a kind of mask of the fields we want.
> >
> > Note I think that's easy for the common_* fields, as they are always the
> > same and filled from tracing_generic_entry_update(); for the others it
> > doesn't look easy, as we would need to add more macros inside
> > TP_fast_assign(). But having the support for the common_* fields would
> > already be a good thing.
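
(Just to illustrate the kind of "mask of fields" being suggested, here is a
small user-space toy, not kernel code; every name in it (the FIELD_* bits,
write_common_fields(), struct common_fields) is made up for the example.
It only shows a bitmask deciding which common_* fields get copied into the
record buffer:)

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define FIELD_FLAGS       (1u << 0)
	#define FIELD_PREEMPT     (1u << 1)
	#define FIELD_PID         (1u << 2)
	#define FIELD_LOCK_DEPTH  (1u << 3)

	struct common_fields {
		uint8_t flags;
		uint8_t preempt_count;
		int32_t pid;
		int32_t lock_depth;
	};

	/* Copy only the fields selected by mask into buf; return bytes used. */
	static size_t write_common_fields(char *buf, const struct common_fields *f,
					  unsigned int mask)
	{
		size_t len = 0;

		if (mask & FIELD_FLAGS) {
			memcpy(buf + len, &f->flags, sizeof(f->flags));
			len += sizeof(f->flags);
		}
		if (mask & FIELD_PREEMPT) {
			memcpy(buf + len, &f->preempt_count, sizeof(f->preempt_count));
			len += sizeof(f->preempt_count);
		}
		if (mask & FIELD_PID) {
			memcpy(buf + len, &f->pid, sizeof(f->pid));
			len += sizeof(f->pid);
		}
		if (mask & FIELD_LOCK_DEPTH) {
			memcpy(buf + len, &f->lock_depth, sizeof(f->lock_depth));
			len += sizeof(f->lock_depth);
		}
		return len;
	}

	int main(void)
	{
		struct common_fields f = { 0, 0, 1234, -1 };
		char buf[16];

		/* A perf-sched-like consumer might only ask for the pid. */
		printf("all fields: %zu bytes, pid only: %zu bytes\n",
		       write_common_fields(buf, &f, 0xf),
		       write_common_fields(buf, &f, FIELD_PID));
		return 0;
	}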
> >
> >
>
> After I finish my current pending work, such as lockdep enabling/disabling,
> and if no one else has done it by then, I'd like to work on this.
> Tracing overhead is a really critical problem for lock events,
> so I want to try any method that reduces it.
> # But I don't have much spare capacity, so this will be very slow :(
>
Oops, I received a mail from the mailer daemon of oracle.com saying that
"jens.axboe@...cle.com" is invalid.
I have added the address "jaxboe@...ionio.com" instead; sorry Jens.