Date:	Sun, 01 Aug 2010 14:36:50 +0900
From:	Hitoshi Mitake <mitake@....info.waseda.ac.jp>
To:	Frederic Weisbecker <fweisbec@...il.com>
CC:	Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
	h.mitake@...il.com, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Mackerras <paulus@...ba.org>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Jason Baron <jbaron@...hat.com>,
	Xiao Guangrong <xiaoguangrong@...fujitsu.com>
Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime"
 for contention analysis

On 07/29/10 01:39, Frederic Weisbecker wrote:
 > On Sun, Jul 25, 2010 at 01:55:29AM +0900, Hitoshi Mitake wrote:
 >> Recording every lock event produces too much overhead,
 >> both in CPU time and in memory.
 >> This overhead is a big problem that should be solved,
 >> but locks remain an important factor today.
 >>
 >> So I'd like to add a new lock event, "lock_acquired_waittime",
 >> for contention analysis. lock_acquired() calculates the time spent
 >> waiting for a lock from held_lock->waittime_stamp, which is set by
 >> lock_contended().
 >> If this value is 0, no contention occurred; if it is not 0,
 >> contention did occur.
 >> So I inserted a branch on this value into lock_acquired().
 >> The new lock_acquired() calls trace_lock_acquired() when waittime is 0,
 >> and calls trace_lock_acquired_waittime(), the function for the new event,
 >> when waittime is not 0.
 >>
 >> The traditional event, lock_acquired, fires at very high frequency.
 >> But the event lock_acquired_waittime does not, so the overhead
 >> produced by tracing it is much smaller.
 >>
 >> This is an example:
 >> | mitake@...1i:~/linux/.../tools/perf% ./perf bench sched messaging
 >> | # Running sched/messaging benchmark...
 >> | # 20 sender and receiver processes per group
 >> | # 10 groups == 400 processes run
 >> |
 >> |      Total time: 1.712 [sec]
 >> | mitake@...1i:~/linux/.../tools/perf% sudo ./perf record -R -f -m 1024 -c 1 -e lock:lock_acquired_waittime:r ./perf bench sched messaging
 >> | # Running sched/messaging benchmark...
 >> | # 20 sender and receiver processes per group
 >> | # 10 groups == 400 processes run
 >> |
 >> |      Total time: 1.754 [sec]
 >> | [ perf record: Woken up 1 times to write data ]
 >> | [ perf record: Captured and wrote 0.958 MB perf.data (~41837 samples) ]
 >>
 >> "perf bench sched messaging" is a kernel bound program, but this result
 >> shows that overhead is, at least on my Core i3 laptop, small enough
for discarding
 >> especially from CPU time perspective.
 >>
 >> This new event is not enough for analyzing complete lock behaviour.
 >> But measuring quantity of time consumed for lock contention will be
possible.
 >>
 >> So I think this patch is useful for real-world lock analysis.
 >> I'd like to write a new analyzer in Python that uses this event.
 >>
 >> Cc: Ingo Molnar<mingo@...e.hu>
 >> Cc: Peter Zijlstra<a.p.zijlstra@...llo.nl>
 >> Cc: Paul Mackerras<paulus@...ba.org>
 >> Cc: Arnaldo Carvalho de Melo<acme@...hat.com>
 >> Cc: Jens Axboe<jens.axboe@...cle.com>
 >> Cc: Jason Baron<jbaron@...hat.com>
 >> Cc: Xiao Guangrong<xiaoguangrong@...fujitsu.com>
 >> Signed-off-by: Hitoshi Mitake<mitake@....info.waseda.ac.jp>
 >>
 >> ---
 >
 >
 >
 > But those who only want to trace the time waited can just trace the
 > lock_acquire and lock_acquired events. That seems enough to reduce
 > the lock tracing overhead a bit.
 >
 > Also the problem with:
 >
 > if (cond)
 > 	trace_x()
 > else
 > 	trace_y()
 >
 > It adds a new condition, only used by tracing, that will always be
 > checked, whether tracing is on or off. So we lose a cool feature
 > of tracepoints here: all tracing code is hidden behind a boolean wall,
 > so that the tracing-off path is optimized down to a single branch check:
 >
 > static inline void trace_##name(proto)				\
 > {								\
 > 	if (unlikely(__tracepoint_##name.state))		\
 > 		__DO_TRACE(&__tracepoint_##name,		\
 > 			TP_PROTO(data_proto),			\
 > 			TP_ARGS(data_args));			\
 > }								\
 >
 >
 > When tracing is off, i.e. __tracepoint_##name.state == 0, the only
 > extra code executed is the check of __tracepoint_##name.state.
 >
 > With your new tracepoint, you've added another branch that is
 > always checked (unless !CONFIG_TRACEPOINTS, due to compiler
 > optimizations).

Sorry, this branch is not a good choice.
But for my purpose, the tracing enable/disable feature is not enough,
because I wanted to suppress the frequency of lock events for contention
analysis, so a branch judging whether contention occurred had to be added.
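
To spell out the idea, the core of the change is roughly this
(simplified sketch of lock_acquired() in kernel/lockdep.c, not the
exact diff; names and signatures may differ slightly):

	/*
	 * held_lock->waittime_stamp was set by lock_contended();
	 * if it is still zero, the lock was acquired without contention.
	 */
	if (hlock->waittime_stamp) {
		waittime = lockstat_clock() - hlock->waittime_stamp;
		/* contended: fire the new, much lower-frequency event */
		trace_lock_acquired_waittime(lock, ip, waittime);
	} else {
		/* uncontended: the common, high-frequency case */
		trace_lock_acquired(lock, ip);
	}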

 >
 > But well, that's not very important. Lockdep is already debugging
 > code, not enabled in production kernels (at least it's not supposed to be).
 >

I want to make lockdep a production-kernel feature via dynamic patching.
The current lock event tracing and the in-kernel lock statistics
(/proc/lock_stat) both depend on lockdep, so the overhead of these
features must be separable, paid only by whoever enables each of them.

# BTW, I had completely misunderstood the principle of lockdep;
# I think I finally understood it recently :( So I can work on it now.

 > The more important point is: do we really need this? Have you observed
 > lost events with only lock_acquire + lock_acquired events?
 >
 > You can check lost events with: perf trace -d
 > You can also tune the size of the buffers with the -m option in
 > perf record. The default size of perf record is not enough for lock
 > events.

The purpose of this patch is to suppress the frequency of events;
loss of events is out of the scope of this patch.
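
For reference, I take the check you suggest to be roughly this
(only an illustration, with a larger -m buffer than the default):

	sudo ./perf record -m 4096 -e lock:lock_acquire -e lock:lock_acquired \
		./perf bench sched messaging
	sudo ./perf trace -d	# reports whether any events were lost

but even when nothing is lost, it is the recording frequency itself
that I want to reduce.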

 >
 >
 > Optimizing the size of the records is definitely something we want. But
 > I'm not sure this is the right way.
 >
 > Something like tuning the fields we want in a record seems to me a
 > more generic and wide-ranging tracing optimization, a gain that would be
 > propagated to every tracepoint and not only to a specific and precise
 > tracing workload on specific tracepoints.
 >
 > Look at this example:
 >
 > fweisbec@...here:~$ cat /debug/tracing/events/sched/sched_switch/format
 > name: sched_switch
 > ID: 29
 > format:
 > 	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
 > 	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
 > 	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
 > 	field:int common_pid;	offset:4;	size:4;	signed:1;
 > 	field:int common_lock_depth;	offset:8;	size:4;	signed:1;
 >
 > 	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;	signed:1;
 > 	field:pid_t prev_pid;	offset:28;	size:4;	signed:1;
 > 	field:int prev_prio;	offset:32;	size:4;	signed:1;
 > 	field:long prev_state;	offset:40;	size:8;	signed:1;
 > 	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;	signed:1;
 > 	field:pid_t next_pid;	offset:64;	size:4;	signed:1;
 > 	field:int next_prio;	offset:68;	size:4;	signed:1;
 >
 > print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> next_comm=%s next_pid=%d next_prio=%d",
 > REC->prev_comm, REC->prev_pid, REC->prev_prio,
 > REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S" }, { 2, "D" },
 > { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R",
 > REC->next_comm, REC->next_pid, REC->next_prio
 >
 >
 > In typical scheduler tracing workloads, we don't need:
 > 	- common_flags (irq flags)
 > 	- common_preempt_count (if preempt_count != 0 on sched switch,
 > 	  it's a bug)
 > 	- common_pid (prev_pid == common_pid, maybe prev_pid should be
 > 	  dropped in fact)
 > 	- common_lock_depth (not used yet in perf tools, and the BKL is dying)
 > 	- prio (not used yet in perf sched)
 > 	- comm (we already have some comm resolution in perf)
 >
 > I'm just talking about some tracing workloads; typically perf sched
 > won't use the above. But other kinds of instrumentation may need
 > the prio, for example.
 >
 > I'll let you imagine how much space we can save if we can bring some
 > choice over the fields we want, depending on what the user wants,
 > and so on.

Yeah, this would contribute to reducing the overhead of tracing.

 >
 > If someone is willing to work on that, it would definitely be welcome.
 > I think we could use a kind of mask of the fields we want.
 >
 > Note I think that's easy for the common_* fields, as they are always the same
 > and filled from tracing_generic_entry_update(); for the others it doesn't
 > look easy, we would need to add more macros inside TP_fast_assign().
 > But having the support for the common_* fields would already be a good
 > thing.
 >
 >

After I finish my current pending work, like lockdep enabling/disabling,
and if no one else has done it by then, I'd like to work on it,
because tracing overhead is a really critical problem for lock events,
and I want to try any method that reduces it.
# But I don't have much spare capacity, so this will be very slow :(
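
Just to check that I understand the "mask of fields" idea, is it
something like the following? (Completely hypothetical names, only a
rough sketch of the common_* part.)

	/* Hypothetical per-event mask of the common_* fields the user
	 * actually wants recorded -- none of these names exist today. */
	#define TRACE_COMMON_FLAGS		(1U << 0)
	#define TRACE_COMMON_PREEMPT_COUNT	(1U << 1)
	#define TRACE_COMMON_PID		(1U << 2)
	#define TRACE_COMMON_LOCK_DEPTH		(1U << 3)

	/* Then tracing_generic_entry_update() would only fill (and the
	 * record layout would only reserve) the fields that are set: */
	if (mask & TRACE_COMMON_FLAGS)
		entry->flags = flags;
	if (mask & TRACE_COMMON_PID)
		entry->pid = current->pid;
	/* ...and so on for preempt_count and lock_depth. */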