Date:	Fri, 5 Dec 2008 09:08:13 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Paul Mackerras <paulus@...ba.org>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	LKML <linux-kernel@...r.kernel.org>, linux-arch@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Stephane Eranian <eranian@...glemail.com>,
	Eric Dumazet <dada1@...mosbay.com>,
	Robert Richter <robert.richter@....com>,
	Arjan van de Veen <arjan@...radead.org>,
	Peter Anvin <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Steven Rostedt <rostedt@...dmis.org>,
	David Miller <davem@...emloft.net>
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Paul Mackerras <paulus@...ba.org> wrote:

> Ingo Molnar writes:
> > 
> > * Paul Mackerras <paulus@...ba.org> wrote:
> [snip]
> > > One thing that this sort of thing can't do is to get values from 
> > > multiple counters that correlate with each other.  For instance, we 
> > > would often want to count, say, L2 cache misses and instructions 
> > > completed at the same time, and be able to read both counters at very 
> > > close to the same time, so that we can measure average L2 cache misses 
> > > per instruction completed, which is useful.
> > 
> > This can be done in a very natural way with our abstraction, and the 
> > "hello.c" example happens to do exactly that:
> 
> Has hello.c been posted?  I can't find it in any of the posts from you 
> or Thomas.  Am I just being blind? :)

Sorry, it was late at night when we did the release - monitor.c was posted - 
and i just posted hello.c half an hour ago :)
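
To give the flavour (just a sketch - the helper name and the assumption 
that a read() returns a single 64-bit counter value are mine, the real 
code is in the posted hello.c), the readout side boils down to this:

/*
 * Sketch of the readout loop only - how the two counter fds get opened
 * is done in the posted hello.c via the new counter syscall and is not
 * repeated here.  Each counter is an ordinary fd, and a read() is
 * assumed to return its current 64-bit value, so "reading both at the
 * same time" is simply two back-to-back read() calls.
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

static void print_counter_deltas(const int fd[2], uint64_t prev[2])
{
	int i;

	for (i = 0; i < 2; i++) {
		uint64_t now = 0;

		if (read(fd[i], &now, sizeof(now)) != sizeof(now)) {
			perror("read");
			return;
		}
		printf("counter[%d] delta: %llu\n",
		       i, (unsigned long long)(now - prev[i]));
		prev[i] = now;
	}
}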

> >   aldebaran:~/perf-counter-test> ./hello
> >   doing perf_counter_open() call:
> >   counter[0]... fd: 3.
> >   counter[1]... fd: 4.
> >   counter[0] delta: 10866 cycles
> >   counter[1] delta: 414 cycles
> >   counter[0] delta: 23640 cycles
> >   counter[1] delta: 3673 cycles
> >   counter[0] delta: 28225 cycles
> >   counter[1] delta: 3695 cycles
> > 
> > This counts cycles executed and instructions executed, and reads the two 
> > counters out at the same time.
> 
> Isn't it two separate read() calls to read the two counters?  If so, 
> the only way the two values are actually going to correspond to the 
> same point in time is if the task being monitored is stopped.  In which 
> case the monitoring task needs to use ptrace or something similar in 
> order to make sure that the monitored task is actually stopped.

In practice the small gap between the two read()s does not matter.

Also, look at our code: we buffer notification events and do not have to 
stop the thread for recording the context information.

Also, if you _do_ care about getting immediate readouts, the _monitoring_ 
task can be set to higher priority. (not that i'd advocate it in general: 
any task stopping or scheduling can destroy a workload's true behavior)
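
(If you do want to go that way, it is just the standard POSIX scheduler 
call - a minimal sketch, the priority value is an arbitrary pick:)

#include <sched.h>
#include <stdio.h>

/* Give the monitoring task RT priority so its read()s happen promptly.
 * Intrusive - only worth it if immediate readouts really matter. */
static int boost_monitor_prio(void)
{
	struct sched_param sp = { .sched_priority = 1 };

	if (sched_setscheduler(0 /* current task */, SCHED_FIFO, &sp) < 0) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}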

> If the monitored task is not stopped, then the interval between the two 
> reads will be sufficient to render the results useless - particularly 
> since the monitoring task could get preempted for an arbitrary length 
> of time between the two reads.  But even if it doesn't, the hundreds of 
> cycles between the two reads will introduce considerable imprecision in 
> the results.

Even if the two read()s are done separately, stopping a task is _far_ more 
intrusive to the event flow of a single application. Most workloads are 
multithreaded - so stopping a task causes another task to be scheduled 
in, which would not have occurred had the profiling been more transparent 
and less intrusive.

Furthermore, even for the special case of single task monitoring, a 
context-switch is more expensive than a system call.

Furthermore, in most practical cases there are very few events happening 
between the two read()s: the profiling interval is a couple of orders of 
magnitude longer than the interval between two read()s.

This 'task has to be stopped' aspect is a red herring that has no 
technical basis.

> There really is value in being able to read all the counters you're 
> using in one system call.

It's possible with our code too: what you are asking for is in essence a 
sys_read_fds() system call extension - a bit like readv(), but from a 
vector of separate fds.
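
(Until something like that exists it can be approximated in user-space - 
a sketch, assuming each counter fd yields one 64-bit value per read():)

#include <stdint.h>
#include <unistd.h>

/*
 * User-space approximation of such a sys_read_fds(): read one value
 * from each fd in the vector, back to back.  Returns the number of fds
 * read successfully - and since every counter is its own fd, a failure
 * is trivially attributable to one specific counter.
 */
static int read_counter_fds(const int *fds, uint64_t *vals, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (read(fds[i], &vals[i], sizeof(vals[i])) != sizeof(vals[i]))
			break;
	}
	return i;
}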

This kind of 'group system call facility' has been suggested several 
times in the past - but it never got anywhere, because system calls are 
cheap enough that the extra call really does not count in practice.
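
(This is easy to sanity-check by timing a loop of ordinary read()s - a 
rough sketch, absolute numbers will of course vary by CPU and kernel:)

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Rough estimate of per-read() cost, to show that a couple of extra
 * system calls are noise next to typical profiling intervals. */
int main(void)
{
	char buf[8];
	struct timespec t0, t1;
	long i, loops = 1000000;
	int fd = open("/dev/zero", O_RDONLY);

	if (fd < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < loops; i++)
		read(fd, buf, sizeof(buf));
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.0f ns per read()\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
	        (t1.tv_nsec - t0.tv_nsec)) / loops);
	close(fd);
	return 0;
}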

It could be implemented, and note that because our code uses a proper 
Linux file descriptor abstraction, such a sys_read_fds() facility would 
help _other_ applications as well, not just performance counters.

But it brings complications: demultiplexing of error conditions on 
individual counters is a real pain with any compound abstraction. We very 
consciously went with the 'one fd, one object, one counter' design.

	Ingo
