lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20071114130909.GB6557@frankl.hpl.hp.com>
Date:	Wed, 14 Nov 2007 05:09:09 -0800
From:	Stephane Eranian <eranian@....hp.com>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	akpm@...l.org, Robert Richter <robert.richter@....com>,
	gregkh@...e.de, linux-kernel@...r.kernel.org,
	William Cohen <wcohen@...hat.com>
Subject: Re: [perfmon] Re: [perfmon2] perfmon2 merge news

Andi,

On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote:
> 
> [dropped all these bouncing email lists. Adding closed lists to public
> cc lists is just a bad idea]
> 

Just want to make sure perfmon2 users participate in this discussion.

> > int
> > main(int argc, char **argv)
> > {
> > 	int ctx_fd;
> > 	pfarg_pmd_t pd[1];
> > 	pfarg_pmc_t pc[1];
> > 	pfarg_ctx_t ctx;
> > 	pfarg_load_t load_args;
> > 
> > 	memset(&ctx, 0, sizeof(ctx));
> > 	memset(pc, 0, sizeof(pc));
> > 	memset(pd, 0, sizeof(pd));
> > 
> > 	/* create session (context) and get file descriptor back (identifier) */
> > 	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);
> 
> There's nothing in your example that makes the file descriptor needed.
> 

Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and full sampling buffer. You extract notification messages via read() and you can
use SIGIO, select/poll.

The example shows how you can leverage existing mechanisms to destroy the session, i.e.,
free the associated kernel resources. For that, you use close() instead of adding yet
another syscall. It also provides a resource limitation mechanisms to control consumption
of kernel memory, i.e., you can only create as many sessions as you can have open files.

> > 
> > 	/* setup one config register (PMC0) */
> > 	pc[0].reg_num   = 0
> > 	pc[0].reg_value = 0x1234;
> 
> That would be nicer if it was just two arguments.
> 
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to setup: one
syscall per register. The perfmon syscalls to read/write registers accept vector
of arguments to amortize the cost of the syscall over multiple registers
(similar to poll(2)).

With many tools, registers are not just setup once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state and that is on the critical
path.

You do not want a call that programs the entire PMU state all at once either. Many times,
you only want to modify a small subset. Having the full state does also cause some portability
problems.


> > 
> > 	/* setup one data register (PMD0) */
> > 	pd[0].reg_num = 0;
> > 	pd[0].reg_value = 0;
> 
> Why do you need to set the data register? Wouldn't it make
> more sense to let the kernel handle that and just return one.
> 
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is the
case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
well, and some may need to be initialized to non zero value, i.e., the IBS sampling
period.

With event-based sampling,  the period is expressed as the number of occurrences
of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
The way you express this with perfmon2 is that you program a counter to measure
L2 cache misses, and then you initialize the corresponding data register (counter)
to overflow after 2000 occurrences. Given that the interface guarantees all counters
are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
Thus you see that you need a call to actual program the data registers.

> > 
> > 	/* program the registers */
> > 	pfm_write_pmcs(ctx_fd, pc, 1);
> > 	pfm_write_pmds(ctx_fd, pd, 1);
> > 
> > 	/* attach the context to self */
> > 	load_args.load_pid = getpid();
> > 	pfm_load_context(ctx_fd, &load_args);
> 
> My replacement would be to just add a flags argument to write_pmcs 
> with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT"
> > 

You are mixing PMU programming with the type of measurement you want to do.

Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
before you attach to either a CPU or a thread. This way, you can prepare your measurement
and then attach-and-go. Thus is is possible to create batches of ready-to-go sessions.
That is useful, for instance, when you are trying to measure across fork, pthread_create
which you can catch on-the-fly.

Take the per-thread example, you can setup your session before you fork/exec the program
you want to measure.

Note also that perfmon2 supports attaching to an already running thread. So there is
more than "GLOBAL CONTEXT" versus "MY CONTEXT".


> > 	/* activate monitoring */
> > 	pfm_start(ctx_fd, NULL);
> 
> Why can't that be done by the call setting up the register?
> 

Good question. If you do what say, you assume that the start/stop bit lives in the
config (or data) registers of the PMU. This is not true on all hardware. On Itanium
for instance, the start/stop bit is part of the Processor Status Register (psr).
That is not a PMU register.

On X86, you set the enable bit the PERFEVTSEL, but nothing really happens until you issue
pfm_start(), i.e., the PERFEVTSEL registers are not touched until then.

> Or if someone needs to do it for a specific region they can read
> the register before and then afterwards.
> 
> > 
> > 	/*
> > 	 * run code to measure
> > 	 */
> > 
> > 	/* stop monitoring */
> > 	pfm_stop(ctx_fd);
> > 
> > 	/* read data register */
> > 	pfm_read_pmds(ctx_fd, pd, 1);
> 
> On x86 i think it would be much simpler to just let the set/alloc
> register call return a number and then use RDPMC directly. That would
> be actually faster and be much simpler too.
> 
One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
a self-monitoring thread from reading the counters directly. You'll just get the
lower 32-bit of it. So if you read frequently enough, you should not have a problem.

But keep in mind that we do want a uniform interface across all hardware and all type
of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so
on. You want an interface that guarantees that with pfm_read_pmds() you'll be able to
read on any hardware platforms, then on some you may be able to use a more efficient
method, e.g., rdpmc on X86.

Reducing performance monitoring to self-monitoring is not what we want. In fact, there
are only a few domains where you can actually do this and HPC is one of them. But in 
many other situations, you cannot and don't want to have to instrument applications
or libraries to collect performance data. It is quite handy to be able to do:
	$ pfmon /bin/ls
or
	$ pfmon --attach-task=`pidof sshd` -timeout=10s


Also note that there is no guarantee that RDPMC allows you to access all data registers
on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
RDPMC.


> I suppose most architectures have similar facilities, if not a call could be 
> added for them but it's not really essential. The call might be also needed
> for event multiplexing, but frankly I would just leave that out for now.
> 
Itanium does allow user level read of data registers. It also allows start/stop.
Perfmon2 allows this only for self-monitoring per-thread sessions.

I think restricting per-thread mode to only self-monitoring is just too limiting
even for a start.


> e.g. here is one use case I would personally see as useful. We need
> a replacement for simple cycle counting since RDTSC doesn't do that anymore
> on modern x86 CPUs.  It could be something like:
> 
You can do exactly this with the perfmon2 interface as it exists today.
Your example is perfectly fine, your interface works in your case.

But you are driving the design of the interface from your very specific need
and you are ignoring all the other usage models. This has been a problem with so
many other interfaces and that explains the current situation. You have to
take a broader view, look at what the hardware (across the board) provides and
build from there. We do not need yet another interface to support one tool or one
type of measurement, we need a true programming interface with a uniform set
of calls. So sure, several calls may look overkill for basic measurements, but
they become necessary with others.

> 	/* 0 is the initial value */
> 
> 	/* could be either library or syscall */
> 	event = get_event(COUNTER_CYCLES); 
> 	if (event < 0) 
> 		/* CPU has no cycle counter */
> 
> 	reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */
> 
> 	rdpmc(reg, start);
> 	.... some code to run ...
> 	rdpmc(reg, end);
> 
> 	free_perfctr(reg);	/* syscall */
> 
-- 
-Stephane
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ