[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090114095743.GA1389@elte.hu>
Date: Wed, 14 Jan 2009 10:57:43 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Paul Mackerras <paulus@...ba.org>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH] perf_counter: Add support for pinned and exclusive
counter groups
* Paul Mackerras <paulus@...ba.org> wrote:
> Impact: New perf_counter features
>
> A pinned counter group is one that the user wants to have on the CPU
> whenever possible, i.e. whenever the associated task is running, for a
> per-task group, or always for a per-cpu group. If the system cannot
> satisfy that, it puts the group into an error state where it is not
> scheduled any more and reads from it return EOF (i.e. 0 bytes read).
> The group can be released from error state and made readable again using
> prctl(PR_TASK_PERF_COUNTERS_ENABLE). When we have finer-grained
> enable/disable controls on counters we'll be able to reset the error
> state on individual groups.
ok, that looks like quite sensible semantics.
> An exclusive group is one that the user wants to be the only group using
> the CPU performance monitor hardware whenever it is on. The counter
> group scheduler will not schedule an exclusive group if there are
> already other groups on the CPU and will not schedule other groups onto
> the CPU if there is an exclusive group scheduled (that statement does
> not apply to groups containing only software counters, which can always
> go on and which do not prevent an exclusive group from going on). With
> an exclusive group, we will be able to let users program PMU registers
> at a low level without the concern that those settings will perturb
> other measurements.
ok, this sounds good too. The disadvantage is the reduction in utility.
Actual applications and users will decide which variant is more useful in
practice - it does not complicate the design unreasonably.
> Along the way this reorganizes things a little:
> - is_software_counter() is moved to perf_counter.h.
> - cpuctx->active_oncpu now records the number of hardware counters on
> the CPU, i.e. it now excludes software counters. Nothing was reading
> cpuctx->active_oncpu before, so this change is harmless.
(btw., the percpu allocation code seems to have bitrotten a bit - the
logic around perf_reserved_percpu looks wrong and somewhat complicated.)
> - A new cpuctx->exclusive field records whether we currently have an
> exclusive group on the CPU.
> - counter_sched_out moves higher up in perf_counter.c and gets called
> from __perf_counter_remove_from_context and __perf_counter_exit_task,
> where we used to have essentially the same code.
> - __perf_counter_sched_in now goes through the counter list twice, doing
> the pinned counters in the first loop and the non-pinned counters in
> the second loop, in order to give the pinned counters the best chance
> to be scheduled in.
hm, i guess this could be improved: by queueing pinned counters in front
of the list and unpinned counters to the tail.
> Note that only a group leader can be exclusive or pinned, and that
> attribute applies to the whole group. This avoids some awkwardness in
> some corner cases (e.g. where a group leader is closed and the other
> group members get added to the context list). If we want to relax that
> restriction later, we can, and it is easier to relax a restriction than
> to apply a new one.
yeah, agreed.
> What this doesn't handle is when a pinned counter gets inherited and
> goes into error state in the child. The sensible thing would be to put
> the parent counter into error state in __perf_counter_exit_task, but
> that might mean taking it off the PMU on some other CPU, and I was
> nervous about doing anything substantial to parent_counter. Maybe what
> we need instead of an error value for counter->state is a separate error
> flag that can be set atomically by exiting children. BTW, I think we
> need an smp_wmb after updating parent_counter->count, so that when the
> parent sees the child has exited it is guaranteed to see the updated
> ->count value.
Yeah. Such artifacts at inheritance stem from the reduction in utility
that comes from any exclusive-resource-usage scheme, and are expected.
For example, right now it works just fine to nest 'timec' in itself [there
is a reduction in statistical value if we start round-robining, but
there's still full utility]. If it used exclusive or pinned counters that
might not work.
Again, which restrictions users/developers are more willing to live with
will be shown in actual usage of these facilities.
Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists