[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABPqkBSyYjM7V8cy--WRBHJ8o9TU9WUD9mGE1E5HntYdm2BHyw@mail.gmail.com>
Date: Mon, 14 Nov 2011 13:55:04 +0100
From: Stephane Eranian <eranian@...gle.com>
To: Robert Richter <robert.richter@....com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"mingo@...e.hu" <mingo@...e.hu>,
"ming.m.lin@...el.com" <ming.m.lin@...el.com>,
"ak@...ux.intel.com" <ak@...ux.intel.com>
Subject: Re: [PATCH] perf_events: fix and improve x86 event scheduling
Robert,
On Thu, Nov 10, 2011 at 7:03 PM, Robert Richter <robert.richter@....com> wrote:
> On 07.11.11 06:01:49, Stephane Eranian wrote:
>>
>> Major rewrite of the x86 event scheduling routine.
>> The routine is shared between AMD and Intel.
>>
>> The existing code had an issue with certain combinations
>> of constraints. As Robert Richter @ AMD demonstrated on
>> AMD Bulldozer:
>>
>> e0 [3,0]
>> e1 [2:0]
>> e2 [2:0]
>> e3 [2:0]
>>
>> With this list of events, the existing scheduling algorithm
>> would never schedule e3. That's because it would always choose
>> counter 0 for e0:
>> e0 -> 0
>> e1 -> 1
>> e2 -> 2
>> e3 -> X
>>
>> Robert Richter proposed a fix to the algorithm which was based
>> on tagging constraints that overlap. He was using rollbacks to
>> simulate recursion.
>>
>> We propose an alternate solution which is simpler, faster. This
>> is a small modification to the existing algorithm. It adds some
>> smart in how a counter is chosen for a given event. The first
>
> Posting a link to my patch set for reference:
>
> https://lkml.org/lkml/2011/9/10/51
>
> What are the reasons to your alternate solution? Is it recursion, code
> complexity, or a use case it does not fit in? I see recursion as the
> main concern with my patch set, but its impact is known and limited.
> Anyway, a algorithm without recursion would be generally better.
>
>> available counter is not systemtically selected. Instead, the
>> algorithm checks how the constraints between the events overlap.
>> For each counter, it assigns a weight which is the number of events
>> which it can count for the event set. For each event, the algorithm
>> assigns the counter with the smallest weight first.
>
> But this algorithm does not work for all cases and does not solve the
> problem in general. Your idea to have weights for counters might be
> the right approach.
>
> E.g. the algorithm fails with (all weights are the same):
>
> c0 c1 c2
> e0 x x
> e1 x x
> e2 x x
>
> ... leading to:
>
> e0 -> c0
> e1 -> C1
> e3 -> X
>
> You basically have to recalculate the weights after you had assigned a
> counter.
>
Yes, it does not yield an assignment which maximizes the PMU usage.
I have been talking with co-workers experts in operational research
about our problem. They all pointed to me to the max flow algorithm from
Ford-Fulkerson (search for it on Wikipedia). I think it solves the complexity
and recursion problems. My understanding is that the complexity is also
more under control.
I started experimenting with this algorithm. I will report in a few days.
One thing for sure, it does provide the 'maximized' answer for your
example above and also for your initial BD example.
> But even if we do this, I am still not sure if that would cover all
> cases.
>
>>
>> In the example above, the new algoritm assigns:
>> e0 -> 3
>> e1 -> 0
>> e2 -> 1
>> e3 -> 2
>>
>> Because:
>> counter 0, weight = 4
>> counter 1, weight = 3
>> counter 2, weight = 3
>> counter 3, weight = 1
>>
>> We maximize PMU usage and ensure all events are measured.
>>
>> The patch also restructures the code to separate scheduling of
>> constrained vs. unconstrained events. An unconstrained event is
>> one that can be programmed into any of the generic counters. For
>> those, we now use the simplest algorihtm possible: use next free
>> counter. There is now a fast path which is beneficial on
>> processors such as AMD64 Fam10h.
>
> I don't see the need for a differentiation between constraint and
> unconstraint events. If loops are optimized in the constraint path
> there is not much overhead anymore. This could be done by specifying
> min and max limits for ranges. Special cases (unconstraint events,
> fixed counter, etc.) make the code more complex. I don't think a good
> algorithm will need this.
>
>> @@ -553,42 +530,179 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>> if (x86_pmu.num_counters_fixed)
>> wmax++;
>>
>> - for (w = 1, num = n; num && w <= wmax; w++) {
>> - /* for each event */
>> - for (i = 0; num && i < n; i++) {
>> - c = constraints[i];
>> - hwc = &cpuc->event_list[i]->hw;
>> + /*
>> + * assign from most constrained to least constrained
>> + */
>> + for (w = 1, num = num_c; num && w <= wmax; w++) {
>> + /* for each constrained event */
>> + for (i = 0, e = c_events; i < num_c; i++, e++) {
>> +
>> + map_idx = (*e)->hw.map_idx;
>> + c = constraints[map_idx];
>>
>> if (c->weight != w)
>> continue;
>>
>> + min_wcnt = INT_MAX;
>
> Should be X86_PMC_IDX_MAX.
>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 1e9ebe5..2605244 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -564,6 +564,7 @@ struct hw_perf_event {
>> int idx;
>> int last_cpu;
>> struct hw_perf_event_extra extra_reg;
>> + int map_idx;
>
> This is not the right place. It is for all architectures but actually
> locally only used. A local array in x86_schedule_events() would work
> too.
>
> -Robert
>
>> };
>> struct { /* software */
>> struct hrtimer hrtimer;
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists