[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALcN6miwpf1aFtD+hvJcNS9eLVotNM-jbBbFuppBgad-TUyF7A@mail.gmail.com>
Date: Tue, 10 Jan 2017 12:51:58 -0800
From: David Carrillo-Cisneros <davidcc@...gle.com>
To: Mark Rutland <mark.rutland@....com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Kan Liang <kan.liang@...el.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
"x86@...nel.org" <x86@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Andi Kleen <ak@...ux.intel.com>, Borislav Petkov <bp@...e.de>,
Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Vince Weaver <vince@...ter.net>, Paul Turner <pjt@...gle.com>,
Stephane Eranian <eranian@...gle.com>
Subject: Re: [RFC 3/6] perf/core: use rb-tree to sched in event groups
On Tue, Jan 10, 2017 at 8:38 AM, Mark Rutland <mark.rutland@....com> wrote:
> Hi,
>
> On Tue, Jan 10, 2017 at 02:24:59AM -0800, David Carrillo-Cisneros wrote:
>> During sched in, only events that match the current CPU and cgroup can
>> be scheduled in. These events should be added to the PMU in increasing
>> timestamp order to guaratee fairness in the event rotation.
>>
>> In task contexts, events with no CPU affinity or with affinity to the
>> current CPU are eligible.
>> In CPU contexts, events with no cgroup restriction (CPU events) or with
>> a cgroup that is current (either current's task cgroup or one of its
>> ancestor) are eligible.
>>
>> For both context types, we need to iterate over events with different
>> rb-tree key groups (different CPUs or different cgroups) in timestamp
>> order.
>>
>> Finding the events in timestamp order is at odds with indexing by
>> by CPU/cgroup.
>
> That's a fair point. Sorting by CPU before runtime we'll get subtrees we
> won't get fairness unless we sort the events solely by runtime at
> sched_in time. If we sort by with runtime before CPU we'll have to skip
> events not targeting the current CPU when scheduling task events in. I
> note the latter is true today anyhow.
That's were ctx->inactive_groups comes in. That list is sorted by runtime
and the rb-tree is used to skip to the part of the list that has the events
that matter.
>
> In Peter's original suggestion we didn't sort by cgroup. IIRC there was
> some email thread where the cgroup was considered for the sort (maybe
> that was *only* for cpu contexts? I'm not too familiar with cgroups),
> though I can't find the relevant mail, if it existed. :/
FWIW, in this approach, we only sort by cgroup in CPU contexts, since cgroup
events are only installed in CPU contexts.
>
> Peter, did you have an idea as to how to handle that, or am I imagining
> things here?
>
> Kan, in your per-cpu event list patch you mentioned that you saw a large
> overhead in perf_iterate_ctx() when skipping events for other CPUs.
> Which callers of perf_iterate_ctx() specifically was that problematic
> for? Do those callers only care about the *active* events, for example?
>
> Maybe the overhead of skipping !current_cpu events is ok at sched_in
> time in most cases. If the overhead of skipping those only matters for a
> subset of perf_iterate_ctx() callers, then maybe we can optimise them in
> another fashion (e.g. use the active events lists, or a new list
> specific to that iterate user, depending on what they actually need).
> That way we can drop cpu from the sort.
>
>> The rb-tree allows us to find events with minimum and
>> maximum timestamp for a given CPU/cgroup + flexible type. The list
>> ctx->inactive_groups is sorted by timestamp.
>>
>> We could find a list position for the first event of each CPU/cgroup that
>> is to be scheduled and iterate over all of them, selecting events from
>> the list's head with the smallest timestampt, but it's too complicated.
>>
>> A simpler alternative is to find the smallest subinterval of
>> ctx->inactive_groups that contains all eligible events. Let's call this
>> minimum subinterval S.
>>
>> S is formed of smaller subintervals, no necessarily exclusive, intervals.
>> Each one has all the events that are eligible for a given CPU or cgroup.
>> We find S by searching for the start/end of each one of these CPU/cgroup
>> subintervals and combining them. The drawback is that there may be
>> events in S that are not eligible (since ctx->inactive_group is in stamp
>> order).
>
> The other drawback is that this is not fair, since CPU comes before
> runtime in the sort order. You'll always try some events before others
> (e.g. cpu == -1 before cpu == current), before considering runtime. I
> believe this means that events can be permanently starved.
>
> So either we need to fold those together somehow, or drop CPU from the
> sort order (assuming that we can, as above).
We do that by iterating over ctx->inactive_groups, that is sorted by timestamp.
That's how we keep the fairness.
>
> Thanks,
> Mark.
Powered by blists - more mailing lists