lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALcN6mhMMLz1j1e5p=xukYp8LbRXM2zsUZMr0zpjcU2G-WMB6g@mail.gmail.com>
Date:   Tue, 10 Jan 2017 12:20:00 -0800
From:   David Carrillo-Cisneros <davidcc@...gle.com>
To:     Mark Rutland <mark.rutland@....com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Andi Kleen <ak@...ux.intel.com>,
        Kan Liang <kan.liang@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Borislav Petkov <bp@...e.de>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Vince Weaver <vince@...ter.net>, Paul Turner <pjt@...gle.com>,
        Stephane Eranian <eranian@...gle.com>
Subject: Re: [RFC 2/6] perf/core: add a rb-tree index to inactive_groups

On Tue, Jan 10, 2017 at 6:14 AM, Mark Rutland <mark.rutland@....com> wrote:
> On Tue, Jan 10, 2017 at 02:24:58AM -0800, David Carrillo-Cisneros wrote:
>> Add a rb-tree that indexes inactive events by {CPU/cgroup,flexible,stamp}.
>>
>> The original idea by Peter Z. was to sort task events in an rb-tree using
>> {pmu,cpu,timestamp} as key.
>>
>> Having the PMU as part of the key gets complicated for contexts that
>> share pmus (i.e. software context) because all events in a context should
>> rotate together irrespective of their pmu. It's also unclear to me that
>> there is any case where a seach by pmu is useful.
>
> Using the PMU as the key is essential for correct operation where
> heterogeneous HW PMUs share a context.
>
> For example, on a big.LITTLE system, big and little CPU PMUs share the
> same context, but their events are mutually incompatible. On big CPUs we
> only want to consider the sub-tree of big events, and on little CPUs we
> only want to consider little events. Hence, we need to be abel to search
> by PMU.

I see it now. So, if PMU were added to the rb-tree keys. How can the
generic code know what's the PMU of the current CPU?

>
> For SW PMUs, pmu::add() should never fail, and regardless of the order
> of the list we should be able to pmu::add() all events. Given that, why
> does the manner in which rotation occurs matter for SW PMUs?
>
>> Another complicatino is that using ctx->time (or timestamp) implies that
>> groups added during the same context switch may not have unique key.
>> This increases the complexity of that finds all events in the rb-tree
>> that are within a time interval.
>
> Could you elaborate on this? I don't understand what the problem is
> here. If we need uniqueness where {pmu,cpu,runtime} are equal, can't we
> extend the comparison to {pmu,cpu,runtime,event pointer}? That way
> everything we need is already implicit in the event, and we don't need
> perf_event::rbtree_key nor do we need
> perf_event_context::nr_inactive_added.

Yes, we could extend the comparison. But I am trying to keep the key a
u64 to speed up things.

I found it easier to simply create a counter and use it as an equivalent to
(timestamp, unique id). Both ways induce the same order of events.

>
> Using the runtime as part of the key was intended to provide fairer
> scheduling, even when scheduling intervals are not uniform, etc. Not
> using the runtime means we lose that fairness.

A counter that increases every time an event is added to the rbtree
induces the same order as a (timestamp, <some unique id>).

>
>> Lastly, it is useful to query pinned and flexible events separately since
>> they are scheduled in at different times.
>
> Using this as part of the sort makes sense to me.
>
>> For the reasons above, I created a rb-tree per context with key
>> {CPU,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for
>> CPU contexts.
>>
>> The "flexible" boolean allows to query pinned or flexible events
>> separately.
>> The stamp is given by a non-decreasing counter: ctx->nr_events_added.
>> It increases every time a new inactive event is inserted. That choice of
>> stamp guarantees unique keys for all events and that events of the same
>> type (same {CPU/cgroup,flexible}) have the same order in the rb-tree.
>
> As above, I think we can use the event pointer for this, without having
> to maintain a new counter.

Same as above. I am trying to have a unique 64 bit key. I'd argue that
the new counter only adds a couple of lines of code and 32 bits of storage
but makes the rb-tree key comparison simpler and faster.

>
>> When events are scheduled in or rotated, all events in the context must be
>> iterated or rotated together, irrespective of the CPU/cgroup. To do that,
>> we add ctx->inactive_groups, a list that "threads" the rb-tree in total
>> ctx->nr_events_added order. Note that this order is the same as timestamp
>> order and ctx->inactive_groups is used for both scheduling and iteration.
>> The rb-tree can be seen as an index over ctx->inactive_groups.
>
> As above, I think we must use runtime as part of the key, and with that
> done, we only need the rb tree, and not the inactive_groups list.

The inactive_groups list makes the rb-tree "threaded" and follows
timestamp (or stamp) order. It makes iterating over events much
faster.

>
> With runtime as part of the key, rotation is not necessary, since
> insertion will sort the list, and then we can pick all the nodes with
> the shortest runtime. Hence, no special rotation code is necessary -- we
> simply make all events inactive, then reschedule.

Right, that's done later in this series.

>
>> Signed-off-by: David Carrillo-Cisneros <davidcc@...gle.com>
>> ---
>>  include/linux/perf_event.h |  5 +++
>>  kernel/events/core.c       | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 87 insertions(+)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 3fa18f05c9b0..fd32ecc37d33 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -564,6 +564,9 @@ struct perf_event {
>>       struct list_head                group_entry;
>>       struct list_head                sibling_list;
>>
>> +     u64                             rbtree_key;
>> +     struct rb_node                  rbtree_node;
>> +
>>       /*
>>        * We need storage to track the entries in perf_pmu_migrate_context; we
>>        * cannot use the event_entry because of RCU and we want to keep the
>> @@ -736,6 +739,8 @@ struct perf_event_context {
>>       struct list_head                pinned_groups;
>>       struct list_head                flexible_groups;
>>
>> +     struct rb_root                  rbtree_root;
>> +     u32                             nr_inactive_added;
>>       struct list_head                active_pinned_groups;
>>       struct list_head                active_flexible_groups;
>>       struct list_head                inactive_groups;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b744b5a8dbd0..623d81c0ca93 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1462,19 +1462,98 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>>               return &ctx->flexible_groups;
>>  }
>>
>> +/*
>> + * The bits perf_event::kbtree_key represent:
>> + *   - 63:33 an unique identifier for CPU (if a task context) or a cgroup
>> + *     (if a CPU context).
>> + *   - 32    a boolean to indicate if eventt is flexible (vs  pinnned).
>> + *   - 31:0  a unique "stamp" that follows the last time the event was
>> + *   scheduled.
>> + * The 64 bits value groups event of the same type (CPU/cgroup + flexible)
>> + * together in the rb-tree.
>> + */
>> +#define RBTREE_KEY_STAMP_WIDTH               32
>> +#define RBTREE_KEY_STAMP_MASK                GENMASK_ULL(RBTREE_KEY_STAMP_WIDTH - 1, 0)
>> +#define RBTREE_KEY_FLEXIBLE_MASK     BIT_ULL(RBTREE_KEY_STAMP_WIDTH)
>> +
>> +static u64 taskctx_rbtree_key(int cpu, bool flexible)
>> +{
>> +     /*
>> +      * Use CPU only. PMU is never used in schedule in/out and, since  some
>> +      * contexts share PMU, iterate over them would make things complicated.
>> +      * I could not find a case where an ordered iteration over all PMU
>> +      * events in one context is useful.
>> +      */
>> +     return ((u64)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
>> +             (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
>> +}
>> +
>> +static u64 cpuctx_rbtree_key(struct perf_cgroup *cgrp, bool flexible)
>> +{
>> +     u64 k;
>> +
>> +     if (cgrp)
>> +             /* A cheap way to obtain an identifier for a cgroup. Suggestions appreciated. */
>> +             k = (u64)cgrp->css.id << (RBTREE_KEY_STAMP_WIDTH + 1);
>> +     else
>> +             k = GENMASK_ULL(63, RBTREE_KEY_STAMP_WIDTH + 1);
>> +     return k | (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
>> +}
>
> Can we turn these info {task,cpu}ctx_rbtree_cmp() functions which
> compare two events dynamically?
>
> Given that event->task can distinguish the two cases, we could just have
> a event_rbtree_cmp() (with an assert that event_a->task ==
> event_b->task).

I tried to avoid recreating the key every time there is a comparison.
Currently the key is stored into the u64 event->rbtree_key and only recreated
when the stamp changes.

The cmp wrapper makes sense if we go away from the u64 keys and add
more fields to the comparison.

In summary:

  - The stamp (i.e. ctx->nr_inactive_added) induces the same order than the
  timestamp at the time the event is added to the rb-tree. But it guarantees
  uniqueness and is denser (fits in 32bits).

  - The inactive_groups list threads the rb-tree and is sorted in stamp order
 (same as timestamp order). This is the list used to iterate during
ctx_sched_{in,out}.
 The rb-tree only creates a index (or skiplist) on these list so that we skip
  most of the events that do not mach the current CPU/cgroup (or pmu,
optionally).

  - I see now why PMU is useful for big.LITTLE. If there is a way for
generic code
  to retrieve the PMU of a CPU, I can add it to the key for those PMUs with
  PERF_PMU_CAP_HETEROGENEOUS_CPUS capability.

>
> Thanks,
> Mark.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ