Message-ID: <b4b4afde-744e-b6a1-4822-b09078607cb7@linux.intel.com>
Date: Mon, 19 Jun 2017 16:08:32 +0300
From: Alexey Budankov <alexey.budankov@...ux.intel.com>
To: Mark Rutland <mark.rutland@....com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Andi Kleen <ak@...ux.intel.com>,
Kan Liang <kan.liang@...el.com>,
Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
Valery Cherepennikov <valery.cherepennikov@...el.com>,
David Carrillo-Cisneros <davidcc@...gle.com>,
Stephane Eranian <eranian@...gle.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/n] perf/core: addressing 4x slowdown during
per-process profiling of STREAM benchmark on Intel Xeon Phi
On 16.06.2017 1:10, Alexey Budankov wrote:
> Hi,
>
> On 15.06.2017 22:56, Mark Rutland wrote:
>> Hi,
>>
>> On Thu, Jun 15, 2017 at 08:41:42PM +0300, Alexey Budankov wrote:
>>> This series of patches continues v2 and addresses captured comments.
>>
>> As a general note, please keep any change log stuff out of the commit
>> message, and move that to just before the diff. It generally doesn't make
>> sense for that to go in the commit message.
>
> Accepted.
>
>>
>>> Specifically this patch replaces pinned_groups and flexible_groups
>>> lists of perf_event_context by red-black cpu indexed trees avoiding
>>> data structures duplication and introducing possibility to iterate
>>> event groups for a specific CPU only.
>>
>> If you use --per-thread, I take it the overhead is significantly
>> lowered?
>
> Could you please elaborate on the question?
>
>>
>> As another general thing, please aim to write the commit message so that
>> it'll make sense in N years' time, in the git history. Describe the
>> whole problem, and the intended solution.
>>
>> I had to go diving into mail archives to understand the problem you're
>> trying to solve, and I shouldn't have to.
>>
>> For example, this commit message could be something like:
>>
>> perf/core: use rb trees for pinned/flexible groups
>>
>> By default, the userspace perf tool opens per-cpu task-bound events
>> when sampling, so for N logical events requested by the user, the tool
>> will open N * NR_CPUS events.
>>
>> In the kernel, we mux events with a hrtimer, periodically rotating the
>> event list and trying to schedule each in turn. We skip events whose
>> cpu filter doesn't match. So when we get unlucky, we can walk N *
>> (NR_CPUS - 1) events pointlessly for each hrtimer invocation.
>>
>> This has been observed to result in significant overhead when running
>> the STREAM benchmark on 272 core Xeon Phi systems.
>>
>> One way to avoid this is to place our events into an rb tree sorted by
>> CPU filter, so that our hrtimer can skip to the current CPU's
>> sub-tree, and ignore everything else.
>>
>> As a step towards that, this patch replaces our event lists with rb
>> trees.
>> ... which hopefully makes sense now, and will in 2, 5, 10 years' time,
>>
>
> Accepted. Thanks.
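To put rough numbers on the problem described above (illustrative only,
assuming four requested events): on a 272-CPU machine perf opens
4 * 272 = 1088 events for the profiled process, and an unlucky hrtimer
rotation on a single CPU can walk up to 4 * 271 = 1084 groups whose cpu
filter can never match there.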
>
>>>
>>> Signed-off-by: Alexey Budankov <alexey.budankov@...ux.intel.com>
>>
>> Have you thrown Vince's perf fuzzer at this?
>>
>> If you haven't, please do. It can be found in the fuzzer directory of:
>>
>> https://github.com/deater/perf_event_tests
>>
>
> Accepted.
I ran the test suite and it revealed no regressions beyond what I already
see on the clean kernel.
However, the fuzzer consistently reports some strange stacks that do not
show up on the clean kernel, and I have no idea yet how the patches could
cause them.
>
>>> ---
>>> include/linux/perf_event.h | 34 +++-
>>> kernel/events/core.c | 386 +++++++++++++++++++++++++++++++++------------
>>> 2 files changed, 315 insertions(+), 105 deletions(-)
>>
>> <changelog goes here>
>>
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index 24a6358..2c1dcf1 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -574,6 +574,27 @@ struct perf_event {
>>> struct list_head sibling_list;
>>>
>>> /*
>>> + * Node on the pinned or flexible tree located at the event context;
>>> + * the node may be empty in case its event is not directly attached
>>> + * to the tree but to group_list list of the event directly
>>> + * attached to the tree;
>>> + */
>>> + struct rb_node group_node;
>>> + /*
>>> + * List keeps groups allocated for the same cpu;
>>> + * the list may be empty in case its event is not directly
>>> + * attached to the tree but to group_list list of the event directly
>>> + * attached to the tree;
>>> + */
>>> + struct list_head group_list;
>>> + /*
>>> + * Entry into the group_list list above;
>>> + * the entry may be attached to the self group_list list above
>>> + * in case the event is directly attached to the pinned or
>>> + * flexible tree;
>>> + */
>>> + struct list_head group_list_entry;
>>
>> We already have group_entry for threading the RB tree. I don't see why
>> we need two new lists heads.
>
> Ok. For a group leader event, its group_entry may be used instead of the new
> group_list_entry.
>
>>
>>> + /*
>>> * We need storage to track the entries in perf_pmu_migrate_context; we
>>> * cannot use the event_entry because of RCU and we want to keep the
>>> * group in tact which avoids us using the other two entries.
>>> @@ -741,8 +762,17 @@ struct perf_event_context {
>>> struct mutex mutex;
>>>
>>> struct list_head active_ctx_list;
>>> - struct list_head pinned_groups;
>>> - struct list_head flexible_groups;
>>> + /*
>>> + * RB tree for pinned groups; keeps event's group_node
>>> + * nodes of attached pinned groups;
>>> + */
>>> + struct rb_root pinned_tree;
>>> + /*
>>> + * RB tree for flexible groups; keeps event's group_node
>>> + * nodes of attached flexible groups;
>>> + */
>>> + struct rb_root flexible_tree;
>>
>> We didn't need these commesnts before, and I don't think we need them
>> now.
>>
>> Any reason not to stick with the existing names?
>>
>> The existing {pinned,flexible}_groups names are fine regardless of
>> whether a list or tree is used, and makes it clear that the elements are
>> the group leaders, not all events.
>
> Yes, that makes sense. I will keep the existing names and just change the
> type of the data structures.
>
>>
>>> +
>>> struct list_head event_list;
>>> int nr_events;
>>> int nr_active;
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index bc63f8d..a3531f9 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -1458,13 +1458,142 @@ static enum event_type_t get_event_type(struct perf_event *event)
>>> return event_type;
>>> }
>>>
>>> -static struct list_head *
>>> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>>> +static struct rb_root *
>>> +ctx_group_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
>>> {
>>> if (event->attr.pinned)
>>> - return &ctx->pinned_groups;
>>> + return &ctx->pinned_tree;
>>> else
>>> - return &ctx->flexible_groups;
>>> + return &ctx->flexible_tree;
>>> +}
>>> +
>>> +/*
>>> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
>>> + * is already attached to the tree then the event is added to the attached
>>> + * group's group_list list.
>>> + */
>>> +static void
>>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>>> +{
>>> + struct rb_node **node;
>>> + struct rb_node *parent;
>>> +
>>> + WARN_ON_ONCE(!tree || !event);
>>> +
>>> + node = &tree->rb_node;
>>> + parent = *node;
>>
>> The first iteration of the loop handles this, so it can go.
>
> If the tree is empty, parent would be left uninitialized, which is harmful.
>
>>
>>> +
>>> + while (*node) {
>>> + struct perf_event *node_event = container_of(*node,
>>> + struct perf_event, group_node);
>>> +
>>> + parent = *node;
>>> +
>>> + if (event->cpu < node_event->cpu) {
>>> + node = &parent->rb_left;
>>> + } else if (event->cpu > node_event->cpu) {
>>> + node = &parent->rb_right;
>>> + } else {
>>> + list_add_tail(&event->group_list_entry,
>>> + &node_event->group_list);
>>> + return;
>>
>> Why aren't we putting all group leaders into the tree?
>
> Because we index the tree by cpu only: if two groups share a cpu, we link
> the later one into the group_list of the group that is directly attached
> to the tree via its group_node.
>
>>
>> I don't think this is what Peter meant when he suggested a threaded rb
>> tree.
>
> Yes, that's right. This implements cpu-indexed rb trees with a linked
> list attached to every tree node. Each list contains only the event
> groups allocated for that same cpu.
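Schematically, the structure in this version looks like this (illustration
only; the G* names are made up):

  pinned_tree / flexible_tree (struct rb_root, keyed by event->cpu)

                    [cpu 1]   group_list: G1 -> G4
                    /     \
             [cpu -1]     [cpu 3]   group_list: G3
             group_list: G0 -> G2

Each bracketed node is the group leader holding the rb_node for that cpu;
it also sits on its own group_list (via group_list_entry) together with
any other leaders for the same cpu.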
>
>>
>> My understanding was that every group leader should be in the rb tree,
>> and every group leader should be in the same list that threads the whole
>> tree. Both the rb tree and list have to be maintained with the same
>> order.
>>
>> Thus, you can walk the rb tree to find the first event in the sub-list
>> that you care about, then walk that sub-list.
>>
>> Note that this means we need to shuffle the list and rb tree to rotate
>> events in each sub-tree.
>
> We still rotate a list, but it contains only the groups for the
> appropriate cpu, and the tree itself is not touched.
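Roughly, rotation in this scheme could look like the sketch below
(illustrative only; the helper name and body are not part of the patch):

static void
perf_cpu_tree_rotate_cpu(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;

	while (node) {
		struct perf_event *node_event = container_of(node,
				struct perf_event, group_node);

		if (cpu < node_event->cpu) {
			node = node->rb_left;
		} else if (cpu > node_event->cpu) {
			node = node->rb_right;
		} else {
			/*
			 * Move the first group to the tail of this cpu's
			 * list; the rb tree itself is left untouched.
			 */
			list_rotate_left(&node_event->group_list);
			break;
		}
	}
}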
>
>>
>>> + }
>>> + }
>>> +
>>> + list_add_tail(&event->group_list_entry, &event->group_list);
>>> +
>>> + rb_link_node(&event->group_node, parent, node);
>>> + rb_insert_color(&event->group_node, tree);
>>> +}
>>> +
>>> +/*
>>> + * Delete a group from a tree. If the group is directly attached to the tree
>>> + * it also detaches all groups on the group's group_list list.
>>> + */
>>> +static void
>>> +perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
>>> +{
>>> + WARN_ON_ONCE(!tree || !event);
>>> +
>>> + if (RB_EMPTY_NODE(&event->group_node)) {
>>> + list_del_init(&event->group_list_entry);
>>> + } else {
>>> + struct perf_event *list_event, *list_next;
>>> +
>>> + rb_erase(&event->group_node, tree);
>>> + list_for_each_entry_safe(list_event, list_next,
>>> + &event->group_list, group_list_entry)
>>> + list_del_init(&list_event->group_list_entry);
>>
>> At this point, all the other group leaders with the same filter are gone
>> from the rb tree, since we didn't insert any of them into the rb tree in
>> place of the leader we deleted.
>>
>>> + }
>>> +}
>>> +
>>> +/*
>>> + * Find group_list list by a cpu key and call provided callback for every
>>> + * group on the list. Iteration stops if the callback returns non zero.
>>> + */
>>> +
>>> +typedef int (*perf_cpu_tree_callback_t)(struct perf_event *, void *);
>>> +
>>> +static int
>>> +perf_cpu_tree_iterate_cpu(struct rb_root *tree, int cpu,
>>> + perf_cpu_tree_callback_t callback, void *data)
>>> +{
>>> + int ret = 0;
>>> + struct rb_node *node;
>>> + struct perf_event *event;
>>> +
>>> + WARN_ON_ONCE(!tree);
>>> +
>>> + node = tree->rb_node;
>>> +
>>> + while (node) {
>>> + struct perf_event *node_event = container_of(node,
>>> + struct perf_event, group_node);
>>> +
>>> + if (cpu < node_event->cpu) {
>>> + node = node->rb_left;
>>> + } else if (cpu > node_event->cpu) {
>>> + node = node->rb_right;
>>> + } else {
>>> + list_for_each_entry(event, &node_event->group_list,
>>> + group_list_entry) {
>>> + ret = callback(event, data);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> + }
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/*
>>> + * Iterate a tree and call provided callback for every group in the tree.
>>> + * Iteration stops if the callback returns non zero.
>>> + */
>>> +static int
>>> +perf_cpu_tree_iterate(struct rb_root *tree,
>>> + perf_cpu_tree_callback_t callback, void *data)
>>> +{
>>> + int ret = 0;
>>> + struct rb_node *node;
>>> + struct perf_event *event;
>>> +
>>> + WARN_ON_ONCE(!tree);
>>> +
>>> + for (node = rb_first(tree); node; node = rb_next(node)) {
>>> + struct perf_event *node_event = container_of(node,
>>> + struct perf_event, group_node);
>>> +
>>> + list_for_each_entry(event, &node_event->group_list,
>>> + group_list_entry) {
>>> + ret = callback(event, data);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> + }
>>> +
>>> + return 0;
>>> }
>>
>> If you need to iterate over every event, you can use the list that
>> threads the whole tree.
>>
>>>
>>> /*
>>> @@ -1485,12 +1614,12 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>>> * perf_group_detach can, at all times, locate all siblings.
>>> */
>>> if (event->group_leader == event) {
>>> - struct list_head *list;
>>> + struct rb_root *tree;
>>>
>>> event->group_caps = event->event_caps;
>>>
>>> - list = ctx_group_list(event, ctx);
>>> - list_add_tail(&event->group_entry, list);
>>> + tree = ctx_group_cpu_tree(event, ctx);
>>> + perf_cpu_tree_insert(tree, event);
>>
>> Can we wrap this in a helper that finds the sub-tree and performs
>> the insert?
>
> Ok. add_to_groups().
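Something along these lines, I suppose (sketch only; the exact signature
can still change in the next version):

static void
add_to_groups(struct perf_event *event, struct perf_event_context *ctx)
{
	struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

	perf_cpu_tree_insert(tree, event);
}

so that list_add_event() just calls add_to_groups(event, ctx) instead of
open coding the lookup and insert.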
>
>>
>>> }
>>>
>>> list_update_cgroup_event(event, ctx, true);
>>> @@ -1680,8 +1809,12 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>>
>>> list_del_rcu(&event->event_entry);
>>>
>>> - if (event->group_leader == event)
>>> - list_del_init(&event->group_entry);
>>> + if (event->group_leader == event) {
>>> + struct rb_root *tree;
>>> +
>>> + tree = ctx_group_cpu_tree(event, ctx);
>>> + perf_cpu_tree_delete(tree, event);
>>
>> ... likewise?
>
> Ok. del_from_groups().
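And the symmetric helper (same caveats as above):

static void
del_from_groups(struct perf_event *event, struct perf_event_context *ctx)
{
	struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

	perf_cpu_tree_delete(tree, event);
}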
>
>>
>> Thanks,
>> Mark.
>>
>
> Thanks,
> Alexey
>