Message-ID: <07a76338-4c71-569a-d36e-7d6bcd10bd74@linux.intel.com>
Date:   Fri, 16 Jun 2017 01:10:10 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Mark Rutland <mark.rutland@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Kan Liang <kan.liang@...el.com>,
        Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
        Valery Cherepennikov <valery.cherepennikov@...el.com>,
        David Carrillo-Cisneros <davidcc@...gle.com>,
        Stephane Eranian <eranian@...gle.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/n] perf/core: addressing 4x slowdown during
 per-process profiling of STREAM benchmark on Intel Xeon Phi

Hi,

On 15.06.2017 22:56, Mark Rutland wrote:
> Hi,
> 
> On Thu, Jun 15, 2017 at 08:41:42PM +0300, Alexey Budankov wrote:
>> This series of patches continues v2 and addresses captured comments.
> 
> As a general note, please keep any change log stuff out of the commit
> message, and move that just before the diff. It generally doesn't make
> sense for that to go in the commit message.

Accepted.

> 
>> Specifically this patch replaces pinned_groups and flexible_groups
>> lists of perf_event_context by red-black cpu indexed trees avoiding
>> data structures duplication and introducing possibility to iterate
>> event groups for a specific CPU only.
> 
> If you use --per-thread, I take it the overhead is significantly
> lowered?

Could you please elaborate?

> 
> As another general thing, please aim to write the commit message so that
> it'll make sense in N years' time, in the git history. Describe the
> whole problem, and the intended solution.
> 
> I had to go diving into mail archives to understand the problem you're
> trying to solve, and I shouldn't have to.
> 
> For example, this commit message could be something like:
> 
>      perf/core: use rb trees for pinned/flexible groups
> 
>      By default, the userspace perf tool opens per-cpu task-bound events
>      when sampling, so for N logical events requested by the user, the tool
>      will open N * NR_CPUS events.
> 
>      In the kernel, we mux events with a hrtimer, periodically rotating the
>      event list and trying to schedule each in turn. We skip events whose
>      cpu filter doesn't match. So when we get unlucky, we can walk N *
>      (NR_CPUS - 1) events pointlessly for each hrtimer invocation.
> 
>      This has been observed to result in significant overhead when running
>      the STREAM benchmark on 272 core Xeon Phi systems.
> 
>      One way to avoid this is to place our events into an rb tree sorted by
>      CPU filter, so that our hrtimer can skip to the current CPU's
>      sub-tree, and ignore everything else.
> 
>      As a step towards that, this patch replaces our event lists with rb
>      trees.
>   
> ... which hopefully makes sense now, and will in 2, 5, 10 years' time,
> 

Accepted. Thanks.

>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@...ux.intel.com>
> 
> Have you thrown Vince's perf fuzzer at this?
> 
> If you haven't, please do. It can be found in the fuzzer directory of:
> 
> https://github.com/deater/perf_event_tests
> 

Accepted.

>> ---
>>   include/linux/perf_event.h |  34 +++-
>>   kernel/events/core.c       | 386 +++++++++++++++++++++++++++++++++------------
>>   2 files changed, 315 insertions(+), 105 deletions(-)
> 
> <changelog goes here>
> 
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 24a6358..2c1dcf1 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -574,6 +574,27 @@ struct perf_event {
>>   	struct list_head		sibling_list;
>>
>>   	/*
>> +	 * Node on the pinned or flexible tree located at the event context;
>> +	 * the node may be empty in case its event is not directly attached
>> +	 * to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct rb_node			group_node;
>> +	/*
>> +	 * List keeps groups allocated for the same cpu;
>> +	 * the list may be empty in case its event is not directly
>> +	 * attached to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct list_head		group_list;
>> +	/*
>> +	 * Entry into the group_list list above;
>> +	 * the entry may be attached to the self group_list list above
>> +	 * in case the event is directly attached to the pinned or
>> +	 * flexible tree;
>> +	 */
>> +	struct list_head		group_list_entry;
> 
> We already have group_entry for threading the RB tree. I don't see why
> we need two new lists heads.

Ok. For a group leader event, its existing group_entry may be used instead 
of the new group_list_entry.
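
A rough sketch of how the event fields might then look, assuming 
group_entry is reused to thread leaders into the per-cpu lists (not the 
final layout):

	/* Node in the per-context pinned/flexible rb tree (leaders only). */
	struct rb_node			group_node;
	/* Head of the list of group leaders sharing this node's cpu. */
	struct list_head		group_list;
	/* group_entry (already present) would then link the leader into
	 * the owning node's group_list, replacing group_list_entry. */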

> 
>> +	/*
>>   	 * We need storage to track the entries in perf_pmu_migrate_context; we
>>   	 * cannot use the event_entry because of RCU and we want to keep the
>>   	 * group in tact which avoids us using the other two entries.
>> @@ -741,8 +762,17 @@ struct perf_event_context {
>>   	struct mutex			mutex;
>>
>>   	struct list_head		active_ctx_list;
>> -	struct list_head		pinned_groups;
>> -	struct list_head		flexible_groups;
>> +	/*
>> +	 * RB tree for pinned groups; keeps event's group_node
>> +	 * nodes of attached pinned groups;
>> +	 */
>> +	struct rb_root			pinned_tree;
>> +	/*
>> +	 * RB tree for flexible groups; keeps event's group_node
>> +	 * nodes of attached flexible groups;
>> +	 */
>> +	struct rb_root			flexible_tree;
> 
> We didn't need these comments before, and I don't think we need them
> now.
> 
> Any reason not to stick with the existing names?
> 
> The existing {pinned,flexible}_groups names are fine regardless of
> whether a list or tree is used, and makes it clear that the elements are
> the group leaders, not all events.

Yes, that makes sense. I will keep the existing names and just change the 
type of the data structures.
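
In other words, roughly (a sketch of the agreed naming, not the final 
patch):

	struct list_head		active_ctx_list;
	struct rb_root			pinned_groups;		/* was struct list_head */
	struct rb_root			flexible_groups;	/* was struct list_head */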

> 
>> +
>>   	struct list_head		event_list;
>>   	int				nr_events;
>>   	int				nr_active;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index bc63f8d..a3531f9 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1458,13 +1458,142 @@ static enum event_type_t get_event_type(struct perf_event *event)
>>   	return event_type;
>>   }
>>
>> -static struct list_head *
>> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>> +static struct rb_root *
>> +ctx_group_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
>>   {
>>   	if (event->attr.pinned)
>> -		return &ctx->pinned_groups;
>> +		return &ctx->pinned_tree;
>>   	else
>> -		return &ctx->flexible_groups;
>> +		return &ctx->flexible_tree;
>> +}
>> +
>> +/*
>> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
>> + * is already attached to the tree then the event is added to the attached
>> + * group's group_list list.
>> + */
>> +static void
>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>> +{
>> +	struct rb_node **node;
>> +	struct rb_node *parent;
>> +
>> +	WARN_ON_ONCE(!tree || !event);
>> +
>> +	node = &tree->rb_node;
>> +	parent = *node;
> 
> The first iteration of the loop handles this, so it can go.

If the tree is empty the loop body never runs, so without this assignment 
parent would be left uninitialized, which is harmful.
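
For reference, the usual rb-tree insertion idiom initializes parent to 
NULL before the loop, so rb_link_node() gets a valid parent even when the 
tree is empty (a minimal sketch of the common pattern, not the patch code):

	struct rb_node **node = &tree->rb_node;
	struct rb_node *parent = NULL;	/* correct parent for an empty tree */

	while (*node) {
		parent = *node;
		/* ... descend to rb_left/rb_right or handle equal cpu ... */
	}

	rb_link_node(&event->group_node, parent, node);	/* records parent */
	rb_insert_color(&event->group_node, tree);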

> 
>> +
>> +	while (*node) {
>> +		struct perf_event *node_event =	container_of(*node,
>> +				struct perf_event, group_node);
>> +
>> +		parent = *node;
>> +
>> +		if (event->cpu < node_event->cpu) {
>> +			node = &parent->rb_left;
>> +		} else if (event->cpu > node_event->cpu) {
>> +			node = &parent->rb_right;
>> +		} else {
>> +			list_add_tail(&event->group_list_entry,
>> +					&node_event->group_list);
>> +			return;
> 
> Why aren't we putting all group leaders into the tree?

Because we index the tree by cpu only; if two groups share a cpu, we link 
them into the group_list of the group that is directly attached to the 
tree via its group_node.
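
For example, with groups on cpus 0, 1 and 3 (hypothetical groups G1..G5, 
for illustration only) the structure is roughly:

                     [G2, cpu 1]   group_list: G2 -> G4
                     /          \
         [G1, cpu 0]             [G5, cpu 3]
         group_list: G1 -> G3    group_list: G5

Only the first group inserted for a cpu sits in the rb tree itself; later 
groups for the same cpu hang off that node's group_list via their 
group_list_entry.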

> 
> I don't think this is what Peter meant when he suggested a threaded rb
> tree.

Yes, that's right. This implements cpu-indexed rb trees with a linked 
list attached to every tree node. Each list contains only the event 
groups allocated for the same cpu.

> 
> My understanding was that every group leader should be in the rb tree,
> and every group leader should be in the same list that threads the whole
> tree. Both the rb tree and list have to be maintained with the same
> order.
> 
> Thus, you can walk the rb tree to find the first event in the sub-list
> that you care about, then walk that sub-list.
> 
> Note that this means we need to shuffle the list and rb tree to rotate
> events in each sub-tree.

We still rotate a list, but it contains only the groups for the 
appropriate cpu, and we don't touch the tree.
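
A rotation step for a single cpu could then look roughly like this 
(hypothetical helper name, sketched on top of the structures above):

static void perf_cpu_tree_rotate(struct rb_root *tree, int cpu)
{
	struct rb_node *node = tree->rb_node;

	while (node) {
		struct perf_event *node_event = container_of(node,
				struct perf_event, group_node);

		if (cpu < node_event->cpu) {
			node = node->rb_left;
		} else if (cpu > node_event->cpu) {
			node = node->rb_right;
		} else {
			/* Move the first group to the tail of this cpu's
			 * list; the tree itself is left untouched. */
			list_rotate_left(&node_event->group_list);
			break;
		}
	}
}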

> 
>> +		}
>> +	}
>> +
>> +	list_add_tail(&event->group_list_entry, &event->group_list);
>> +
>> +	rb_link_node(&event->group_node, parent, node);
>> +	rb_insert_color(&event->group_node, tree);
>> +}
>> +
>> +/*
>> + * Delete a group from a tree. If the group is directly attached to the tree
>> + * it also detaches all groups on the group's group_list list.
>> + */
>> +static void
>> +perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
>> +{
>> +	WARN_ON_ONCE(!tree || !event);
>> +
>> +	if (RB_EMPTY_NODE(&event->group_node)) {
>> +		list_del_init(&event->group_list_entry);
>> +	} else {
>> +		struct perf_event *list_event, *list_next;
>> +
>> +		rb_erase(&event->group_node, tree);
>> +		list_for_each_entry_safe(list_event, list_next,
>> +				&event->group_list, group_list_entry)
>> +			list_del_init(&list_event->group_list_entry);
> 
> At this point, all the other group leaders with the same filter are gone
> from the rb tree, since we didn't insert any of them into the rb tree in
> place of the leader we deleted.
> 
>> +	}
>> +}
>> +
>> +/*
>> + * Find group_list list by a cpu key and call provided callback for every
>> + * group on the list. Iteration stops if the callback returns non zero.
>> + */
>> +
>> +typedef int (*perf_cpu_tree_callback_t)(struct perf_event *, void *);
>> +
>> +static int
>> +perf_cpu_tree_iterate_cpu(struct rb_root *tree, int cpu,
>> +		perf_cpu_tree_callback_t callback, void *data)
>> +{
>> +	int ret = 0;
>> +	struct rb_node *node;
>> +	struct perf_event *event;
>> +
>> +	WARN_ON_ONCE(!tree);
>> +
>> +	node = tree->rb_node;
>> +
>> +	while (node) {
>> +		struct perf_event *node_event = container_of(node,
>> +				struct perf_event, group_node);
>> +
>> +		if (cpu < node_event->cpu) {
>> +			node = node->rb_left;
>> +		} else if (cpu > node_event->cpu) {
>> +			node = node->rb_right;
>> +		} else {
>> +			list_for_each_entry(event, &node_event->group_list,
>> +					group_list_entry) {
>> +				ret = callback(event, data);
>> +				if (ret)
>> +					return ret;
>> +			}
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Iterate a tree and call provided callback for every group in the tree.
>> + * Iteration stops if the callback returns non zero.
>> + */
>> +static int
>> +perf_cpu_tree_iterate(struct rb_root *tree,
>> +		perf_cpu_tree_callback_t callback, void *data)
>> +{
>> +	int ret = 0;
>> +	struct rb_node *node;
>> +	struct perf_event *event;
>> +
>> +	WARN_ON_ONCE(!tree);
>> +
>> +	for (node = rb_first(tree); node; node = rb_next(node)) {
>> +		struct perf_event *node_event = container_of(node,
>> +				struct perf_event, group_node);
>> +
>> +		list_for_each_entry(event, &node_event->group_list,
>> +				group_list_entry) {
>> +			ret = callback(event, data);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +
>> +	return 0;
>>   }
> 
> If you need to iterate over every event, you can use the list that
> threads the whole tree.
> 
>>
>>   /*
>> @@ -1485,12 +1614,12 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>>   	 * perf_group_detach can, at all times, locate all siblings.
>>   	 */
>>   	if (event->group_leader == event) {
>> -		struct list_head *list;
>> +		struct rb_root *tree;
>>
>>   		event->group_caps = event->event_caps;
>>
>> -		list = ctx_group_list(event, ctx);
>> -		list_add_tail(&event->group_entry, list);
>> +		tree = ctx_group_cpu_tree(event, ctx);
>> +		perf_cpu_tree_insert(tree, event);
> 
> Can we wrap this in a helper so that finds the sub-tree and performs
> the insert?

Ok. add_to_groups().
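
Presumably something along these lines (a sketch of the agreed helper; 
the exact name and placement may differ):

static void add_to_groups(struct perf_event *event,
			  struct perf_event_context *ctx)
{
	struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

	perf_cpu_tree_insert(tree, event);
}

list_add_event() would then just call add_to_groups(event, ctx), with a 
symmetric del_from_groups() on the removal path below.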

> 
>>   	}
>>
>>   	list_update_cgroup_event(event, ctx, true);
>> @@ -1680,8 +1809,12 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>
>>   	list_del_rcu(&event->event_entry);
>>
>> -	if (event->group_leader == event)
>> -		list_del_init(&event->group_entry);
>> +	if (event->group_leader == event) {
>> +		struct rb_root *tree;
>> +
>> +		tree = ctx_group_cpu_tree(event, ctx);
>> +		perf_cpu_tree_delete(tree, event);
> 
> ... likewise?

Ok. del_from_groups().

> 
> Thanks,
> Mark.
> 

Thanks,
Alexey
