linux-kernel - Re: [RFC v2] perf: Rewrite core context handling

Message-ID: <YwSWhXW+BUA3WkIE@worktop.programming.kicks-ass.net>
Date:   Tue, 23 Aug 2022 10:57:41 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Ravi Bangoria <ravi.bangoria@....com>
Cc:     acme@...nel.org, alexander.shishkin@...ux.intel.com,
        jolsa@...hat.com, namhyung@...nel.org, songliubraving@...com,
        eranian@...gle.com, alexey.budankov@...ux.intel.com,
        ak@...ux.intel.com, mark.rutland@....com, megha.dey@...el.com,
        frederic@...nel.org, maddy@...ux.ibm.com, irogers@...gle.com,
        kim.phillips@....com, linux-kernel@...r.kernel.org,
        santosh.shukla@....com
Subject: Re: [RFC v2] perf: Rewrite core context handling

On Tue, Aug 02, 2022 at 11:46:32AM +0530, Ravi Bangoria wrote:
> On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
> > On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:

> >> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
> >>  {
> >> +	struct perf_event_pmu_context *pmu_ctx;
> >>  	int can_add_hw = 1;
> >>  
> >> -	if (ctx != &cpuctx->ctx)
> >> -		cpuctx = NULL;
> >> -
> >> -	visit_groups_merge(cpuctx, &ctx->pinned_groups,
> >> -			   smp_processor_id(),
> >> -			   merge_sched_in, &can_add_hw);
> >> +	if (pmu) {
> >> +		visit_groups_merge(ctx, &ctx->pinned_groups,
> >> +				   smp_processor_id(), pmu,
> >> +				   merge_sched_in, &can_add_hw);
> >> +	} else {
> >> +		/*
> >> +		 * XXX: This can be optimized for per-task context by calling
> >> +		 * visit_groups_merge() only once with:
> >> +		 *   1) pmu=NULL
> >> +		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
> >> +		 *   3) Making can_add_hw a per-pmu variable
> >> +		 *
> >> +		 * Though, it can not be optimized for per-cpu context because
> >> +		 * per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
> >> +		 * consist of cgroup-subtrees. i.e. cgroup events of the same
> >> +		 * cgroup but different pmus are separated out into respective
> >> +		 * pmu-subtrees.
> >> +		 */
> >> +		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> >> +			can_add_hw = 1;
> >> +			visit_groups_merge(ctx, &ctx->pinned_groups,
> >> +					   smp_processor_id(), pmu_ctx->pmu,
> >> +					   merge_sched_in, &can_add_hw);
> >> +		}
> >> +	}
> >>  }
> > 
> > I'm not sure I follow.. task context can have multiple PMUs just the
> > same as CPU context can, that's more or less the entire point of the
> > patch.
> 
> Current rbtree key is {cpu, cgroup_id, group_idx}. However, effective key for
> task specific context is {cpu, group_idx} because cgroup_id is always 0. And
> effective key for cpu specific context is {cgroup_id, group_idx} because cpu
> is same for entire rbtree.
> 
> With the new design, the rbtree key will be {cpu, pmu, cgroup_id, group_idx}. But as
> explained above, effective key for task specific context will be {cpu, pmu,
> group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(), same as you
> did in the very first RFC[1]. (This may make things more complicated though
> because we might also need to increase heap size to accommodate all pmu events
> in single heap. Current heap size is 2 for task specific context, which is
> sufficient if we iterate over all pmus).
> 
> The same optimization won't work for cpu specific context because its effective
> key would be {pmu, cgroup_id, group_idx} i.e. each pmu subtree is made up of
> cgroup subtrees.

Agreed, new order is: {cpu, pmu, cgroup_id, group_idx}

Event scheduling looks at the {cpu, pmu, cgroup_id} subtree to find the
leftmost group_idx event to schedule next.
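
To make the ordering concrete, here is a rough userspace sketch, not the
kernel code. It models the tree as a sorted array; struct group_key,
group_key_cmp() and subtree_first() are names made up for this
illustration, and the pmu pointer is replaced by a plain integer. It only
shows how a {cpu, pmu, cgroup_id} prefix selects a subtree whose first
element has the smallest group_idx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the sort key of one event group. */
struct group_key {
	int cpu;            /* -1 models an "any CPU" per-task event */
	uintptr_t pmu;      /* stand-in for a struct pmu pointer */
	uint64_t cgroup_id; /* 0 models "no cgroup" */
	uint64_t group_idx; /* insertion order inside the subtree */
};

static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/* Lexicographic compare in the order {cpu, pmu, cgroup_id, group_idx}. */
static int group_key_cmp(const struct group_key *a, const struct group_key *b)
{
	int r;

	if (a->cpu != b->cpu)
		return a->cpu < b->cpu ? -1 : 1;
	r = cmp_u64((uint64_t)a->pmu, (uint64_t)b->pmu);
	if (r)
		return r;
	r = cmp_u64(a->cgroup_id, b->cgroup_id);
	if (r)
		return r;
	return cmp_u64(a->group_idx, b->group_idx);
}

/*
 * Find the leftmost element of the {cpu, pmu, cgroup_id} subtree in an
 * array sorted by group_key_cmp(). Returns NULL if the subtree is empty.
 */
static const struct group_key *
subtree_first(const struct group_key *keys, size_t n,
	      int cpu, uintptr_t pmu, uint64_t cgroup_id)
{
	struct group_key probe = { cpu, pmu, cgroup_id, 0 };
	size_t lo = 0, hi = n;

	assert(keys != NULL || n == 0);

	/* Lower bound: first element that is not smaller than the probe. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (group_key_cmp(&keys[mid], &probe) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	if (lo == n || keys[lo].cpu != cpu || keys[lo].pmu != pmu ||
	    keys[lo].cgroup_id != cgroup_id)
		return NULL;
	return &keys[lo];
}
```

In this model a lookup costs one binary search per subtree, which is the
"find the sub-tree, then walk it in group_idx order" pattern described
above.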

However, since cgroup events are per-cpu events, per-task events will
always have cgroup=NULL, resulting in the subtrees:

  {-1, pmu, NULL} and {cpu, pmu, NULL}

Which is what the code does: it iterates ctx->pmu_ctx_list to find all
@pmu values and then for each does the schedule dance.
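
As a rough illustration of why that loop is per-pmu, here is a simplified
userspace model, not the kernel implementation. struct pmu_model,
sched_in_pmu() and sched_in_all() are hypothetical names, and "free
counters" is an assumed stand-in for whatever makes a group fail to
schedule. Each PMU merges its two subtrees by group_idx and keeps its own
can_add_hw, so one full PMU does not stop groups on another PMU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one event group: its order and counter demand. */
struct group {
	unsigned long group_idx;
	int need;
};

/* Hypothetical model of one PMU with its two per-task subtrees. */
struct pmu_model {
	int free_counters;
	const struct group *any_cpu;  /* models the {-1, pmu, NULL} subtree */
	size_t n_any;
	const struct group *this_cpu; /* models the {cpu, pmu, NULL} subtree */
	size_t n_this;
};

/* Merge both subtrees in group_idx order and schedule what fits. */
static int sched_in_pmu(struct pmu_model *p)
{
	size_t i = 0, j = 0;
	int can_add_hw = 1; /* per-PMU state, reset for every PMU */
	int scheduled = 0;

	assert(p != NULL);

	while (i < p->n_any || j < p->n_this) {
		const struct group *g;

		/* Two-way merge: take the smaller group_idx next. */
		if (j == p->n_this ||
		    (i < p->n_any &&
		     p->any_cpu[i].group_idx < p->this_cpu[j].group_idx))
			g = &p->any_cpu[i++];
		else
			g = &p->this_cpu[j++];

		if (!can_add_hw)
			continue; /* an earlier group on this PMU failed */

		if (g->need <= p->free_counters) {
			p->free_counters -= g->need;
			scheduled++;
		} else {
			can_add_hw = 0;
		}
	}
	return scheduled;
}

/* Models the walk over the PMU list: one schedule pass per PMU. */
static int sched_in_all(struct pmu_model *pmus, size_t n)
{
	int total = 0;
	size_t k;

	for (k = 0; k < n; k++)
		total += sched_in_pmu(&pmus[k]);
	return total;
}
```

The two-way merge is the reason a heap of size 2 is enough per PMU in
this model: there are only two sorted runs to merge at any time.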

Now, I suppose making that:

  {-1, NULL, NULL}, {cpu, NULL, NULL}

could work, but wouldn't iterating the tree be more expensive than
just finding the sub-trees as we do now?

You also talk about extending the heap, which I read as doing the
heap-merge over:

 {-1, pmu0, NULL}, {-1, pmu1, NULL}, ...
 {cpu, pmu0, NULL}, ...

But that doesn't make sense; the schedule dance is per-pmu.

Or am I just still not getting it?