Message-ID: <Zeib-e1RYm2Dlk8c@FVFF77S0Q05N>
Date: Wed, 6 Mar 2024 16:38:17 +0000
From: Mark Rutland <mark.rutland@....com>
To: "JiaLong.Yang" <jialong.yang@...ngroup.cn>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] perf/core: Handle generic events normally instead of
trying all pmu
On Wed, Mar 06, 2024 at 11:10:15PM +0800, JiaLong.Yang wrote:
> We have to pay extra effort to handle generic events. Treating them
> specially has cost us more.
Are you seeing an actual problem in practice? ... or is this just intended as a
cleanup/improvement?
What problem are you actually trying to solve here?
> Now time has told us that a PMU type between 0 and PERF_TYPE_MAX only
> corresponds to one PMU.
There can be at most one PMU registered for each type, but several PMUs may
need to handle events for that type.
Trivially, there are a number of PMUs which share the PERF_TYPE_SOFTWARE type,
e.g.
* perf_swevent, handled by perf_swevent_init()
* perf_cpu_clock, handled by cpu_clock_event_init()
* perf_task_clock, handled by task_clock_event_init()
.. which have non-overlapping events in the PERF_TYPE_SOFTWARE event
namespace.
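Each of those returns -ENOENT for events it doesn't own, so the core falls
through and tries the next PMU. Roughly (a trimmed sketch of
cpu_clock_event_init()):

    static int cpu_clock_event_init(struct perf_event *event)
    {
            if (event->attr.type != PERF_TYPE_SOFTWARE)
                    return -ENOENT;

            /* Not ours; let another PERF_TYPE_SOFTWARE PMU claim it. */
            if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
                    return -ENOENT;

            perf_swevent_init_hrtimer(event);
            return 0;
    }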
On systems with heterogeneous PMUs (e.g. ARM big.LITTLE systems), several PMUs
can handle the PERF_TYPE_{HARDWARE,RAW,CACHE} event namespaces, and there's
existing software that depends upon that working *without* the extended HW type
IDs.
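For reference, the extended HW type IDs put the target PMU's dynamic type in
the high bits of attr.config (sketch, using the upstream PERF_PMU_TYPE_SHIFT
of 32; 'big_pmu_type' stands for the value read from
/sys/bus/event_source/devices/<pmu>/type):

    struct perf_event_attr attr = {
            .type   = PERF_TYPE_HARDWARE,
            /* high 32 bits: dynamic PMU type; low bits: generic event ID */
            .config = ((__u64)big_pmu_type << 32) | PERF_COUNT_HW_CPU_CYCLES,
    };

.. whereas legacy userspace leaves the high bits zero and expects whichever
PMU matches to accept the plain event ID.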
> So we can handle the special case when registering the PMU, not when opening
> the event. And we can know which PMU is used for the generic event.
No we cannot. There may be several PMUs which need to handle a given generic
event type.
> The added capabilities PERF_PMU_CAP_EVENT_HARDWARE and
> PERF_PMU_CAP_EVENT_HW_CACHE will allocate the PERF_TYPE_HARDWARE and
> PERF_TYPE_HW_CACHE IDR slots for the same PMU.
When I suggested using capabilities in:
https://lore.kernel.org/lkml/ZecStMBA4YgQaBEZ@FVFF77S0Q05N/
.. I had meant that the core code could check just before calling the
pmu::event_init() function, in order to simplify the implementation of the
event_init() functions. I did not mean for this to change the way we
manipulate the IDR.
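i.e. something like the sketch below, called from perf_try_init_event() just
before pmu->event_init(); the helper name is illustrative, reusing this
patch's capability defines:

    /* Core-side filter, so each event_init() no longer needs one. */
    static bool perf_pmu_supports_event_type(struct pmu *pmu,
                                             struct perf_event *event)
    {
            if (event->attr.type == PERF_TYPE_HARDWARE &&
                !(pmu->capabilities & PERF_PMU_CAP_EVENT_HARDWARE))
                    return false;

            if (event->attr.type == PERF_TYPE_HW_CACHE &&
                !(pmu->capabilities & PERF_PMU_CAP_EVENT_HW_CACHE))
                    return false;

            return true;
    }

.. with perf_try_init_event() returning -ENOENT when that fails, and no
change to the IDR at registration time.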
> We'd better handle uncore-task events before calling pmu->event_init().
This is a nice-to-have, but it's logically separate from the rest of this
patch.
>
> The code is compatible with PERF_PMU_TYPE_SHIFT.
> /*
> * attr.config layout for type PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
> * PERF_TYPE_HARDWARE: 0xEEEEEEEE000000AA
> * AA: hardware event ID
> * EEEEEEEE: PMU type ID
> * PERF_TYPE_HW_CACHE: 0xEEEEEEEE00DDCCBB
> * BB: hardware cache ID
> * CC: hardware cache op ID
> * DD: hardware cache op result ID
> * EEEEEEEE: PMU type ID
> * If the PMU type ID is 0, the PERF_TYPE_RAW will be applied.
> */
>
> But obviously the drivers have to declare the corresponding capabilities.
>
> Signed-off-by: JiaLong.Yang <jialong.yang@...ngroup.cn>
> ---
> include/linux/perf_event.h | 4 +-
> kernel/events/core.c | 83 +++++++++++++++++++++++---------------
> 2 files changed, 54 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d2a15c0c6f8a..edf4365ab7cc 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -290,7 +290,9 @@ struct perf_event_pmu_context;
> #define PERF_PMU_CAP_ITRACE 0x0020
> #define PERF_PMU_CAP_NO_EXCLUDE 0x0040
> #define PERF_PMU_CAP_AUX_OUTPUT 0x0080
> -#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
> +#define PERF_PMU_CAP_EVENT_HARDWARE 0x0100
> +#define PERF_PMU_CAP_EVENT_HW_CACHE 0x0200
> +#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0300
>
> struct perf_output_handle;
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f0f0f71213a1..02f14ae09d09 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -11516,6 +11516,7 @@ static struct lock_class_key cpuctx_lock;
> int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> {
> int cpu, ret, max = PERF_TYPE_MAX;
> + int cap = pmu->capabilities;
>
> mutex_lock(&pmus_lock);
> ret = -ENOMEM;
> @@ -11523,7 +11524,6 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> if (!pmu->pmu_disable_count)
> goto unlock;
>
> - pmu->type = -1;
> if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
> ret = -EINVAL;
> goto free_pdc;
> @@ -11538,15 +11538,34 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> if (ret < 0)
> goto free_pdc;
>
> + /*
> + * Ensure one type ([0, PERF_TYPE_MAX)) correspond to one PMU.
> + */
> WARN_ON(type >= 0 && ret != type);
>
> type = ret;
> pmu->type = type;
>
> + if ((type != PERF_TYPE_HARDWARE) &&
> + (cap & PERF_PMU_CAP_EVENT_HARDWARE)) {
> + ret = idr_alloc(&pmu_idr, pmu, PERF_TYPE_HARDWARE,
> + PERF_TYPE_HARDWARE + 1, GFP_KERNEL);
> + if (ret < 0)
> + goto free_idr;
> + }
This means that if I try to register more than one PMU with
PERF_PMU_CAP_EVENT_HARDWARE, the second will fail IDR allocation and fail to
register.
.. so this is entirely useless for systems with heterogeneous PMUs, where each
of those *should* be able to handle plain PERF_TYPE_HARDWARE events for legacy reasons.
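Concretely, consider a big.LITTLE system where both CPU PMUs set the new
capability (sketch; 'pmu_big'/'pmu_little' are illustrative):

    /* First CPU PMU: succeeds, occupies the PERF_TYPE_HARDWARE slot. */
    ret = idr_alloc(&pmu_idr, pmu_big, PERF_TYPE_HARDWARE,
                    PERF_TYPE_HARDWARE + 1, GFP_KERNEL);

    /*
     * Second CPU PMU: the slot is already taken, so idr_alloc() returns
     * -ENOSPC and perf_pmu_register() fails outright.
     */
    ret = idr_alloc(&pmu_idr, pmu_little, PERF_TYPE_HARDWARE,
                    PERF_TYPE_HARDWARE + 1, GFP_KERNEL);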
I agree that the existing event initialization code is grotty, but that's
largely an artifact of maintaining the existing ABI. AFAICT this patch breaks
that, and makes the code complex in a different way rather than actually
simplifying anything.
Therefore, NAK to this patch.
As above, I think it does make sense to have the core code own some common
checks (e.g. rejecting task-bound uncore events, filtering generic events from
PMUs that don't support them), so I'm open to patches doing that.
Mark.
> +
> + if ((type != PERF_TYPE_HW_CACHE) &&
> + (cap & PERF_PMU_CAP_EVENT_HW_CACHE)) {
> + ret = idr_alloc(&pmu_idr, pmu, PERF_TYPE_HW_CACHE,
> + PERF_TYPE_HW_CACHE + 1, GFP_KERNEL);
> + if (ret < 0)
> + goto free_idr_hw;
> + }
> +
> if (pmu_bus_running && !pmu->dev) {
> ret = pmu_dev_alloc(pmu);
> if (ret)
> - goto free_idr;
> + goto free_idr_hw_cache;
> }
>
> ret = -ENOMEM;
> @@ -11604,6 +11623,14 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> put_device(pmu->dev);
> }
>
> +free_idr_hw_cache:
> + if (cap & PERF_PMU_CAP_EVENT_HW_CACHE)
> + idr_remove(&pmu_idr, PERF_TYPE_HW_CACHE);
> +
> +free_idr_hw:
> + if (cap & PERF_PMU_CAP_EVENT_HARDWARE)
> + idr_remove(&pmu_idr, PERF_TYPE_HARDWARE);
> +
> free_idr:
> idr_remove(&pmu_idr, pmu->type);
>
> @@ -11648,6 +11675,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
> {
> struct perf_event_context *ctx = NULL;
> int ret;
> + bool uncore_pmu = false, extded_pmu = false;
>
> if (!try_module_get(pmu->module))
> return -ENODEV;
> @@ -11668,6 +11696,20 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
> BUG_ON(!ctx);
> }
>
> + if (perf_invalid_context == pmu->task_ctx_nr)
> + uncore_pmu = true;
> +
> + if (pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)
> + extded_pmu = true;
> +
> + /* Disallow uncore-task events. */
> + if (uncore_pmu && (event->attach_state & PERF_ATTACH_TASK))
> + return -EINVAL;
> +
> + /* Ensure pmu not supporting generic events will not be passed such event. */
> + if (!extded_pmu && (event->attr.type & PERF_PMU_CAP_EXTENDED_HW_TYPE))
> + return -EINVAL;
> +
> event->pmu = pmu;
> ret = pmu->event_init(event);
>
> @@ -11695,7 +11737,6 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
>
> static struct pmu *perf_init_event(struct perf_event *event)
> {
> - bool extended_type = false;
> int idx, type, ret;
> struct pmu *pmu;
>
> @@ -11720,36 +11761,15 @@ static struct pmu *perf_init_event(struct perf_event *event)
> * are often aliases for PERF_TYPE_RAW.
> */
> type = event->attr.type;
> - if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) {
> - type = event->attr.config >> PERF_PMU_TYPE_SHIFT;
> - if (!type) {
> - type = PERF_TYPE_RAW;
> - } else {
> - extended_type = true;
> - event->attr.config &= PERF_HW_EVENT_MASK;
> - }
> - }
> + if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
> + event->attr.config &= PERF_HW_EVENT_MASK;
>
> -again:
> rcu_read_lock();
> pmu = idr_find(&pmu_idr, type);
> rcu_read_unlock();
> - if (pmu) {
> - if (event->attr.type != type && type != PERF_TYPE_RAW &&
> - !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
> - goto fail;
>
> - ret = perf_try_init_event(pmu, event);
> - if (ret == -ENOENT && event->attr.type != type && !extended_type) {
> - type = event->attr.type;
> - goto again;
> - }
> -
> - if (ret)
> - pmu = ERR_PTR(ret);
> -
> - goto unlock;
> - }
> + if (!pmu || perf_try_init_event(pmu, event))
> + goto fail;
>
> list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
> ret = perf_try_init_event(pmu, event);
> @@ -12026,11 +12046,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> }
>
> /*
> - * Disallow uncore-task events. Similarly, disallow uncore-cgroup
> - * events (they don't make sense as the cgroup will be different
> - * on other CPUs in the uncore mask).
> + * Disallow uncore-cgroup events (they don't make sense
> + * as the cgroup will be different on other CPUs in the uncore mask).
> */
> - if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
> + if (task || cgroup_fd != -1) {
> err = -EINVAL;
> goto err_pmu;
> }
> --
> 2.25.1
>