Message-ID: <f856d105-5463-4b8b-8715-0e6871165616@kernel.org>
Date: Wed, 23 Oct 2024 19:09:47 +0200
From: Matthieu Baerts <matttbe@...nel.org>
To: kan.liang@...ux.intel.com
Cc: linux-kernel@...r.kernel.org, irogers@...gle.com, peterz@...radead.org,
mingo@...hat.com, acme@...nel.org, namhyung@...nel.org,
linux-perf-users@...r.kernel.org
Subject: Re: [PATCH 1/7] perf: Generic hotplug support for a PMU with a scope
Hi Kan Liang,

(+ cc Perf ML)

Thank you for the patch.

On 02/08/2024 17:16, kan.liang@...ux.intel.com wrote:
> From: Kan Liang <kan.liang@...ux.intel.com>
>
> The perf subsystem assumes that the counters of a PMU are per-CPU, so
> the user-space tool reads a counter from each CPU in system-wide
> mode. However, many PMUs don't have per-CPU counters: a counter is
> effective for a scope, e.g., a die or a socket. To address this, each
> such kernel driver exposes a cpumask that designates a single CPU to
> stand for the whole scope. In case the designated CPU is removed,
> hotplug support has to be implemented in each such driver.
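Side note for readers of this thread: the mask in question is the
"cpumask" attribute under the PMU's sysfs directory, e.g. (the PMU name
and the output here are only illustrative):

  $ cat /sys/bus/event_source/devices/uncore_imc/cpumask
  0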
>
> The code to support the cpumask and hotplug is very similar across
> drivers:
> - Expose a cpumask in sysfs.
> - Pick another CPU in the same scope if the designated CPU is removed.
> - Invoke perf_pmu_migrate_context() to migrate to the new CPU.
> - In event init, always set event->cpu to the CPU from the cpumask.
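For reference, the duplicated per-driver pattern being factored out
looks roughly like the sketch below. This is only a simplified
illustration with made-up names ("my_pmu", and a die scope picked
arbitrarily), not code from any single driver:

  static struct pmu my_pmu;
  static cpumask_t my_pmu_cpumask;	/* CPU standing for the scope */

  /* 1) Expose the cpumask in sysfs. */
  static ssize_t cpumask_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
  {
	  return cpumap_print_to_pagebuf(true, buf, &my_pmu_cpumask);
  }
  static DEVICE_ATTR_RO(cpumask);

  /* 2) + 3) On hotplug, pick another CPU of the scope and migrate. */
  static int my_pmu_offline_cpu(unsigned int cpu)
  {
	  unsigned int target;

	  if (!cpumask_test_and_clear_cpu(cpu, &my_pmu_cpumask))
		  return 0;

	  /* Any other CPU of the same die (real code also checks "online"). */
	  target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
	  if (target < nr_cpu_ids) {
		  cpumask_set_cpu(target, &my_pmu_cpumask);
		  perf_pmu_migrate_context(&my_pmu, cpu, target);
	  }
	  return 0;
  }

  /* 4) In event_init, redirect the event to the designated CPU: */
  event->cpu = cpumask_first(&my_pmu_cpumask);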
>
> This duplicated code is implemented in each such PMU driver. It would
> be good to introduce a generic infrastructure to avoid such
> duplication.
>
> Five popular scopes are implemented here: core, die, cluster, pkg, and
> system-wide. The scope can be set when a PMU is registered. If it is, a
> "cpumask" is automatically exposed for the PMU.
>
> The "cpumask" is from the perf_online_<scope>_mask, which is to track
> the active CPU for each scope. They are set when the first CPU of the
> scope is online via the generic perf hotplug support. When a
> corresponding CPU is removed, the perf_online_<scope>_mask is updated
> accordingly and the PMU will be moved to a new CPU from the same scope
> if possible.
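On the online side, I assume the generic code does the inverse of the
perf_event_clear_cpumask() quoted below, i.e. only the first online CPU
of each scope gets the bit. A simplified reconstruction from the
description (not the actual patch code), using the helpers visible in
the diff:

  static void perf_event_setup_cpumask(unsigned int cpu)
  {
	  unsigned int scope;

	  cpumask_set_cpu(cpu, perf_online_mask);

	  for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) {
		  const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(scope, cpu);
		  struct cpumask *pmu_cpumask = perf_scope_cpumask(scope);

		  if (WARN_ON_ONCE(!pmu_cpumask || !cpumask))
			  continue;

		  /* The scope already has a designated online CPU. */
		  if (cpumask_intersects(cpumask, pmu_cpumask))
			  continue;

		  cpumask_set_cpu(cpu, pmu_cpumask);
	  }
  }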
It looks like this modification causes the following warning on my side
when shutting down a VM running a kernel built with a debug config
including CONFIG_PROVE_RCU_LIST=y (and CONFIG_RCU_EXPERT=y):
> # /usr/lib/klibc/bin/poweroff
>
> =============================
> WARNING: suspicious RCU usage
> 6.12.0-rc3+ #3 Not tainted
> -----------------------------
> kernel/events/core.c:13962 RCU-list traversed in non-reader section!!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 1
> 3 locks held by poweroff/11748:
> #0: ffffffff9b441e28 (system_transition_mutex){+.+.}-{3:3}, at: __do_sys_reboot (kernel/reboot.c:770)
> #1: ffffffff9b43eab0 ((reboot_notifier_list).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain (kernel/notifier.c:388)
> #2: ffffffff9b6d06c8 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context (kernel/events/core.c:13983)
>
> stack backtrace:
> CPU: 0 UID: 0 PID: 11748 Comm: poweroff Not tainted 6.12.0-rc3+ #3
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
> <TASK>
> dump_stack_lvl (lib/dump_stack.c:123)
> lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822)
> perf_event_clear_cpumask (kernel/events/core.c:13962 (discriminator 9))
> ? __pfx_perf_event_clear_cpumask (kernel/events/core.c:13939)
> ? acpi_execute_simple_method (drivers/acpi/utils.c:679)
> ? __pfx_acpi_execute_simple_method (drivers/acpi/utils.c:679)
> ? md_notify_reboot (drivers/md/md.c:9860)
> perf_event_exit_cpu_context (kernel/events/core.c:13984 (discriminator 1))
> perf_reboot (kernel/events/core.c:14066 (discriminator 3))
> ? trace_notifier_run (include/trace/events/notifier.h:59 (discriminator 2))
> notifier_call_chain (kernel/notifier.c:93)
> blocking_notifier_call_chain (kernel/notifier.c:389)
> kernel_power_off (kernel/reboot.c:294)
> __do_sys_reboot (kernel/reboot.c:771)
> ? __pfx___do_sys_reboot (kernel/reboot.c:717)
> ? __pfx_ksys_sync (fs/sync.c:98)
> do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1))
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
It is very easy for me to reproduce: simply stopping the VM triggers
it. Just in case, here are the steps I used to set up the same VM:
$ cd [kernel source code]
$ echo true > .virtme-exec-run
$ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
--pull always mptcp/mptcp-upstream-virtme-docker:latest \
auto-debug -e RCU_EXPERT -e PROVE_RCU_LIST
I have one question below about the modification you did here.
(...)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index aa3450bdc227..5e1877c4cb4c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
(...)
> @@ -13730,6 +13816,40 @@ static void __perf_event_exit_context(void *__info)
> raw_spin_unlock(&ctx->lock);
> }
>
> +static void perf_event_clear_cpumask(unsigned int cpu)
> +{
> + int target[PERF_PMU_MAX_SCOPE];
> + unsigned int scope;
> + struct pmu *pmu;
> +
> + cpumask_clear_cpu(cpu, perf_online_mask);
> +
> + for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) {
> + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(scope, cpu);
> + struct cpumask *pmu_cpumask = perf_scope_cpumask(scope);
> +
> + target[scope] = -1;
> + if (WARN_ON_ONCE(!pmu_cpumask || !cpumask))
> + continue;
> +
> + if (!cpumask_test_and_clear_cpu(cpu, pmu_cpumask))
> + continue;
> + target[scope] = cpumask_any_but(cpumask, cpu);
> + if (target[scope] < nr_cpu_ids)
> + cpumask_set_cpu(target[scope], pmu_cpumask);
> + }
> +
> + /* migrate */
> + list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
Here is the line causing the warning: the list is traversed with the
_rcu variant, but no (S)RCU read-side lock is held at this point.
I don't know this code well, but I guess you are doing more than just
read operations here if you are migrating something, right? This
traversal is done under "pmus_lock", so maybe the "_rcu" variant is not
needed here, and just using this instead is enough:
list_for_each_entry(pmu, &pmus, entry) {
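Or, if the list really has to be traversed with the _rcu variant here,
I suppose the loop would need a read-side critical section instead. The
pmus list seems to be protected by pmus_srcu for readers elsewhere in
core.c, so something like:

  int idx = srcu_read_lock(&pmus_srcu);

  list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
	  (...)
  }

  srcu_read_unlock(&pmus_srcu, idx);

But since pmus_lock is already held, the plain variant above looks
simpler.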
> + if (pmu->scope == PERF_PMU_SCOPE_NONE ||
> + WARN_ON_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE))
> + continue;
> +
> + if (target[pmu->scope] >= 0 && target[pmu->scope] < nr_cpu_ids)
> + perf_pmu_migrate_context(pmu, cpu, target[pmu->scope]);
> + }
> +}
> +
> static void perf_event_exit_cpu_context(int cpu)
> {
> struct perf_cpu_context *cpuctx;
> @@ -13737,6 +13857,11 @@ static void perf_event_exit_cpu_context(int cpu)
>
> // XXX simplify cpuctx->online
> mutex_lock(&pmus_lock);
> + /*
> + * Clear the cpumasks, and migrate to other CPUs if possible.
> + * Must be invoked before the __perf_event_exit_context.
> + */
> + perf_event_clear_cpumask(cpu);
> cpuctx = per_cpu_ptr(&perf_cpu_context, cpu);
> ctx = &cpuctx->ctx;
>
> @@ -13744,7 +13869,6 @@ static void perf_event_exit_cpu_context(int cpu)
> smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
> cpuctx->online = 0;
> mutex_unlock(&ctx->mutex);
> - cpumask_clear_cpu(cpu, perf_online_mask);
> mutex_unlock(&pmus_lock);
> }
> #else
(...)
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.