Message-ID: <b3a52af9-e298-7baa-19b3-8931d03731d1@intel.com>
Date: Mon, 4 Nov 2024 14:36:15 -0800
From: Fenghua Yu <fenghua.yu@...el.com>
To: Peter Newman <peternewman@...gle.com>, Reinette Chatre
<reinette.chatre@...el.com>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>, Tony Luck
<tony.luck@...el.com>, Babu Moger <babu.moger@....com>, James Morse
<james.morse@....com>, Martin Kletzander <nert.pinx@...il.com>, Shaopeng Tan
<tan.shaopeng@...itsu.com>, <linux-kernel@...r.kernel.org>,
<eranian@...gle.com>
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter
reads
Hi, Peter,
On 10/31/24 07:25, Peter Newman wrote:
> Performance-conscious users may use threads bound to CPUs within a
> specific monitoring domain to ensure that all bandwidth counters can be
> read efficiently. The hardware counters are only accessible to CPUs
> within the domain, so requests from CPUs outside the domain are
> forwarded to a kernel worker or IPI handler, incurring a substantial
> performance penalty on each read. Recently, this penalty was observed
> to be paid by local reads as well.
>
> To support blocking implementations of resctrl_arch_rmid_read(),
> mon_event_read() switched to smp_call_on_cpu() in most cases to read
> event counters using a kernel worker thread. Unlike
> smp_call_function_any(), which optimizes to a local function call when
> the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
> work unconditionally.
>
> Add a fast path to ensure that requests issued from a CPU within the
> monitoring domain are handled with a simple function call into
> mon_event_count(), regardless of whether all CPUs in the target domain
> are using nohz_full.
>
> This is significant when supporting configurations such as a dual-socket
> AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
> counters for all groups on all domains requires 32768 (32*256*2) counter
> reads. The resolution of global, per-group MBM data which can be
> provided is therefore sensitive to the cost of each counter read.
> Furthermore, redirecting this much work to IPI handlers or worker
> threads at a regular interval is disruptive to the present workload.
>
> The test program fastcat, which was introduced in an earlier patch, was
> used to simulate the impact of this change on an optimized event
> counter-reading procedure. The goal is to maximize the frequency at
> which MBM counters can be dumped, so the benchmark determines the cost
> of an additional global MBM counter sample.
>
> The total number of cycles needed to read all local and total MBM
> counters for a large number of monitoring groups was collected using the
> perf tool. The test was run bound to a single CPU: once targeting
> counters in the local domain and again for counters in a remote domain.
> The cost of a dry-run reading no counters was subtracted from the total
> of each run to remove one-time setup costs.
>
> AMD EPYC 7B12 64-Core Processor (250 mon groups)
>
> Local Domain: 3.25M -> 1.22M (-62.5%)
> Remote Domain: 7.91M -> 8.05M (+2.9%)
>
> Intel(R) Xeon(R) Gold 6268CL CPU @ 2.80GHz (190 mon groups)
>
> Local Domain: 2.98M -> 2.21M (-25.8%)
> Remote Domain: 4.49M -> 4.62M (+3.1%)
>
> Note that there is a small increase in overhead for remote domains,
> which results from the introduction of a put_cpu() call to reenable
> preemption after determining whether the fast path can be used. Users
> sensitive to this cost should consider avoiding the remote counter read
> penalty completely.
>
> Also note that the Remote Domain results and the baseline Local Domain
> results only measure cycles in the test program. Because all counter
> reading work was carried out in kernel worker threads, the total system
> cost of the operation is greater.
>
> Fixes: 09909e098113 ("x86/resctrl: Queue mon_event_read() instead of sending an IPI")
>
> Signed-off-by: Peter Newman <peternewman@...gle.com>
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 +++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 200d89a640270..daaff1cfd3f24 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> return;
> }
>
> + /*
> + * If a performance-conscious caller has gone to the trouble of binding
> + * their thread to the monitoring domain of the event counter, ensure
> + * that the counters are read directly. smp_call_on_cpu()
> + * unconditionally uses a work queue to read the counter, substantially
> + * increasing the cost of the read.
> + *
> + * Preemption must be disabled to prevent a migration out of the domain
> + * after the CPU is checked, which would result in reading the wrong
> + * counters. Note that this makes the (slow) remote path a little slower
> + * by requiring preemption to be reenabled when redirecting the request
> + * to another domain was in fact necessary.
> + *
> + * In the case where all eligible target CPUs are nohz_full and
> + * smp_call_function_any() is used, keep preemption disabled to avoid
> + * the cost of reenabling it twice in the same read.
> + */
> + cpu = get_cpu();
> + if (cpumask_test_cpu(cpu, cpumask)) {
> + mon_event_count(rr);
> + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> + put_cpu();
> + return;
> + }
This fast path code duplicates part of smp_call_function_any().
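For reference, smp_call_function_any() already begins with roughly the
following local-CPU check (paraphrased from kernel/smp.c, with the
node-local fallback elided):

	/* Try for same CPU (cheapest) */
	cpu = get_cpu();
	if (cpumask_test_cpu(cpu, mask))
		goto call;

	/* ... otherwise pick a CPU on the same node, or any online CPU ... */
call:
	ret = smp_call_function_single(cpu, func, info, wait);
	put_cpu();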
In the nohz_full() case, the fast path doesn't gain much and can even hurt
remote domain performance:
1. On the local domain, it may gain a little because it executes a few
lines less than calling smp_call_function_any() directly. But the gain is
minor, coming from slightly less code rather than from avoiding the
heavyweight queued work.
2. On a remote domain, it degrades performance because get_cpu() and
put_cpu() are each called twice: once in the fast path code and once in
smp_call_function_any(). As you mentioned earlier, put_cpu() impacts
performance. I think get_cpu() has the same impact too.
The fast path only gains in the non-nohz_full() case.
So maybe it's better to move the fast path code into the non-nohz_full()
case (see the sketch after the list below)? With this change, you may get
the following benefits:
1. No performance impact in the nohz_full() case (either local or remote
domain).
2. Improved performance in the non-nohz_full() case, as you intended in
this patch.
3. The fast path focuses on fixing the right performance bottleneck.
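Something along these lines (an untested sketch; tick_nohz_full_cpu()
stands in for whatever condition currently selects smp_call_function_any(),
and any extra conditions in that check would stay as they are today):

	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);

	if (tick_nohz_full_cpu(cpu)) {
		/*
		 * smp_call_function_any() already runs mon_event_count()
		 * locally when the calling CPU is in @cpumask, so no extra
		 * fast path is needed here.
		 */
		smp_call_function_any(cpumask, mon_event_count, rr, 1);
	} else {
		bool local_read = false;

		/*
		 * smp_call_on_cpu() unconditionally queues work, so read
		 * locally when the calling CPU is in the domain. Preemption
		 * stays disabled only across the local read to prevent a
		 * migration out of the domain, and is reenabled before
		 * smp_call_on_cpu(), which can sleep.
		 */
		preempt_disable();
		if (cpumask_test_cpu(smp_processor_id(), cpumask)) {
			mon_event_count(rr);
			local_read = true;
		}
		preempt_enable();

		if (!local_read)
			smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
	}

	resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);

That way get_cpu()/put_cpu() is never doubled up on the nohz_full() path,
and preemption is reenabled before smp_call_on_cpu(), which can sleep.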
> +
> cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>
> /*
> @@ -554,6 +579,9 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> else
> smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>
> + /* If smp_call_function_any() was used, preemption is reenabled here. */
> + put_cpu();
> +
> resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> }
>
Thanks.
-Fenghua