Message-ID: <b3a52af9-e298-7baa-19b3-8931d03731d1@intel.com>
Date: Mon, 4 Nov 2024 14:36:15 -0800
From: Fenghua Yu <fenghua.yu@...el.com>
To: Peter Newman <peternewman@...gle.com>, Reinette Chatre
<reinette.chatre@...el.com>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>, Tony Luck
<tony.luck@...el.com>, Babu Moger <babu.moger@....com>, James Morse
<james.morse@....com>, Martin Kletzander <nert.pinx@...il.com>, Shaopeng Tan
<tan.shaopeng@...itsu.com>, <linux-kernel@...r.kernel.org>,
<eranian@...gle.com>
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter
reads
Hi, Peter,
On 10/31/24 07:25, Peter Newman wrote:
> Performance-conscious users may use threads bound to CPUs within a
> specific monitoring domain to ensure that all bandwidth counters can be
> read efficiently. The hardware counters are only accessible to CPUs
> within the domain, so requests from CPUs outside the domain are
> forwarded to a kernel worker or IPI handler, incurring a substantial
> performance penalty on each read. Recently, this penalty was observed
> to be paid by local reads as well.
>
> To support blocking implementations of resctrl_arch_rmid_read(),
> mon_event_read() switched to smp_call_on_cpu() in most cases to read
> event counters using a kernel worker thread. Unlike
> smp_call_function_any(), which optimizes to a local function call when
> the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
> work unconditionally.
>
> Add a fast path to ensure that requests issued from a CPU within the
> monitoring domain are handled with a simple function call into
> mon_event_count(), regardless of whether all CPUs in the target domain
> are using nohz_full.
>
> This is significant when supporting configurations such as a dual-socket
> AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
> counters for all groups on all domains requires 32768 (32*256*2) counter
> reads. The resolution of global, per-group MBM data which can be
> provided is therefore sensitive to the cost of each counter read.
> Furthermore, redirecting this much work to IPI handlers or worker
> threads at a regular interval is disruptive to the present workload.
>
> The test program fastcat, which was introduced in an earlier patch, was
> used to simulate the impact of this change on an optimized event
> counter-reading procedure. The goal is to maximize the frequency at
> which MBM counters can be dumped, so the benchmark determines the cost
> of an additional global MBM counter sample.
>
> The total number of cycles needed to read all local and total MBM
> counters for a large number of monitoring groups was collected using the
> perf tool. The test was run bound to a single CPU: once targeting
> counters in the local domain and again for counters in a remote domain.
> The cost of a dry-run reading no counters was subtracted from the total
> of each run to remove one-time setup costs.
>
> AMD EPYC 7B12 64-Core Processor (250 mon groups)
>
> Local Domain: 3.25M -> 1.22M (-62.5%)
> Remote Domain: 7.91M -> 8.05M (+2.9%)
>
> Intel(R) Xeon(R) Gold 6268CL CPU @ 2.80GHz (190 mon groups)
>
> Local Domain: 2.98M -> 2.21M (-25.8%)
> Remote Domain: 4.49M -> 4.62M (+3.1%)
>
> Note that there is a small increase in overhead for remote domains,
> which results from the introduction of a put_cpu() call to reenable
> preemption after determining whether the fast path can be used. Users
> sensitive to this cost should consider avoiding the remote counter read
> penalty completely.
>
> Also note that the Remote Domain results and the baseline Local Domain
> results only measure cycles in the test program. Because all counter
> reading work was carried out in kernel worker threads, the total system
> cost of the operation is greater.
>
> Fixes: 09909e098113 ("x86/resctrl: Queue mon_event_read() instead of sending an IPI")
>
> Signed-off-by: Peter Newman <peternewman@...gle.com>
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 +++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 200d89a640270..daaff1cfd3f24 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> return;
> }
>
> + /*
> + * If a performance-conscious caller has gone to the trouble of binding
> + * their thread to the monitoring domain of the event counter, ensure
> + * that the counters are read directly. smp_call_on_cpu()
> + * unconditionally uses a work queue to read the counter, substantially
> + * increasing the cost of the read.
> + *
> + * Preemption must be disabled to prevent a migration out of the domain
> + * after the CPU is checked, which would result in reading the wrong
> + * counters. Note that this makes the (slow) remote path a little slower
> + * by requiring preemption to be reenabled when redirecting the request
> + * to another domain was in fact necessary.
> + *
> + * In the case where all eligible target CPUs are nohz_full and
> + * smp_call_function_any() is used, keep preemption disabled to avoid
> + * the cost of reenabling it twice in the same read.
> + */
> + cpu = get_cpu();
> + if (cpumask_test_cpu(cpu, cpumask)) {
> + mon_event_count(rr);
> + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> + put_cpu();
> + return;
> + }
This fast path code duplicates part of smp_call_function_any().
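For reference, smp_call_function_any() already begins with roughly the
following local-CPU check (paraphrased from kernel/smp.c, with the
node-local fallback elided):

	/* Try for same CPU (cheapest) */
	cpu = get_cpu();
	if (cpumask_test_cpu(cpu, mask))
		goto call;

	/* ... otherwise pick a CPU on the same node, or any online CPU ... */
call:
	ret = smp_call_function_single(cpu, func, info, wait);
	put_cpu();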
In the nohz_full() case, the fast path doesn't gain much and can even hurt
remote domain performance:
1. On the local domain, it may gain a little because it executes a few
lines less than calling smp_call_function_any() directly. But the gain is
minor, coming from slightly less code rather than from avoiding the
heavyweight queued work.
2. On a remote domain, it degrades performance because get_cpu() and
put_cpu() are each called twice: once in the fast path code and once in
smp_call_function_any(). As you mentioned earlier, put_cpu() impacts
performance. I think get_cpu() has the same impact too.
The fast path only gains in the non-nohz_full() case.
So maybe it's better to move the fast path code into the non-nohz_full()
case (see the sketch after the list below)? With this change, you may get
the following benefits:
1. No performance impact in the nohz_full() case (either local or remote
domain).
2. Improved performance in the non-nohz_full() case, as you intended in
this patch.
3. The fast path focuses on fixing the right performance bottleneck.
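Something along these lines (an untested sketch; tick_nohz_full_cpu()
stands in for whatever condition currently selects smp_call_function_any(),
and any extra conditions in that check would stay as they are today):

	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);

	if (tick_nohz_full_cpu(cpu)) {
		/*
		 * smp_call_function_any() already runs mon_event_count()
		 * locally when the calling CPU is in @cpumask, so no extra
		 * fast path is needed here.
		 */
		smp_call_function_any(cpumask, mon_event_count, rr, 1);
	} else {
		bool local_read = false;

		/*
		 * smp_call_on_cpu() unconditionally queues work, so read
		 * locally when the calling CPU is in the domain. Preemption
		 * stays disabled only across the local read to prevent a
		 * migration out of the domain, and is reenabled before
		 * smp_call_on_cpu(), which can sleep.
		 */
		preempt_disable();
		if (cpumask_test_cpu(smp_processor_id(), cpumask)) {
			mon_event_count(rr);
			local_read = true;
		}
		preempt_enable();

		if (!local_read)
			smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
	}

	resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);

That way get_cpu()/put_cpu() is never doubled up on the nohz_full() path,
and preemption is reenabled before smp_call_on_cpu(), which can sleep.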
> +
> cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>
> /*
> @@ -554,6 +579,9 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> else
> smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>
> + /* If smp_call_function_any() was used, preemption is reenabled here. */
> + put_cpu();
> +
> resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> }
>
Thanks.
-Fenghua