Message-ID: <4e2f31bf-1035-40a6-b16d-081d6e45ead8@intel.com>
Date: Tue, 5 Nov 2024 15:20:27 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>, Fenghua Yu <fenghua.yu@...el.com>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>, Tony Luck
<tony.luck@...el.com>, Babu Moger <babu.moger@....com>, James Morse
<james.morse@....com>, Martin Kletzander <nert.pinx@...il.com>, Shaopeng Tan
<tan.shaopeng@...itsu.com>, <linux-kernel@...r.kernel.org>,
<eranian@...gle.com>
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter
reads
Hi Peter,
On 11/5/24 3:25 AM, Peter Newman wrote:
> Hi Fenghua,
>
> On Mon, Nov 4, 2024 at 11:36 PM Fenghua Yu <fenghua.yu@...el.com> wrote:
>>
>> Hi, Peter,
>>
>> On 10/31/24 07:25, Peter Newman wrote:
>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> index 200d89a640270..daaff1cfd3f24 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> @@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>> return;
>>> }
>>>
>>> + /*
>>> + * If a performance-conscious caller has gone to the trouble of binding
>>> + * their thread to the monitoring domain of the event counter, ensure
>>> + * that the counters are read directly. smp_call_on_cpu()
>>> + * unconditionally uses a work queue to read the counter, substantially
>>> + * increasing the cost of the read.
>>> + *
>>> + * Preemption must be disabled to prevent a migration out of the domain
>>> + * after the CPU is checked, which would result in reading the wrong
>>> + * counters. Note that this makes the (slow) remote path a little slower
>>> + * by requiring preemption to be reenabled when redirecting the request
>>> + * to another domain was in fact necessary.
>>> + *
>>> + * In the case where all eligible target CPUs are nohz_full and
>>> + * smp_call_function_any() is used, keep preemption disabled to avoid
>>> + * the cost of reenabling it twice in the same read.
>>> + */
>>> + cpu = get_cpu();
>>> + if (cpumask_test_cpu(cpu, cpumask)) {
>>> + mon_event_count(rr);
>>> + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
>>> + put_cpu();
>>> + return;
>>> + }
>>
>> This fast path code duplicates part of smp_call_function_any().
>>
>> In the nohz_full() case, the fast path doesn't gain much and even hurts
>> remote domain performance:
>> 1. On the local domain, it may gain a little because it executes a few
>> lines less than directly calling smp_call_function_any(). But the gain
>> comes only from skipping those few lines, not from avoiding heavyweight
>> queued work.
>>
>> 2. On a remote domain, it degrades performance because get_cpu() and
>> put_cpu() are both called twice: once in the fast path code and once in
>> smp_call_function_any(). As you mentioned earlier, put_cpu() impacts
>> performance. I think get_cpu() has the same impact too.
>
> get_cpu() and put_cpu() nest, so only the put_cpu() that reduces the
> preempt count to 0 will call into the scheduler. See the source
> comment I had added below.
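>
> To illustrate with a tiny snippet (not the resctrl code itself, just
> the preempt-count bookkeeping I am relying on):
>
>	int cpu = get_cpu();		/* preempt_count 0 -> 1 */
>
>	WARN_ON_ONCE(get_cpu() != cpu);	/* nested: 1 -> 2, same CPU */
>	put_cpu();			/* 2 -> 1: cannot reschedule here */
>	put_cpu();			/* 1 -> 0: the only point where the
>					 * scheduler may be entered */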
>
> But... note that below, smp_call_on_cpu() is now called with preemption
> disabled. (Looks like I only benchmarked and never ran a debug
> build...) I will have to change the patch to make sure put_cpu() is
> called before smp_call_on_cpu().
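>
> Roughly this (untested sketch; the tick_nohz_full_cpu() branch and the
> helpers cpumask_any_housekeeping()/smp_mon_event_count() are what I
> believe the existing mon_event_read() uses, so adjust as needed):
>
>	cpu = get_cpu();
>	if (cpumask_test_cpu(cpu, cpumask)) {
>		mon_event_count(rr);
>		resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
>		put_cpu();
>		return;
>	}
>
>	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>	if (tick_nohz_full_cpu(cpu)) {
>		/* The IPI path is fine with preemption disabled. */
>		smp_call_function_any(cpumask, mon_event_count, rr, 1);
>		put_cpu();
>	} else {
>		/* smp_call_on_cpu() sleeps: re-enable preemption first. */
>		put_cpu();
>		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>	}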
>
>
>>
>> The fast path only gains in the non-nohz_full() case.
>>
>> So maybe it's better to move the fast path code into the non-nohz_full()
>> case (see the sketch after the list below)? With this change, you would
>> get the following benefits:
>>
>> 1. No performance impact in the nohz_full() case (either local or
>> remote domain).
>> 2. Improved performance in the non-nohz_full() case, as you intended in
>> this patch.
>> 3. The fast path focuses on fixing the right performance bottleneck.
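>>
>> Something along these lines (just a sketch of the idea, reusing your
>> fast path block; the tick_nohz_full_cpu() check and helper names are
>> assumed from the existing mon_event_read()):
>>
>>	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>>	if (tick_nohz_full_cpu(cpu)) {
>>		/* smp_call_function_any() already has a local-CPU check. */
>>		smp_call_function_any(cpumask, mon_event_count, rr, 1);
>>	} else {
>>		int this_cpu = get_cpu();
>>
>>		if (cpumask_test_cpu(this_cpu, cpumask)) {
>>			/* Local read, no kworker bounce. */
>>			mon_event_count(rr);
>>			put_cpu();
>>		} else {
>>			put_cpu();
>>			smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>>		}
>>	}
>>	/* Fall through to the existing resctrl_arch_mon_ctx_free(). */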
>
> The consequence of reusing the current-cpu-in-mask check in
> smp_call_function_any() is that if the check fails, it could cause
> resctrl_arch_rmid_read() to fail by invoking it in an IPI handler when
> it would have succeeded if invoked on a kernel worker, undoing James's
> original work.
I think this change already undoes the motivation for 09909e098113
("x86/resctrl: Queue mon_event_read() instead of sending an IPI")? As you mention in
the changelog, the goal of that work was to enable resctrl_arch_rmid_read() to sleep.
This change will call resctrl_arch_rmid_read() with preemption disabled if
it happens to be called on a CPU in the monitoring domain. Would that not make
MPAM monitor count reads from a CPU in the domain a bug?
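
A minimal illustration of the concern (not the actual MPAM code; just the
generic debug check that fires when a sleeping call is reached with
preemption disabled):

	preempt_disable();	/* what get_cpu() does in the fast path */
	might_sleep();		/* "BUG: sleeping function called from
				 * invalid context" under
				 * CONFIG_DEBUG_ATOMIC_SLEEP */
	preempt_enable();
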
Could you please try out this patch with CONFIG_DEBUG_ATOMIC_SLEEP=y?
Reinette