Message-ID: <4e2f31bf-1035-40a6-b16d-081d6e45ead8@intel.com>
Date: Tue, 5 Nov 2024 15:20:27 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>, Fenghua Yu <fenghua.yu@...el.com>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>, Tony Luck
<tony.luck@...el.com>, Babu Moger <babu.moger@....com>, James Morse
<james.morse@....com>, Martin Kletzander <nert.pinx@...il.com>, Shaopeng Tan
<tan.shaopeng@...itsu.com>, <linux-kernel@...r.kernel.org>,
<eranian@...gle.com>
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter
reads
Hi Peter,
On 11/5/24 3:25 AM, Peter Newman wrote:
> Hi Fenghua,
>
> On Mon, Nov 4, 2024 at 11:36 PM Fenghua Yu <fenghua.yu@...el.com> wrote:
>>
>> Hi, Peter,
>>
>> On 10/31/24 07:25, Peter Newman wrote:
>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> index 200d89a640270..daaff1cfd3f24 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> @@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>> return;
>>> }
>>>
>>> + /*
>>> + * If a performance-conscious caller has gone to the trouble of binding
>>> + * their thread to the monitoring domain of the event counter, ensure
>>> + * that the counters are read directly. smp_call_on_cpu()
>>> + * unconditionally uses a work queue to read the counter, substantially
>>> + * increasing the cost of the read.
>>> + *
>>> + * Preemption must be disabled to prevent a migration out of the domain
>>> + * after the CPU is checked, which would result in reading the wrong
>>> + * counters. Note that this makes the (slow) remote path a little slower
>>> + * by requiring preemption to be reenabled when redirecting the request
>>> + * to another domain was in fact necessary.
>>> + *
>>> + * In the case where all eligible target CPUs are nohz_full and
>>> + * smp_call_function_any() is used, keep preemption disabled to avoid
>>> + * the cost of reenabling it twice in the same read.
>>> + */
>>> + cpu = get_cpu();
>>> + if (cpumask_test_cpu(cpu, cpumask)) {
>>> + mon_event_count(rr);
>>> + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
>>> + put_cpu();
>>> + return;
>>> + }
>>
>> This fast path code duplicates part of smp_call_function_any().
>>
>> In the nohz_full() case, the fast path doesn't gain much and even hurts
>> remote domain performance:
>> 1. On the local domain, it may gain a little because it executes a few
>> lines less than directly calling smp_call_function_any(). But the gain
>> comes only from skipping those few lines, not from avoiding heavyweight
>> queued work.
>>
>> 2. On a remote domain, it degrades performance because get_cpu() and
>> put_cpu() are both called twice: once in the fast path code and once in
>> smp_call_function_any(). As you mentioned earlier, put_cpu() impacts
>> performance. I think get_cpu() has the same impact too.
>
> get_cpu() and put_cpu() nest, so only the put_cpu() that reduces the
> preempt count to 0 will call into the scheduler. See the source
> comment I had added below.
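>
> To illustrate with a tiny snippet (not the resctrl code itself, just
> the preempt-count bookkeeping I am relying on):
>
>	int cpu = get_cpu();		/* preempt_count 0 -> 1 */
>
>	WARN_ON_ONCE(get_cpu() != cpu);	/* nested: 1 -> 2, same CPU */
>	put_cpu();			/* 2 -> 1: cannot reschedule here */
>	put_cpu();			/* 1 -> 0: the only point where the
>					 * scheduler may be entered */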
>
> But... note that below, smp_call_on_cpu() is now called with preemption
> disabled. (Looks like I only benchmarked and never ran a debug
> build...) I will have to change the patch to make sure put_cpu() is
> called before smp_call_on_cpu().
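>
> Roughly this (untested sketch; the tick_nohz_full_cpu() branch and the
> helpers cpumask_any_housekeeping()/smp_mon_event_count() are what I
> believe the existing mon_event_read() uses, so adjust as needed):
>
>	cpu = get_cpu();
>	if (cpumask_test_cpu(cpu, cpumask)) {
>		mon_event_count(rr);
>		resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
>		put_cpu();
>		return;
>	}
>
>	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>	if (tick_nohz_full_cpu(cpu)) {
>		/* The IPI path is fine with preemption disabled. */
>		smp_call_function_any(cpumask, mon_event_count, rr, 1);
>		put_cpu();
>	} else {
>		/* smp_call_on_cpu() sleeps: re-enable preemption first. */
>		put_cpu();
>		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>	}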
>
>
>>
>> The fast path only gains in the non-nohz_full() case.
>>
>> So maybe it's better to move the fast path code into the non-nohz_full()
>> case (see the sketch after the list below)? With this change, you would
>> get the following benefits:
>>
>> 1. No performance impact in the nohz_full() case (either local or
>> remote domain).
>> 2. Improved performance in the non-nohz_full() case, as you intended in
>> this patch.
>> 3. The fast path focuses on fixing the right performance bottleneck.
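>>
>> Something along these lines (just a sketch of the idea, reusing your
>> fast path block; the tick_nohz_full_cpu() check and helper names are
>> assumed from the existing mon_event_read()):
>>
>>	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
>>	if (tick_nohz_full_cpu(cpu)) {
>>		/* smp_call_function_any() already has a local-CPU check. */
>>		smp_call_function_any(cpumask, mon_event_count, rr, 1);
>>	} else {
>>		int this_cpu = get_cpu();
>>
>>		if (cpumask_test_cpu(this_cpu, cpumask)) {
>>			/* Local read, no kworker bounce. */
>>			mon_event_count(rr);
>>			put_cpu();
>>		} else {
>>			put_cpu();
>>			smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>>		}
>>	}
>>	/* Fall through to the existing resctrl_arch_mon_ctx_free(). */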
>
> The consequence of reusing the current-cpu-in-mask check in
> smp_call_function_any() is that if the check fails, it could cause
> resctrl_arch_rmid_read() to fail by invoking it in an IPI handler when
> it would have succeeded if invoked on a kernel worker, undoing James's
> original work.
I think this change already undoes the motivation for 09909e098113
("x86/resctrl: Queue mon_event_read() instead of sending an IPI")? As you mention in
the changelog, the goal of that work was to enable resctrl_arch_rmid_read() to sleep.
This change will call resctrl_arch_rmid_read() with preemption disabled if
it happens to be called on a CPU in the monitoring domain. Would that not make
MPAM monitor count reads from a CPU in the domain a bug?
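
A minimal illustration of the concern (not the actual MPAM code; just the
generic debug check that fires when a sleeping call is reached with
preemption disabled):

	preempt_disable();	/* what get_cpu() does in the fast path */
	might_sleep();		/* "BUG: sleeping function called from
				 * invalid context" under
				 * CONFIG_DEBUG_ATOMIC_SLEEP */
	preempt_enable();
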
Could you please try out this patch with CONFIG_DEBUG_ATOMIC_SLEEP=y?
Reinette