Message-ID: <2e2636ef-eb0f-067d-ef8b-a95e762dbf9f@intel.com>
Date: Mon, 4 Nov 2024 19:29:00 -0800
From: Fenghua Yu <fenghua.yu@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>, Peter Newman <peternewman@...gle.com>,
"Chatre, Reinette" <reinette.chatre@...el.com>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "H . Peter Anvin" <hpa@...or.com>, "Babu
Moger" <babu.moger@....com>, James Morse <james.morse@....com>, "Martin
Kletzander" <nert.pinx@...il.com>, Shaopeng Tan <tan.shaopeng@...itsu.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Eranian,
Stephane" <eranian@...gle.com>
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter
reads
Hi, Tony,
On 11/4/24 16:12, Luck, Tony wrote:
>> Whenever this function is called, the performance is degraded rather
>> than improved because extra get_cpu()/put_cpu() are called in the fast
>> path in the current patch.
>
> But get_cpu()/put_cpu() aren't high overhead. Maybe they cost less than
> the cpumask_any_housekeeping() call that is avoided by Peter's patch.
Quote from Peter:

"AMD EPYC 7B12 64-Core Processor (250 mon groups)
  Local Domain:  3.25M -> 1.22M (-62.5%)
  Remote Domain: 7.91M -> 8.05M (+2.9%)

Intel(R) Xeon(R) Gold 6268CL CPU @ 2.80GHz (190 mon groups)
  Local Domain:  2.98M -> 2.21M (-25.8%)
  Remote Domain: 4.49M -> 4.62M (+3.1%)

Note that there is a small increase in overhead for remote domains,
which results from the introduction of a put_cpu() call to reenable
preemption after determining whether the fast path can be used."
As his data shows, when the fast path is not taken, the extra put_cpu()
itself costs +2.9% extra time on the AMD machine and +3.1% extra time on
the Intel machine.

And this ~3% overhead is on top of the queued work, which is more
expensive than cpumask_any_housekeeping(), IIUC.
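For reference, this is roughly the shape of the fast path being
discussed (an illustrative sketch only, not Peter's exact patch; the
names d->cpu_mask, rr and mon_event_count() follow the resctrl code):

	cpu = get_cpu();	/* disables preemption */
	if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
		/*
		 * Fast path: already on a CPU in the target domain,
		 * read the counter locally, no IPI or queued work.
		 */
		mon_event_count(rr);
		put_cpu();
		return;
	}
	put_cpu();	/* reenables preemption: the extra cost seen
			 * on remote domain reads */
	/* ... fall through to the existing slow path ... */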
>
> Note that if Peter's patch doesn't take its fast path because the calling
> CPU was in the wrong domain, then the subsequent code is going to
> do an IPI whichever of the if/else paths is taken.
In this case, an IPI is actually only sent by smp_call_function_any();
smp_call_on_cpu() queues work on the target CPU instead of sending an
IPI.
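For reference, the if/else in question looks roughly like this
(simplified from mon_event_read() in
arch/x86/kernel/cpu/resctrl/ctrlmondata.c):

	cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);

	if (tick_nohz_full_cpu(cpu))
		/* All CPUs in the domain are nohz_full: IPI one of them. */
		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
	else
		/* Queue work on a housekeeping CPU: no IPI is sent. */
		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);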
My proposed change doesn't logically change Peter's fast path or the
performance of the nohz_full/smp_call_on_cpu() case. It just utilizes
the fast path that is already built into smp_call_function_any() to save
the extra get_cpu() and put_cpu(). Hopefully the saved get_cpu() and
put_cpu() can offset the cost of cpumask_any_housekeeping().
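That built-in fast path looks roughly like this (condensed from
smp_call_function_any() in kernel/smp.c; the same-node preference in
the slow case is elided here):

	int smp_call_function_any(const struct cpumask *mask,
				  smp_call_func_t func, void *info, int wait)
	{
		unsigned int cpu;
		int ret;

		/* Try for same CPU (cheapest option). */
		cpu = get_cpu();
		if (cpumask_test_cpu(cpu, mask))
			goto call;

		/* Slow case: pick any online CPU from the mask. */
		cpu = cpumask_any_and(mask, cpu_online_mask);
	call:
		/* Runs func locally when cpu is the current CPU. */
		ret = smp_call_function_single(cpu, func, info, wait);
		put_cpu();
		return ret;
	}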
From Peter's commit message, it seems the nohz_full case was not
exercised/measured much, if at all. If there are only one or a very few
housekeeping CPUs on a large system, the nohz_full case will be hit
frequently, the fast path will fail most of the time, and the extra
get_cpu()/put_cpu() around the fast path might have a bigger impact on
both local and remote domain reads.
Thanks.
-Fenghua