Message-ID: <20251205092140.48fa5271@pumpkin>
Date: Fri, 5 Dec 2025 09:21:40 +0000
From: david laight <david.laight@...box.com>
To: Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>
Cc: Dave Hansen <dave.hansen@...el.com>, Nikolay Borisov
<nik.borisov@...e.com>, x86@...nel.org, David Kaplan
<david.kaplan@....com>, "H. Peter Anvin" <hpa@...or.com>, Josh Poimboeuf
<jpoimboe@...nel.org>, Sean Christopherson <seanjc@...gle.com>, Paolo
Bonzini <pbonzini@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, Asit Mallick <asit.k.mallick@...el.com>, Tao Zhang
<tao1.zhang@...el.com>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on
newer CPUs
On Thu, 4 Dec 2025 13:56:02 -0800
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com> wrote:
> On Thu, Dec 04, 2025 at 09:15:11AM +0000, david laight wrote:
> > On Wed, 3 Dec 2025 17:40:26 -0800
> > Pawan Gupta <pawan.kumar.gupta@...ux.intel.com> wrote:
> >
> > > On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> > > > On Mon, 24 Nov 2025 11:31:26 -0800
> > > > Pawan Gupta <pawan.kumar.gupta@...ux.intel.com> wrote:
> > > >
> > > > > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:
> > > > ...
> > > > > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > > > > a doubling of the execution time of a largely single-threaded task that
> > > > > > spends almost all its time in userspace!
> > > > > > (I thought I'd disabled it at compile time - but the config option
> > > > > > changed underneath me...)
> > > > >
> > > > > That is surprising. If its okay, could you please share more details about
> > > > > this application? Or any other way I can reproduce this?
> > > >
> > > > The 'trigger' program is a multi-threaded program that wakes up every 10ms
> > > > to process RTP and TDM audio data.
> > > > So we have a low RT priority process with one thread per cpu.
> > > > Since they are RT they usually get scheduled on the same cpu as last time.
> > > > I think this simple program will have the desired effect:
> > > > A main process that does:
> > > >         syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
> > > >         start_time += 1sec;
> > > >         for (n = 1; n < num_cpu; n++)
> > > >                 pthread_create(thread_code, start_time);
> > > >         thread_code(start_time);
> > > > with:
> > > > thread_code(ts)
> > > > {
> > > >         for (;;) {
> > > >                 ts += 10ms;
> > > >                 syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > >                 do_work();
> > > >         }
> > > > }
> > > >
> > > > So all the threads wake up at exactly the same time every 10ms.
> > > > (You need to use syscall(), don't look at what glibc does.)
> > > >
> > > > On my system the program wasn't doing anything, so do_work() was empty.
> > > > What matters is whether all the threads end up running at the same time.
> > > > I managed that using pthread_broadcast(), but the clock code above
> > > > ought to be worse (and I've since changed the daemon to work that way
> > > > to avoid all these issues with pthread_broadcast() being sequential
> > > > and threads not running because the target cpu is running an ISR or
> > > > just looping in the kernel).
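
To make that concrete, here is a rough compilable sketch of the trigger
program described above.  NUM_THREADS and the empty do_work() are
placeholders; the real daemon runs one low RT priority thread per cpu.

/* Rough sketch of the 'trigger' program: every thread sleeps on the
 * same absolute CLOCK_MONOTONIC deadline, so they all wake together
 * every 10ms.  The syscalls are made directly, bypassing glibc.
 * Build with: cc -O2 trigger.c -lpthread
 */
#include <pthread.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NUM_THREADS     4               /* placeholder: one per cpu */
#define PERIOD_NS       10000000L       /* 10ms */

static void do_work(void) { /* empty - waking up is all that matters */ }

static void *thread_code(void *arg)
{
        struct timespec ts = *(struct timespec *)arg;

        for (;;) {
                ts.tv_nsec += PERIOD_NS;
                if (ts.tv_nsec >= 1000000000L) {
                        ts.tv_nsec -= 1000000000L;
                        ts.tv_sec++;
                }
                syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC,
                        TIMER_ABSTIME, &ts, NULL);
                do_work();
        }
        return NULL;
}

int main(void)
{
        struct timespec start;
        pthread_t tid;
        int n;

        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start);
        start.tv_sec += 1;              /* first wakeup in one second */

        for (n = 1; n < NUM_THREADS; n++)
                pthread_create(&tid, NULL, thread_code, &start);
        thread_code(&start);
        return 0;
}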
> > > >
> > > > The process that gets 'hit' is anything cpu bound.
> > > > Even a shell loop (eg while :; do :; done) with a counter will do.
> > > >
> > > > Without the 'trigger' program, it will (mostly) sit on one cpu and the
> > > > clock frequency of that cpu will increase to (say) 3GHz while the others
> > > > all run at 800MHz.
> > > > But the 'trigger' program runs threads on all the cpus at the same time.
> > > > So the 'hit' program is pre-empted and is later rescheduled on a
> > > > different cpu - running at 800MHz.
> > > > The cpu speed increases, but 10ms later it gets bounced again.
> > >
> > > Sorry I haven't tried creating this test yet.
> > >
> > > > The real issue is that the cpu speed is associated with the cpu, not
> > > > the process running on it.
> > >
> > > So if the 'hit' program gets scheduled to a CPU that is running at 3GHz
> > > then we don't expect a dramatic performance drop? Setting scaling_governor
> > > to "performance" would be an interesting test.
> >
> > I failed to find a way to lock the cpu frequency (for other testing) on
> > that system (an i7-7xxx) - and the system will start thermally throttling
> > if you aren't careful.
>
> i7-7xxx would be Kaby Lake gen, those shouldn't need to deploy BHB clear
> mitigation. I am guessing it is the legacy-IBRS mitigation in your case.
>
> What you described looks very similar to the issue fixed by commit:
>
> aa1567a7e644 ("intel_idle: Add ibrs_off module parameter to force-disable IBRS")
>
> Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle")
> disables IBRS when the cstate is 6 or lower. However, there are
> some use cases where a customer may want to use max_cstate=1 to
> lower latency. Such use cases will suffer from the performance
> degradation caused by the enabling of IBRS in the sibling idle thread.
> Add a "ibrs_off" module parameter to force disable IBRS and the
> CPUIDLE_FLAG_IRQ_ENABLE flag if set.
>
> In the case of a Skylake server with max_cstate=1, this new ibrs_off
> option will likely increase the IRQ response latency as IRQ will now
> be disabled.
>
> When running SPECjbb2015 with cstates set to C1 on a Skylake system.
>
> First test when the kernel is booted with: "intel_idle.ibrs_off":
>
> max-jOPS = 117828, critical-jOPS = 66047
>
> Then retest when the kernel is booted without the "intel_idle.ibrs_off"
> added:
>
> max-jOPS = 116408, critical-jOPS = 58958
>
> That means booting with "intel_idle.ibrs_off" improves performance by:
>
> max-jOPS: +1.2%, which could be considered noise range.
> critical-jOPS: +12%, which is definitely a solid improvement.
No, it wasn't anything to do with sibling threads.
It was the simple issue of the single-threaded 'busy in userspace' program
getting migrated to an idle cpu running at a low clock frequency.
The IBRS mitigation just affected the timings of the other processes in the
system enough to force the user thread to be pre-empted and rescheduled.
So it was not directly related to this code - even though it caused it.
The real issue is the cpu speed being tied to the physical cpu, not the
thread running on it.
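FWIW the effect is easy enough to watch from userspace; a rough sketch
(assuming the cpufreq sysfs files are present) that prints each cpu's
scaling_cur_freq once a second while the 'trigger' and 'hit' programs run:

/* Rough sketch: print each cpu's current frequency (kHz) once a second,
 * assuming the cpufreq sysfs interface is available, to watch the busy
 * task keep landing on slow cpus.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
        char path[128], buf[32];
        long cpu;
        FILE *f;

        for (;;) {
                for (cpu = 0; cpu < nr_cpus; cpu++) {
                        snprintf(path, sizeof(path),
                                 "/sys/devices/system/cpu/cpu%ld/cpufreq/scaling_cur_freq",
                                 cpu);
                        f = fopen(path, "r");
                        if (!f)
                                continue;
                        if (fgets(buf, sizeof(buf), f))
                                printf("cpu%ld: %s", cpu, buf);  /* value includes newline */
                        fclose(f);
                }
                printf("\n");
                sleep(1);
        }
        return 0;
}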
>
> > ISTR that the hardware does most of the work.
> > So I'm not sure what difference "performance" makes (and can't remember what
> > might be set for that system - could set it anyway.)
>
> > We did have to disable some of the low power states; waking the cpu from those
> > just takes far too long.
>
> Seems like you have a workaround in place already.
I just needed to find out why my fpga compile had gone up from 12 minutes
to over 20 with a kernel update.
Fixing that was easy, but the 'busy thread being migrated to an idle cpu'
is a separate issue that could affect a lot of workloads.
(Whether or not these mitigations are in place.)
Diagnosing it required looking at the scheduler ftrace events and then
realising what effect they would have.
It wouldn't surprise me if people have 'fixed' the problem by pinning
a process to a specific cpu; I couldn't try that because the fpga compiler
has some multithreaded parts.
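For a purely single-threaded job that workaround is simple enough - taskset,
or something along the lines of the sketch below using sched_setaffinity() -
it just doesn't help when parts of the compile are multithreaded.

/* Sketch: pin the current process (and anything it execs) to one cpu
 * with sched_setaffinity(), roughly what 'taskset -c 2 <cmd>' does.
 * The cpu number 2 is an arbitrary example.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);                               /* arbitrary example cpu */
        if (sched_setaffinity(0, sizeof(set), &set))    /* 0 = this process */
                perror("sched_setaffinity");

        /* ... exec the cpu-bound program, or do the work here ... */
        return 0;
}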
David