Message-ID: <smt7yrupcypkjsfrtlwp6kznol3mrgrer63plubwfp2hcunoul@yi5rbq5r3w5j>
Date: Thu, 4 Dec 2025 13:56:02 -0800
From: Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>
To: david laight <david.laight@...box.com>
Cc: Dave Hansen <dave.hansen@...el.com>, 
	Nikolay Borisov <nik.borisov@...e.com>, x86@...nel.org, David Kaplan <david.kaplan@....com>, 
	"H. Peter Anvin" <hpa@...or.com>, Josh Poimboeuf <jpoimboe@...nel.org>, 
	Sean Christopherson <seanjc@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, 
	Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, Asit Mallick <asit.k.mallick@...el.com>, 
	Tao Zhang <tao1.zhang@...el.com>, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v4 04/11] x86/bhi: Make clear_bhb_loop() effective on
 newer CPUs

On Thu, Dec 04, 2025 at 09:15:11AM +0000, david laight wrote:
> On Wed, 3 Dec 2025 17:40:26 -0800
> Pawan Gupta <pawan.kumar.gupta@...ux.intel.com> wrote:
> 
> > On Tue, Nov 25, 2025 at 11:34:07AM +0000, david laight wrote:
> > > On Mon, 24 Nov 2025 11:31:26 -0800
> > > Pawan Gupta <pawan.kumar.gupta@...ux.intel.com> wrote:
> > >   
> > > > On Sat, Nov 22, 2025 at 11:05:58AM +0000, david laight wrote:  
> > > ...  
> > > > > For subtle reasons one of the mitigations that slows kernel entry caused
> > > > > a doubling of the execution time of a largely single-threaded task that
> > > > > spends almost all its time in userspace!
> > > > > (I thought I'd disabled it at compile time - but the config option
> > > > > changed underneath me...)    
> > > > 
> > > > That is surprising. If it's okay, could you please share more details about
> > > > this application? Or any other way I can reproduce this?  
> > > 
> > > The 'trigger' program is a multi-threaded program that wakes up every 10ms
> > > to process RTP and TDM audio data.
> > > So we have a low RT priority process with one thread per cpu.
> > > Since they are RT they usually get scheduled on the same cpu as last time.
> > > I think this simple program will have the desired effect:
> > > A main process that does:
> > > 	syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &start_time);
> > > 	start_time += 1sec;
> > > 	for (n = 1; n < num_cpu; n++)
> > > 		pthread_create(thread_code, start_time);
> > > 	thread_code(start_time);
> > > with:
> > > thread_code(ts)
> > > {
> > > 	for (;;) {
> > > 		ts += 10ms;
> > > 		syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > 		do_work();
> > > 	}
> > > }
> > > 
> > > So all the threads wake up at exactly the same time every 10ms.
> > > (You need to use syscall(), don't look at what glibc does.)
> > > 
> > > On my system the program wasn't doing anything, so do_work() was empty.
> > > What matters is whether all the threads end up running at the same time.
> > > I managed that using pthread_cond_broadcast(), but the clock code above
> > > ought to be worse (and I've since changed the daemon to work that way
> > > to avoid all these issues with pthread_cond_broadcast() being sequential
> > > and threads not running because the target cpu is running an ISR or
> > > just looping in the kernel).
> > > 
> > > The process that gets 'hit' is anything cpu bound.
> > > Even a shell loop (e.g. while :; do :; done) with a counter will do.
> > > 
> > > Without the 'trigger' program, it will (mostly) sit on one cpu and the
> > > clock frequency of that cpu will increase to (say) 3GHz while the others
> > > all run at 800MHz.
> > > But the 'trigger' program runs threads on all the cpu at the same time.
> > > So the 'hit' program is pre-empted and is later rescheduled on a
> > > different cpu - running at 800MHz.
> > > The cpu speed increases, but 10ms later it gets bounced again.  
> > 
> > Sorry I haven't tried creating this test yet.
> > 
> > > The real issue is that the cpu speed is associated with the cpu, not
> > > the process running on it.  
> > 
> > So if the 'hit' program gets scheduled to a CPU that is running at 3GHz
> > then we don't expect a dramatic performance drop? Setting scaling_governor
> > to "performance" would be an interesting test.
> 
> I failed to find a way to lock the cpu frequency (for other testing) on
> that system (an i7-7xxx) - and the system will start thermally throttling
> if you aren't careful.

i7-7xxx would be Kaby Lake gen, those shouldn't need to deploy BHB clear
mitigation. I am guessing it is the legacy-IBRS mitigation in your case.
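To check which spectre_v2 mitigation the kernel actually deployed on that
system, the standard sysfs vulnerability files should tell you (exact wording
varies by CPU and kernel version):

```shell
# Active Spectre v2 mitigation as reported by the kernel
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
```

If it reports IBRS rather than retpoline/eIBRS, that would fit the
legacy-IBRS theory above.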

What you described looks very similar to the issue fixed by commit:

  aa1567a7e644 ("intel_idle: Add ibrs_off module parameter to force-disable IBRS")

    Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle")
    disables IBRS when the cstate is 6 or lower. However, there are
    some use cases where a customer may want to use max_cstate=1 to
    lower latency. Such use cases will suffer from the performance
    degradation caused by the enabling of IBRS in the sibling idle thread.
    Add a "ibrs_off" module parameter to force disable IBRS and the
    CPUIDLE_FLAG_IRQ_ENABLE flag if set.

    In the case of a Skylake server with max_cstate=1, this new ibrs_off
    option will likely increase the IRQ response latency as IRQ will now
    be disabled.

    When running SPECjbb2015 with cstates set to C1 on a Skylake system.

    First test when the kernel is booted with: "intel_idle.ibrs_off":

      max-jOPS = 117828, critical-jOPS = 66047

    Then retest when the kernel is booted without the "intel_idle.ibrs_off"
    added:

      max-jOPS = 116408, critical-jOPS = 58958

    That means booting with "intel_idle.ibrs_off" improves performance by:

      max-jOPS:      +1.2%, which could be considered noise range.
      critical-jOPS: +12%,  which is definitely a solid improvement.

> ISTR that the hardware does most of the work.
> So I'm not sure what difference "performance" makes (and can't remember what
> might be set for that system - could be set anyway.)

> We did have to disable some of the low power states, waking the cpu from those
> just takes far too long.

Seems like you have a workaround in place already.
