Date: Wed, 20 Mar 2024 17:15:48 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
	Anna-Maria Behnsen <anna-maria@...utronix.de>
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's
 enqueue in nohz_full

On Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > While running in nohz_full mode, a task may enqueue a timer while the
> > > > tick is stopped. However, the only places where the timer wheel,
> > > > alongside the timer migration machinery's decision, may reprogram the
> > > > next event according to that new timer's expiry are the idle loop and
> > > > any IRQ tail.
> > > 
> > > > However, neither the idle task nor an interrupt may run on the CPU if it
> > > resumes busy work in userspace for a long while in full dynticks mode.
> > > 
> > > To solve this, the timer enqueue path raises a self-IPI that will
> > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > avoids potential locking inversion.
> > > 
> > > This is supposed to happen both for local and global timers but commit:
> > > 
> > > 	b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > > 
> > > > broke the global timers case by removing the ->is_idle field handling
> > > > for the global base. As a result, global timer enqueues may go unnoticed
> > > > in nohz_full.
> > > 
> > > > Fix this by restoring the idle tracking of the global timer's base,
> > > > allowing self-IPIs again at enqueue time.
> > 
> > Testing with the previous patch (1/2 in this series) reduced the number of
> > problems by about an order of magnitude, down to two sched_tick_remote()
> > instances and one enqueue_hrtimer() instance, very good!
> > 
> > I have kicked off a test including this patch.  Here is hoping!  ;-)
> 
> And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
> complaint and another run with a starved RCU grace-period kthread.
> So this is definitely getting more reliable, but still a little ways
> to go.

Right, there is clearly something else. Investigation continues...
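
For readers following along, the enqueue-side behaviour described in the
quoted changelog can be modelled in userspace roughly as below. This is only
a sketch under stated assumptions, not the kernel code: every name
(model_timer_base, model_tick_stop, model_enqueue, the track_global_idle
switch) is hypothetical, and the self-IPI is represented by a simple counter.
It only illustrates why a base whose idle state is not tracked never
requests a re-evaluation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct model_timer_base {
	bool is_idle;           /* tick was stopped based on this base's state */
	uint64_t next_expiry;   /* earliest expiry currently known for this base */
	unsigned int self_ipis; /* stands in for self-IPIs raised on enqueue */
};

struct model_cpu {
	struct model_timer_base local;
	struct model_timer_base global;
};

/*
 * Model of stopping the tick. With track_global_idle == false the global
 * base is never marked idle, which mimics the regression described in the
 * changelog. With true, idle tracking of the global base is restored.
 */
static void model_tick_stop(struct model_cpu *cpu, bool track_global_idle)
{
	assert(cpu != NULL);
	cpu->local.is_idle = true;
	cpu->global.is_idle = track_global_idle;
}

/*
 * Model of the enqueue path. Returns true when a self-IPI would be raised
 * so that the next event gets re-evaluated on the IRQ tail.
 */
static bool model_enqueue(struct model_timer_base *base, uint64_t expires)
{
	assert(base != NULL);

	/* A timer that is not the new earliest one changes nothing. */
	if (expires >= base->next_expiry)
		return false;

	base->next_expiry = expires;

	/* Without idle tracking, nobody is told about the earlier expiry. */
	if (!base->is_idle)
		return false;

	base->self_ipis++;
	return true;
}
```

In this model, enqueueing an earlier timer on the global base after
model_tick_stop(&cpu, false) updates next_expiry but raises nothing, while
the same enqueue after model_tick_stop(&cpu, true) bumps self_ipis.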
