Message-ID: <9e1a3be7-839a-44fb-9d10-82784581f7a0@paulmck-laptop>
Date: Wed, 3 Apr 2024 11:05:11 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
	Anna-Maria Behnsen <anna-maria@...utronix.de>
Subject: Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's
 enqueue in nohz_full

On Tue, Apr 02, 2024 at 09:47:37AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2024 at 05:04:10PM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 01, 2024 at 11:56:36PM +0200, Frederic Weisbecker wrote:
> > > > On Mon, Apr 01, 2024 at 02:26:25PM -0700, Paul E. McKenney wrote:
> > > > > > _ The RCU CPU Stall report. I strongly suspect the cause is the hrtimer
> > > > > >   enqueue to an offline CPU. Let's solve that and we'll see if it still
> > > > > >   triggers.
> > > > > 
> > > > > Sounds like a plan!
> > > > 
> > > > Just checking in on this one.  I did reproduce your RCU CPU stall report
> > > > and also saw a TREE03 OOM that might (or might not) be related.  Please
> > > > let me know if hammering TREE03 harder or adding some debug would help.
> > > > Otherwise, I will assume that you are getting sufficient bug reports
> > > > from your own testing to be getting along with.
> > > 
> > > Hehe, there are a lot indeed :-)
> > > 
> > > So there has been some discussion on CPUSET VS Hotplug, as a problem there
> > > is likely the cause of the hrtimer warning you saw, which in turn might
> > > be the cause of the RCU stalls.
> > > 
> > > > Do you always see the hrtimer warning along with the RCU stalls? Because if so, this
> > > might help:
> > > https://lore.kernel.org/lkml/20240401145858.2656598-1-longman@redhat.com/T/#m1bed4d298715d1a6b8289ed48e9353993c63c896
> > 
> > Not always, but why not give it a shot?
> 
> And no failures, though I would need to run much longer for this to
> mean much.  These were wide-spectrum tests, so my next step will be to
> run only TREE03 and TREE07.

And 600 hours each of TREE03 and TREE07 got me a single TREE07 instance
of the sched_tick_remote() failure.  This one:

	WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);

But this is just rcutorture testing out "short" 14-second stalls, which
can only be expected to trigger this from time to time.  The point of
this stall is to test the evasive actions that RCU takes when 50% of
the way to the RCU CPU stall timeout.
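
To make the arithmetic concrete, here is a minimal userspace sketch of
that comparison.  It is not the kernel code: tick_delta_too_large() and
tick_delta_example() are hypothetical names, and it assumes that the
delta seen by the check grows to roughly the length of the stall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical userspace model of the check quoted above.
 * Returns true when delta_ns exceeds limit_sec seconds.
 */
bool tick_delta_too_large(uint64_t delta_ns, unsigned int limit_sec)
{
	return delta_ns > (uint64_t)NSEC_PER_SEC * limit_sec;
}

/*
 * Assumption: a 14-second stall yields a delta of about 14 seconds.
 * That trips a 3-second limit but stays under a 15-second one.
 */
void tick_delta_example(void)
{
	uint64_t stall_ns = 14 * NSEC_PER_SEC;

	assert(tick_delta_too_large(stall_ns, 3));
	assert(!tick_delta_too_large(stall_ns, 15));
}
```

Under that assumption, a 15-second limit leaves only one second of
headroom over the 14-second test stall.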

One approach would be to increase that "3" to "15", but that sounds
quite fragile.  Another would be for rcutorture to communicate the fact
that stall testing is in progress, and then this WARN_ON_ONCE() could
silence itself in that case.
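
For illustration only, the second option might look something like the
userspace sketch below.  All of the names (stall_test_in_progress,
stall_test_begin(), stall_test_end(), should_warn_on_delta()) are
hypothetical and do not refer to any existing kernel interface.  A real
version would also need proper cross-CPU access to the flag, which this
plain bool does not model.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical flag that the stall test would set around its stalls. */
static bool stall_test_in_progress;

void stall_test_begin(void)
{
	stall_test_in_progress = true;
}

void stall_test_end(void)
{
	stall_test_in_progress = false;
}

/*
 * Model of the warning condition: stay quiet while a deliberate
 * stall test is running, otherwise apply the 3-second limit.
 */
bool should_warn_on_delta(uint64_t delta_ns)
{
	if (stall_test_in_progress)
		return false;
	return delta_ns > (uint64_t)NSEC_PER_SEC * 3;
}
```

With this shape, a 14-second delta warns as before outside of stall
testing and is ignored between stall_test_begin() and stall_test_end().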

But is there a better approach?

							Thanx, Paul
