Message-ID: <20200521093938.GG325280@hirez.programming.kicks-ass.net>
Date: Thu, 21 May 2020 11:39:38 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Qian Cai <cai@....pw>, "Paul E. McKenney" <paulmck@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Michael Ellerman <mpe@...erman.id.au>,
linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
Borislav Petkov <bp@...en8.de>
Subject: Re: Endless soft-lockups for compiling workload since next-20200519
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > Just a head up. Repeatedly compiling kernels for a while would trigger
> > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > .config are in,
> >
> > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > not seen anything like that myself. Let me go have a look.
> >
> >
> > In as far as the logs are readable (they're a wrapped mess, please don't
> > do that!), they contain very little useful, as is typical with IPIs :/
> >
> > > [ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > flush_smp_call_function_queue+0x1fa/0x2e0
>
> So I've tried to think of a race that could produce that and here is
> the only thing I could come up with. It's a bit complicated unfortunately:
This:
>     smp_call_function_single_async() {         smp_call_function_single_async() {
>         // verified csd->flags != CSD_LOCK         // verified csd->flags != CSD_LOCK
>         csd->flags = CSD_LOCK                      csd->flags = CSD_LOCK
concurrent smp_call_function_single_async() using the same csd is what
I'm looking at as well. Now in the ILB case there is an easy cure
(because there is only a single ILB target):
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
 	 * is idle. And the softirq performing nohz idle load balance
 	 * will be run before returning from the IPI.
 	 */
-	smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+	smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
 }
 
 /*
Qian, can you give that a spin?
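
For reference, here is a minimal userspace sketch of the check-then-set
race and of why giving each sender its own csd sidesteps it. It is not
kernel code: struct demo_csd, DEMO_CSD_LOCK and the demo_* helpers are
hypothetical stand-ins modelled on the quoted pseudo-code, and the
interleaving is replayed by hand on one thread so the outcome is
deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock bit; stands in for the CSD_LOCK of the quoted text. */
#define DEMO_CSD_LOCK 0x1u

/* Hypothetical, stripped-down stand-in for a call_single_data. */
struct demo_csd {
	unsigned int flags;
	int queued;		/* how many times this csd was enqueued */
};

/* Step 1 of the sender: plain (non-atomic) check that the csd is free. */
static bool demo_csd_is_free(const struct demo_csd *csd)
{
	return !(csd->flags & DEMO_CSD_LOCK);
}

/* Step 2 of the sender: take the lock bit and enqueue the csd. */
static void demo_csd_lock_and_queue(struct demo_csd *csd)
{
	csd->flags |= DEMO_CSD_LOCK;
	csd->queued++;
}

/* Two senders share one csd and both run the check before either locks. */
static int demo_replay_shared_csd(void)
{
	struct demo_csd shared = { 0, 0 };
	bool a_free = demo_csd_is_free(&shared);
	bool b_free = demo_csd_is_free(&shared);

	if (a_free)
		demo_csd_lock_and_queue(&shared);
	if (b_free)
		demo_csd_lock_and_queue(&shared);

	return shared.queued;	/* 2: the same csd was enqueued twice */
}

/* Same interleaving, but each sender owns a private csd. */
static int demo_replay_private_csds(void)
{
	struct demo_csd a = { 0, 0 }, b = { 0, 0 };
	bool a_free = demo_csd_is_free(&a);
	bool b_free = demo_csd_is_free(&b);

	if (a_free)
		demo_csd_lock_and_queue(&a);
	if (b_free)
		demo_csd_lock_and_queue(&b);

	return a.queued > b.queued ? a.queued : b.queued;	/* 1 */
}
```

With the shared csd the replay enqueues the same object twice; with
per-sender csds no object is enqueued more than once, which is the idea
behind switching to this_rq()->nohz_csd above.
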
But I'm still not convinced of your scenario:
> kick_ilb() {
> atomic_fetch_or(...., nohz_flags(0))
> atomic_fetch_or(...., nohz_flags(0)) #VMENTER
>     smp_call_function_single_async() {         smp_call_function_single_async() {
>         // verified csd->flags != CSD_LOCK         // verified csd->flags != CSD_LOCK
>         csd->flags = CSD_LOCK                      csd->flags = CSD_LOCK
Note that we check the return value of atomic_fetch_or() and bail if
someone else set a flag in KICK_MASK before us.
Aah, I suppose you're saying this can happen when:
!(flags & NOHZ_KICK_MASK)
? That's not supposed to happen though.
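
To make the guard concrete, here is a small userspace sketch using C11
atomics. It only illustrates the pattern described above, not the
kernel's kick_ilb(): the DEMO_* flag values, demo_nohz_flags and the
demo_* helpers are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values; the kernel's real NOHZ_* bits may differ. */
#define DEMO_BALANCE_KICK 0x1u
#define DEMO_STATS_KICK   0x2u
#define DEMO_KICK_MASK    (DEMO_BALANCE_KICK | DEMO_STATS_KICK)

/* Stand-in for the nohz flags word of the single ILB target. */
static atomic_uint demo_nohz_flags;

/*
 * Set the requested flags and report whether the caller may send the
 * IPI: only the caller that found no kick flag already pending wins.
 */
static bool demo_try_kick(unsigned int flags)
{
	unsigned int old = atomic_fetch_or(&demo_nohz_flags, flags);

	return !(old & DEMO_KICK_MASK);
}

/* Stand-in for the IPI handler: clear and return the pending kick flags. */
static unsigned int demo_consume_kick(void)
{
	unsigned int old = atomic_fetch_and(&demo_nohz_flags, ~DEMO_KICK_MASK);

	return old & DEMO_KICK_MASK;
}
```

As long as every caller passes at least one bit of the mask, the second
caller sees the first caller's bit and bails. If a caller passes
flags == 0, the fetch_or sets nothing, so any number of such callers
see an empty mask and all of them go on to send, which is the
!(flags & NOHZ_KICK_MASK) hole mentioned above.
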
Anyway, let me go stare at the remote wake-up case, because I'm afraid
that might have the same problem too...