Message-ID: <e12e0933-d8a5-4659-9fea-3413e3b2374d@paulmck-laptop>
Date: Wed, 28 Aug 2024 06:03:31 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
sfr@...b.auug.org.au, linux-next@...r.kernel.org,
kernel-team@...a.com, Chen Yu <yu.c.chen@...el.com>
Subject: Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error

On Wed, Aug 28, 2024 at 02:35:45PM +0200, Valentin Schneider wrote:
> On 27/08/24 13:36, Paul E. McKenney wrote:
> > On Tue, Aug 27, 2024 at 10:30:24PM +0200, Valentin Schneider wrote:
> >> On 27/08/24 11:35, Paul E. McKenney wrote:
> >> > On Tue, Aug 27, 2024 at 10:33:13AM -0700, Paul E. McKenney wrote:
> >> >> On Tue, Aug 27, 2024 at 05:41:52PM +0200, Valentin Schneider wrote:
> >> >> > I've taken tip/sched/core and shuffled hunks around; I didn't re-order any
> >> >> > commit. I've also taken out the dequeue from switched_from_fair() and put
> >> >> > it at the very top of the branch which should hopefully help bisection.
> >> >> >
> >> >> > The final delta between that branch and tip/sched/core is empty, so it
> >> >> > really is just shuffling in between commits.
> >> >> >
> >> >> > Please find the branch at:
> >> >> >
> >> >> > https://gitlab.com/vschneid/linux.git -b mainline/sched/eevdf-complete-builderr
> >> >> >
> >> >> > I'll go stare at the BUG itself now.
> >> >>
> >> >> Thank you!
> >> >>
> >> >> I have fired up tests on the "BROKEN?" commit. If that fails, I will
> >> >> try its predecessor, and if that fails, I will bisect from e28b5f8bda01
> >> >> ("sched/fair: Assert {set_next,put_prev}_entity() are properly balanced"),
> >> >> which has stood up to heavy hammering in earlier testing.
> >> >
> >> > And 50 runs of TREE03 on the "BROKEN?" commit resulted in 32 failures.
> >> > Of these, 29 were the dequeue_rt_stack() failure. Two more were RCU
> >> > CPU stall warnings, and the last one was an oddball "kernel BUG at
> >> > kernel/sched/rt.c:1714" followed by an equally oddball "Oops: invalid
> >> > opcode: 0000 [#1] PREEMPT SMP PTI".
> >> >
> >> > Just to be specific, this is commit:
> >> >
> >> > df8fe34bfa36 ("BROKEN? sched/fair: Dequeue sched_delayed tasks when switching from fair")
> >> >
> >> > This commit's predecessor is this commit:
> >> >
> >> > 2f888533d073 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> >> >
> >> > This predecessor commit passes 50 runs of TREE03 with no failures.
> >> >
> >> > So the addition of that dequeue_task() call to the switched_from_fair()
> >> > function is looking quite suspicious to me. ;-)
> >> >
> >> > Thanx, Paul
> >>
> >> Thanks for the testing!
> >>
> >> The WARN_ON_ONCE(!rt_se->on_list); hit in __dequeue_rt_entity() feels like
> >> a put_prev/set_next kind of issue...
> >>
> >> So far I'd assumed a ->sched_delayed task can't be current during
> >> switched_from_fair(), I got confused because it's Mond^CCC Tuesday, but I
> >> think that still holds: we can't get a balance_dl() or balance_rt() to drop
> >> the RQ lock because prev would be fair, and we can't get a
> >> newidle_balance() with a ->sched_delayed task because we'd have
> >> sched_fair_runnable() := true.
> >>
> >> I'll pick this back up tomorrow, this is a task that requires either
> >> caffeine or booze and it's too late for either.
> >
> > Thank you for chasing this, and get some sleep! This one is of course
> > annoying, but it is not (yet) an emergency. I look forward to seeing
> > what you come up with.
> >
> > Also, I would of course be happy to apply debug patches.
> >
> > Thanx, Paul
>
> Chen Yu made me realize [1] that dequeue_task() really isn't enough; the
> dequeue_task() in e.g. __sched_setscheduler() won't have DEQUEUE_DELAYED,
> so stuff will just be left on the CFS tree.
>
> Worse, what we need here is the __block_task() like we have at the end of
> dequeue_entities(), otherwise p stays ->on_rq and that's borked - AFAICT
> that explains the splat you're getting, because affine_move_task() ends up
> doing a move_queued_task() for what really is a dequeued task.

Sounds like something that *I* would do! ;-)

> I unfortunately couldn't reproduce the issue locally using your TREE03
> invocation. I've pushed a new patch on top of my branch, would you mind
> giving it a spin? It's a bit sketchy but should at least be going in the
> right direction...
>
> [1]: http://lore.kernel.org/r/Zs2d2aaC/zSyR94v@chenyu5-mobl2

Thank you!

I just now fired it up on 50*TREE03. If that passes, I will let you
know and also fire up 500*TREE03.

Thanx, Paul