[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9c66e90f-34bd-40e3-bd34-4c218ad6850a@paulmck-laptop>
Date: Wed, 28 Aug 2024 06:40:44 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
sfr@...b.auug.org.au, linux-next@...r.kernel.org,
kernel-team@...a.com, Chen Yu <yu.c.chen@...el.com>
Subject: Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error
On Wed, Aug 28, 2024 at 06:03:31AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 28, 2024 at 02:35:45PM +0200, Valentin Schneider wrote:
> > On 27/08/24 13:36, Paul E. McKenney wrote:
> > > On Tue, Aug 27, 2024 at 10:30:24PM +0200, Valentin Schneider wrote:
> > >> On 27/08/24 11:35, Paul E. McKenney wrote:
> > >> > On Tue, Aug 27, 2024 at 10:33:13AM -0700, Paul E. McKenney wrote:
> > >> >> On Tue, Aug 27, 2024 at 05:41:52PM +0200, Valentin Schneider wrote:
> > >> >> > I've taken tip/sched/core and shuffled hunks around; I didn't re-order any
> > >> >> > commit. I've also taken out the dequeue from switched_from_fair() and put
> > >> >> > it at the very top of the branch which should hopefully help bisection.
> > >> >> >
> > >> >> > The final delta between that branch and tip/sched/core is empty, so it
> > >> >> > really is just shuffling inbetween commits.
> > >> >> >
> > >> >> > Please find the branch at:
> > >> >> >
> > >> >> > https://gitlab.com/vschneid/linux.git -b mainline/sched/eevdf-complete-builderr
> > >> >> >
> > >> >> > I'll go stare at the BUG itself now.
> > >> >>
> > >> >> Thank you!
> > >> >>
> > >> >> I have fired up tests on the "BROKEN?" commit. If that fails, I will
> > >> >> try its predecessor, and if that fails, I wlll bisect from e28b5f8bda01
> > >> >> ("sched/fair: Assert {set_next,put_prev}_entity() are properly balanced"),
> > >> >> which has stood up to heavy hammering in earlier testing.
> > >> >
> > >> > And of 50 runs of TREE03 on the "BROKEN?" commit resulted in 32 failures.
> > >> > Of these, 29 were the dequeue_rt_stack() failure. Two more were RCU
> > >> > CPU stall warnings, and the last one was an oddball "kernel BUG at
> > >> > kernel/sched/rt.c:1714" followed by an equally oddball "Oops: invalid
> > >> > opcode: 0000 [#1] PREEMPT SMP PTI".
> > >> >
> > >> > Just to be specific, this is commit:
> > >> >
> > >> > df8fe34bfa36 ("BROKEN? sched/fair: Dequeue sched_delayed tasks when switching from fair")
> > >> >
> > >> > This commit's predecessor is this commit:
> > >> >
> > >> > 2f888533d073 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
> > >> >
> > >> > This predecessor commit passes 50 runs of TREE03 with no failures.
> > >> >
> > >> > So that addition of that dequeue_task() call to the switched_from_fair()
> > >> > function is looking quite suspicious to me. ;-)
> > >> >
> > >> > Thanx, Paul
> > >>
> > >> Thanks for the testing!
> > >>
> > >> The WARN_ON_ONCE(!rt_se->on_list); hit in __dequeue_rt_entity() feels like
> > >> a put_prev/set_next kind of issue...
> > >>
> > >> So far I'd assumed a ->sched_delayed task can't be current during
> > >> switched_from_fair(), I got confused because it's Mond^CCC Tuesday, but I
> > >> think that still holds: we can't get a balance_dl() or balance_rt() to drop
> > >> the RQ lock because prev would be fair, and we can't get a
> > >> newidle_balance() with a ->sched_delayed task because we'd have
> > >> sched_fair_runnable() := true.
> > >>
> > >> I'll pick this back up tomorrow, this is a task that requires either
> > >> caffeine or booze and it's too late for either.
> > >
> > > Thank you for chasing this, and get some sleep! This one is of course
> > > annoying, but it is not (yet) an emergency. I look forward to seeing
> > > what you come up with.
> > >
> > > Also, I would of course be happy to apply debug patches.
> > >
> > > Thanx, Paul
> >
> > Chen Yu made me realize [1] that dequeue_task() really isn't enough; the
> > dequeue_task() in e.g. __sched_setscheduler() won't have DEQUEUE_DELAYED,
> > so stuff will just be left on the CFS tree.
> >
> > Worse, what we need here is the __block_task() like we have at the end of
> > dequeue_entities(), otherwise p stays ->on_rq and that's borked - AFAICT
> > that explains the splat you're getting, because affine_move_task() ends up
> > doing a move_queued_task() for what really is a dequeued task.
>
> Sounds like something that *I* would do! ;-)
>
> > I unfortunately couldn't reproduce the issue locally using your TREE03
> > invocation. I've pushed a new patch on top of my branch, would you mind
> > giving it a spin? It's a bit sketchy but should at least be going in the
> > right direction...
> >
> > [1]: http://lore.kernel.org/r/Zs2d2aaC/zSyR94v@chenyu5-mobl2
>
> Thank you!
>
> I just now fired it up on 50*TREE03. If that passes, I will let you
> know and also fire up 500*TREE03.
The good news is that there are no dequeue_rt_stack() failures.
The not-quite-so-good news is that there were 26 failures out of 50
runs, several of which are RCU CPU or rcutorture-writer stall warnings,
but the most frequent being splats like the one shown below. (If you
really want, I would be happy to send you the full set.)
In case it helps, this is my reproducer:
tools/testing/selftests/rcutorture/bin/kvm-remote.sh "<list of 20 system names>" --cpus 80 --duration 5m --configs "50*TREE03"
Each of the 20 system names is a 80-CPU system that can be reached with "ssh".
Though I can also reproduce on my laptop, it just takes a bit longer to
run 50 instances of TREE03. ;-)
On the other hand, this kvm-remote.sh run took only 11 minutes wall-clock
time, so testing your patches works fine for me, give or take timezone
issues.
Thanx, Paul
------------------------------------------------------------------------
[ 1.636950] BUG: kernel NULL pointer dereference, address: 0000000000000051
[ 1.637886] #PF: supervisor read access in kernel mode
[ 1.637886] #PF: error_code(0x0000) - not-present page
[ 1.637886] PGD 0 P4D 0
[ 1.637886] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[ 1.637886] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc1-00050-ge8593c21265a #1698
[ 1.637886] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 1.637886] RIP: 0010:pick_task_fair+0x26/0xb0
[ 1.637886] Code: 90 90 90 90 f3 0f 1e fa 8b 57 50 85 d2 0f 84 93 00 00 00 55 48 8d 6f 40 53 48 89 fb 48 83 ec 08 48 89 ef eb 1c e8 6a be ff ff <80> 78 51 00 75 38 48 85 c0 74 43 48 8b b8 a8 00 00 00 48 85 ff 74
[ 1.637886] RSP: 0000:ffffa5fb00013b78 EFLAGS: 00010082
[ 1.637886] RAX: 0000000000000000 RBX: ffff95271f22c200 RCX: 0000000000000800
[ 1.637886] RDX: ed73115ce0430400 RSI: 0000000000000800 RDI: 0000db7ef943c3be
[ 1.637886] RBP: ffff95271f22c240 R08: 0000000000000000 R09: 0000000000000000
[ 1.637886] R10: fffe89e124bd80db R11: 0000000000000000 R12: ffffffffaa00e9b0
[ 1.637886] R13: ffff9527011e8000 R14: 0000000000000000 R15: ffff9527011e8000
[ 1.637886] FS: 0000000000000000(0000) GS:ffff95271f200000(0000) knlGS:0000000000000000
[ 1.637886] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.637886] CR2: 0000000000000051 CR3: 0000000008e2e000 CR4: 00000000000006f0
[ 1.637886] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.637886] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.637886] Call Trace:
[ 1.637886] <TASK>
[ 1.637886] ? __die+0x1f/0x70
[ 1.637886] ? page_fault_oops+0x155/0x440
[ 1.637886] ? search_extable+0x26/0x30
[ 1.637886] ? pick_task_fair+0x26/0xb0
[ 1.637886] ? search_exception_tables+0x37/0x50
[ 1.637886] ? exc_page_fault+0x69/0x150
[ 1.637886] ? asm_exc_page_fault+0x26/0x30
[ 1.637886] ? pick_task_fair+0x26/0xb0
[ 1.637886] ? pick_task_fair+0x26/0xb0
[ 1.637886] pick_next_task_fair+0x40/0x2e0
[ 1.637886] __schedule+0x106/0x8b0
[ 1.637886] ? hrtimer_update_next_event+0x70/0x90
[ 1.637886] schedule+0x22/0xd0
[ 1.637886] schedule_hrtimeout_range_clock+0xa8/0x120
[ 1.637886] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 1.637886] wait_task_inactive+0x1ac/0x1c0
[ 1.637886] __kthread_bind_mask+0x13/0x60
[ 1.637886] kthread_create_on_cpu+0x54/0x80
[ 1.637886] __smpboot_create_thread.part.0+0x60/0x130
[ 1.637886] smpboot_create_threads+0x5c/0x90
[ 1.637886] ? __pfx_smpboot_create_threads+0x10/0x10
[ 1.637886] cpuhp_invoke_callback+0x2cd/0x470
[ 1.637886] ? __pfx_trace_rb_cpu_prepare+0x10/0x10
[ 1.637886] __cpuhp_invoke_callback_range+0x71/0xe0
[ 1.637886] _cpu_up+0xee/0x1d0
[ 1.637886] cpu_up+0x88/0xb0
[ 1.637886] cpuhp_bringup_mask+0x47/0xb0
[ 1.637886] bringup_nonboot_cpus+0xca/0xf0
[ 1.637886] smp_init+0x25/0x80
[ 1.637886] kernel_init_freeable+0xd6/0x2d0
[ 1.637886] ? __pfx_kernel_init+0x10/0x10
[ 1.637886] kernel_init+0x15/0x1c0
[ 1.637886] ret_from_fork+0x2f/0x50
[ 1.637886] ? __pfx_kernel_init+0x10/0x10
[ 1.637886] ret_from_fork_asm+0x1a/0x30
[ 1.637886] </TASK>
[ 1.637886] Modules linked in:
[ 1.637886] CR2: 0000000000000051
[ 1.637886] ---[ end trace 0000000000000000 ]---
Powered by blists - more mailing lists