Message-ID: <2b407663-9c9b-45df-8419-25f97f5281e8@ovn.org>
Date: Tue, 10 Dec 2024 01:31:14 +0100
From: Ilya Maximets <i.maximets@....org>
To: Vineeth Remanan Pillai <vineeth@...byteword.org>,
Peter Zijlstra <peterz@...radead.org>
Cc: i.maximets@....org, Joel Fernandes <joel@...lfernandes.org>,
LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, vineethrp@...gle.com,
shraash@...gle.com
Subject: Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995
enqueue_dl_entity (task blocked for more than 28262 seconds)
On 12/9/24 13:34, Ilya Maximets wrote:
> On 12/9/24 13:29, Vineeth Remanan Pillai wrote:
>> On Mon, Dec 9, 2024 at 5:55 AM Peter Zijlstra <peterz@...radead.org> wrote:
>>>
>>> On Fri, Dec 06, 2024 at 11:57:30AM -0500, Vineeth Remanan Pillai wrote:
>>>
>>>> I was able to reproduce this WARN_ON a couple of days back with
>>>> syzkaller. The dlserver's dl_se gets enqueued during an update_curr
>>>> while the dlserver is stopped, and a subsequent dlserver start then
>>>> causes a double enqueue.
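[ For reference: the warning in $subject appears to be the unqueued-entity
  assertion at the top of enqueue_dl_entity(), which a second enqueue of an
  already-queued dl_se would trip. Roughly (paraphrased from v6.12, not a
  verbatim copy):

      static void enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
      {
              /* deadline.c:1995: the entity must not already be queued */
              WARN_ON_ONCE(on_dl_rq(dl_se));
              ...
      }
]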
>>>
>>> Right, I spotted that hole late last week. There is this thread:
>>>
>>> https://lore.kernel.org/all/20241209094941.GF21636@noisy.programming.kicks-ass.net/T/#u
>>>
>>> Where I just added this hunk:
>>>
>>> @@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>>>
>>>  void dl_server_stop(struct sched_dl_entity *dl_se)
>>>  {
>>> +        if (current->dl_server == dl_se) {
>>> +                struct rq *rq = rq_of_dl_se(dl_se);
>>> +                trace_printk("stop fair server %d\n", cpu_of(rq));
>>> +                current->dl_server = NULL;
>>> +        }
>>> +
>>>          if (!dl_se->dl_runtime)
>>>                  return;
>>> Which was my attempt at plugging said hole. But since I do not have
>>> means of reproduction, I'm not at all sure it is sufficient :/
>>>
>> I think I was able to get to the root cause last week. The issue
>> seems to be that the dlserver is stopped in the pick_task path of the
>> dlserver itself when sched_delayed is set:
>> __pick_next_task
>> => pick_task_dl -> server_pick_task
>> => pick_task_fair
>> => pick_next_entity (if (sched_delayed))
>> => dequeue_entities
>> => dl_server_stop
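[ I.e. the delayed-dequeue branch in the fair pick path; from memory of the
  v6.12 code, approximately (not verbatim):

      struct sched_entity *se = pick_eevdf(cfs_rq);

      if (se->sched_delayed) {
              /* the delayed task finally gets dequeued here ... */
              dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
              /* ... and dequeue_entities() ends up in dl_server_stop() */
              return NULL;
      }
      return se;
]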
>>
>> Now server_pick_task returns NULL, so we set dl_yielded and call
>> update_curr_dl_se. But dl_se is already dequeued, so the code gets
>> confused and does all sorts of things, including arming a timer to
>> enqueue it back. This ultimately leads to a double enqueue the next
>> time the dlserver is started (depending on the timing of
>> dl_server_start).
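[ If I read this right, the resulting sequence would be something like
  this (hypothetical timeline, flags and details approximate):

      dl_server_stop(dl_se);             /* dequeued via the delayed-dequeue path        */
      update_curr_dl_se(rq, dl_se, 0);   /* called anyway; may arm the replenish timer   */
        /* ... timer fires ... */
        enqueue_dl_entity(dl_se, ...);   /* first enqueue, while the server is "stopped" */
      dl_server_start(dl_se);
        enqueue_dl_entity(dl_se, ...);   /* second enqueue -> WARN at deadline.c:1995    */
]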
>>
>> I think we should not call update_curr_dl_se when the dlserver is
>> dequeued. Based on this I have a small patch and it seems to solve the
>> issue:
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index d9d5a702f1a6..a9f3f020e421 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2419,12 +2419,18 @@ static struct task_struct *__pick_task_dl(struct rq *rq)
>>
>>          if (dl_server(dl_se)) {
>>                  p = dl_se->server_pick_task(dl_se);
>> -                if (!p) {
>> +                if (p) {
>> +                        rq->dl_server = dl_se;
>> +                } else if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
>> +                        /*
>> +                         * If server_pick_task returns NULL and dlserver is
>> +                         * enqueued, we have a problem. Let's yield and do a
>> +                         * pick again.
>> +                         */
>>                          dl_se->dl_yielded = 1;
>>                          update_curr_dl_se(rq, dl_se, 0);
>>                          goto again;
>>                  }
>> -                rq->dl_server = dl_se;
>>          } else {
>>                  p = dl_task_of(dl_se);
>>          }
>>
>> I can send a formal patch if this looks okay to you all.
>
> Thanks!
>
> I can try this out on my setup overnight (it takes a long time to
> reproduce the issue there).
So, I tried applying this one on top of v6.12 and I got the following after
about 20 minutes of testing:
Dec 9 19:08:31 kernel: ------------[ cut here ]------------
Dec 9 19:08:31 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [handler37:428139]
Dec 9 19:08:31 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 9 19:08:31 kernel: rcu: #0110-...0: (1 GPs behind) idle=e41c/1/0x4000000000000000 softirq=186760/186760 fqs=7457
Dec 9 19:08:31 kernel: rcu: #011(detected by 27, t=60002 jiffies, g=913205, q=503974 ncpus=40)
Dec 9 19:08:31 kernel: Sending NMI from CPU 27 to CPUs 0:
Dec 9 19:08:31 kernel: rcu: rcu_preempt kthread starved for 40001 jiffies! g913205 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
Dec 9 19:08:31 kernel: rcu: #011Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Dec 9 19:08:31 kernel: rcu: RCU grace-period kthread stack dump:
Dec 9 19:08:31 kernel: task:rcu_preempt state:R running task stack:0 pid:18 tgid:18 ppid:2 flags:0x00004000
Dec 9 19:08:31 kernel: Call Trace:
Dec 9 19:08:31 kernel: <TASK>
Dec 9 19:08:31 kernel: ? lock_timer_base (kernel/time/timer.c:1051)
Dec 9 19:08:31 kernel: ? _raw_spin_lock (./arch/x86/include/asm/paravirt.h:584 ./arch/x86/include/asm/qspinlock.h:51 ./include/asm-generic/qspinlock.h:114 ./include/linux/spinlock.h:187 ./include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
Dec 9 19:08:31 kernel: ? raw_spin_rq_lock_nested (./arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:600)
Dec 9 19:08:31 kernel: ? resched_cpu (./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 ./include/linux/cpumask.h:570 ./include/linux/cpumask.h:1117 kernel/sched/core.c:1109)
Dec 9 19:08:31 kernel: ? force_qs_rnp (kernel/rcu/tree.c:2734 (discriminator 6))
Dec 9 19:08:31 kernel: ? __pfx_rcu_watching_snap_recheck (kernel/rcu/tree.c:816)
Dec 9 19:08:31 kernel: ? __pfx_rcu_gp_kthread (kernel/rcu/tree.c:2222)
Dec 9 19:08:31 kernel: ? rcu_gp_fqs_loop (kernel/rcu/tree.c:2004 kernel/rcu/tree.c:2067)
Dec 9 19:08:31 kernel: ? rcu_gp_kthread (kernel/rcu/tree.c:2250)
Dec 9 19:08:31 kernel: ? kthread (kernel/kthread.c:389)
Dec 9 19:08:31 kernel: ? __pfx_kthread (kernel/kthread.c:342)
Dec 9 19:08:31 kernel: ? ret_from_fork (arch/x86/kernel/process.c:147)
Dec 9 19:08:31 kernel: ? __pfx_kthread (kernel/kthread.c:342)
Dec 9 19:08:31 kernel: ? ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
Dec 9 19:08:31 kernel: </TASK>
Dec 9 19:08:31 kernel: rcu: Stack dump where RCU GP kthread last ran:
Dec 9 19:08:31 kernel: Sending NMI from CPU 27 to CPUs 15:
Dec 9 19:09:49 systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Dec 9 19:09:49 systemd[1]: systemd-udevd.service: Killing process 1087 (systemd-udevd) with signal SIGABRT.
Dec 9 19:11:19 systemd[1]: systemd-udevd.service: State 'stop-watchdog' timed out. Killing.
Dec 9 19:11:19 systemd[1]: systemd-udevd.service: Killing process 1087 (systemd-udevd) with signal SIGKILL.
Dec 9 19:11:51 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 9 19:11:51 kernel: rcu: #0110-...0: (1 GPs behind) idle=e41c/1/0x4000000000000002 softirq=186760/186760 fqs=7457
Dec 9 19:11:51 kernel: rcu: #011(detected by 5, t=260010 jiffies, g=913205, q=2087901 ncpus=40)
Dec 9 19:11:51 kernel: Sending NMI from CPU 5 to CPUs 0:
Dec 9 19:11:51 kernel: rcu: rcu_preempt kthread starved for 240008 jiffies! g913205 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=15
Dec 9 19:11:51 kernel: rcu: #011Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
Dec 9 19:11:51 kernel: rcu: RCU grace-period kthread stack dump:
Dec 9 19:11:51 kernel: task:rcu_preempt state:R running task stack:0 pid:18 tgid:18 ppid:2 flags:0x00004000
Dec 9 19:11:51 kernel: Call Trace:
Dec 9 19:11:51 kernel: <TASK>
Dec 9 19:11:51 kernel: ? lock_timer_base (kernel/time/timer.c:1051)
Dec 9 19:11:51 kernel: ? _raw_spin_lock (./arch/x86/include/asm/paravirt.h:584 ./arch/x86/include/asm/qspinlock.h:51 ./include/asm-generic/qspinlock.h:114 ./include/linux/spinlock.h:187 ./include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
Dec 9 19:11:51 kernel: ? raw_spin_rq_lock_nested (./arch/x86/include/asm/preempt.h:84 kernel/sched/core.c:600)
Dec 9 19:11:51 kernel: ? resched_cpu (./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 ./include/linux/cpumask.h:570 ./include/linux/cpumask.h:1117 kernel/sched/core.c:1109)
Dec 9 19:11:51 kernel: ? force_qs_rnp (kernel/rcu/tree.c:2734 (discriminator 6))
Dec 9 19:11:51 kernel: ? __pfx_rcu_watching_snap_recheck (kernel/rcu/tree.c:816)
Dec 9 19:11:51 kernel: ? __pfx_rcu_gp_kthread (kernel/rcu/tree.c:2222)
Dec 9 19:11:51 kernel: ? rcu_gp_fqs_loop (kernel/rcu/tree.c:2004 kernel/rcu/tree.c:2067)
Dec 9 19:11:51 kernel: ? rcu_gp_kthread (kernel/rcu/tree.c:2250)
Dec 9 19:11:51 kernel: ? kthread (kernel/kthread.c:389)
Dec 9 19:11:51 kernel: ? __pfx_kthread (kernel/kthread.c:342)
Dec 9 19:11:51 kernel: ? ret_from_fork (arch/x86/kernel/process.c:147)
Dec 9 19:11:51 kernel: ? __pfx_kthread (kernel/kthread.c:342)
Dec 9 19:11:51 kernel: ? ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
Dec 9 19:11:51 kernel: </TASK>
Dec 9 19:11:51 kernel: rcu: Stack dump where RCU GP kthread last ran:
Dec 9 19:11:51 kernel: Sending NMI from CPU 5 to CPUs 15:
Dec 9 19:12:49 systemd[1]: systemd-udevd.service: Processes still around after SIGKILL. Ignoring.
Dec 9 19:14:20 systemd[1]: systemd-udevd.service: State 'final-sigterm' timed out. Killing.
An interesting part here is that CPU 15 doesn't even react to the NMI...
I had to kill the system as it became completely unresponsive.
Best regards, Ilya Maximets.