[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5e3b9b29-8d17-6a04-f2a5-1748a52d1a6a@amd.com>
Date: Wed, 31 Jul 2024 11:46:48 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>
CC: <oe-lkp@...ts.linux.dev>, <lkp@...el.com>, <linux-kernel@...r.kernel.org>,
<aubrey.li@...ux.intel.com>, <yu.c.chen@...el.com>, kernel test robot
<oliver.sang@...el.com>
Subject: Re: [peterz-queue:sched/prep] [sched/fair] 124c8f4374:
WARNING:at_kernel/sched/sched.h:#update_load_avg
Hello Peter, Oliver,
P.S. I see that peterz-queue has been updated recently so I'm not sure
if these observations still hold but if they do ...
On 7/29/2024 11:44 AM, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed "WARNING:at_kernel/sched/sched.h:#update_load_avg" on:
>
> commit: 124c8f43740f3f1e3e2aeecd976bcf5ddd7bd435 ("sched/fair: Implement delayed dequeue")
> https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git sched/prep
>
> in testcase: boot
>
> compiler: clang-18
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> we noticed the issue does not always happen, 60 times out of 150 runs as below.
> but the parent keeps clean.
>
> a1133ac6dc7c6bde 124c8f43740f3f1e3e2aeecd976
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :150 40% 60:150 dmesg.WARNING:at_kernel/sched/sched.h:#update_load_avg
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@...el.com>
> | Closes: https://lore.kernel.org/oe-lkp/202407291328.761441a7-oliver.sang@intel.com
>
>
> [ 13.522527][ C0] ------------[ cut here ]------------
> [ 13.523046][ C0] rq->clock_update_flags < RQCF_ACT_SKIP
> [ 13.523055][ C0] WARNING: CPU: 0 PID: 158 at kernel/sched/sched.h:1550 update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
> [ 13.524265][ C0] Modules linked in: drm fuse ip_tables
> [ 13.524722][ C0] CPU: 0 UID: 0 PID: 158 Comm: openipmi Not tainted 6.10.0-12745-g124c8f43740f #1
> [ 13.525464][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 13.526307][ C0] RIP: 0010:update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
> [ 13.526793][ C0] Code: 00 4c 2b bb 88 01 00 00 40 f6 c5 02 0f 84 b7 fa ff ff e9 ca fa ff ff c6 05 b9 1b e4 01 01 48 c7 c7 34 74 d6 a6 e8 a1 c7 fa ff <0f> 0b e9 78 fa ff ff 65 8b 0d 8f 83 91 5a 89 c9 48 0f a3 0d e9 7b
> All code
> ========
> 0: 00 4c 2b bb add %cl,-0x45(%rbx,%rbp,1)
> 4: 88 01 mov %al,(%rcx)
> 6: 00 00 add %al,(%rax)
> 8: 40 f6 c5 02 test $0x2,%bpl
> c: 0f 84 b7 fa ff ff je 0xfffffffffffffac9
> 12: e9 ca fa ff ff jmpq 0xfffffffffffffae1
> 17: c6 05 b9 1b e4 01 01 movb $0x1,0x1e41bb9(%rip) # 0x1e41bd7
> 1e: 48 c7 c7 34 74 d6 a6 mov $0xffffffffa6d67434,%rdi
> 25: e8 a1 c7 fa ff callq 0xfffffffffffac7cb
> 2a:* 0f 0b ud2 <-- trapping instruction
> 2c: e9 78 fa ff ff jmpq 0xfffffffffffffaa9
> 31: 65 8b 0d 8f 83 91 5a mov %gs:0x5a91838f(%rip),%ecx # 0x5a9183c7
> 38: 89 c9 mov %ecx,%ecx
> 3a: 48 rex.W
> 3b: 0f .byte 0xf
> 3c: a3 .byte 0xa3
> 3d: 0d .byte 0xd
> 3e: e9 .byte 0xe9
> 3f: 7b .byte 0x7b
>
> Code starting with the faulting instruction
> ===========================================
> 0: 0f 0b ud2
> 2: e9 78 fa ff ff jmpq 0xfffffffffffffa7f
> 7: 65 8b 0d 8f 83 91 5a mov %gs:0x5a91838f(%rip),%ecx # 0x5a91839d
> e: 89 c9 mov %ecx,%ecx
> 10: 48 rex.W
> 11: 0f .byte 0xf
> 12: a3 .byte 0xa3
> 13: 0d .byte 0xd
> 14: e9 .byte 0xe9
> 15: 7b .byte 0x7b
> [ 13.528234][ C0] RSP: 0000:ffffbaef40003d90 EFLAGS: 00010046
> [ 13.528723][ C0] RAX: b53834d2850b7b00 RBX: ffff9ef2efd34640 RCX: 0000000000000027
> [ 13.529379][ C0] RDX: 0000000000000002 RSI: 00000000ffff7fff RDI: ffff9ef2efc20b88
> [ 13.530029][ C0] RBP: 0000000000000001 R08: 0000000000007fff R09: ffffffffa7053610
> [ 13.530685][ C0] R10: 0000000000017ffd R11: 0000000000000004 R12: ffff9ef2efd345c0
> [ 13.531344][ C0] R13: ffff9ef2efd34640 R14: ffff9eefc0e26400 R15: 0000000000000000
> [ 13.532001][ C0] FS: 0000000000000000(0000) GS:ffff9ef2efc00000(0063) knlGS:00000000f7f5f180
> [ 13.532720][ C0] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> [ 13.533242][ C0] CR2: 00000000577c51ac CR3: 0000000162e6c000 CR4: 00000000000406f0
> [ 13.533891][ C0] Call Trace:
> [ 13.534215][ C0] <IRQ>
> [ 13.534504][ C0] ? __warn (kernel/panic.c:240 kernel/panic.c:735)
> [ 13.534863][ C0] ? update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
> [ 13.535278][ C0] ? report_bug (lib/bug.c:? lib/bug.c:219)
> [ 13.535695][ C0] ? handle_bug (arch/x86/kernel/traps.c:239)
> [ 13.536079][ C0] ? exc_invalid_op (arch/x86/kernel/traps.c:260)
> [ 13.536498][ C0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
> [ 13.536920][ C0] ? update_load_avg (kernel/sched/sched.h:1550 kernel/sched/pelt.h:67 kernel/sched/pelt.h:174 kernel/sched/fair.c:4747)
> [ 13.537334][ C0] dequeue_entity (kernel/sched/sched.h:817 kernel/sched/fair.c:5429)
> [ 13.537729][ C0] dequeue_entities (kernel/sched/fair.c:6972)
> [ 13.538130][ C0] unregister_fair_sched_group (kernel/sched/fair.c:379 kernel/sched/fair.c:13159)
> [ 13.554647][ C0] ? __pfx_sched_unregister_group_rcu (kernel/sched/core.c:8759)
> [ 13.555167][ C0] sched_unregister_group_rcu (kernel/sched/core.c:8707 kernel/sched/core.c:8761)
> [ 13.555621][ C0] rcu_do_batch (arch/x86/include/asm/preempt.h:26 kernel/rcu/tree.c:2576)
> [ 13.556008][ C0] rcu_core (kernel/rcu/tree.c:2845)
> [ 13.556373][ C0] handle_softirqs (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:555)
> [ 13.556775][ C0] __irq_exit_rcu (kernel/softirq.c:617 kernel/softirq.c:639)
> [ 13.557166][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1043 arch/x86/kernel/apic/apic.c:1043)
Looking at the call stack, any chance it is the following case that is
tripping this warning:
CPU0 CPU1
==== ====
wakeup_preempt(rq) {
resched_curr(rq); ------------------------> sysvec_reschedule_ipi() {
... scheduler_ipi();
if (task_on_rq_queued(rq->curr) && } /* sysvec_reschedule_ipi */
test_tsk_need_resched(rq->curr)) handle_softirqs() {
rq_clock_skip_update(rq); /* RCU_SOFTIRQ */
} /* wakeup_preempt(); */ rcu_core_si() {
unregister_fair_sched_group() {
/* On the rq that got the resched IPI */
if (se->sched_delayed) {
dequeue_entities() {
/* Trips RQCF_ACT_SKIP check within since waker called rq_clock_skip_update() */
} /* dequeue_entities(); */
} /* if() */
} /* unregister_fair_sched_group */
} /* rcu_core_si */
} /* handle_softirqs */
--
commit a64692a3afd8 ("sched: Cleanup/optimize clock updates") seems to
have added this condition to skip clock updates after an enqueue since
wakeup target will do a schedule() soon. Since delayed dequeue may has
entity to dequeue now, looking at rq_clock_pelt() in dequeue path will
trip this since PELT believes that the clock is stale. Could something
like the below work?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd4a6bf14828..c437b408d29b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13297,10 +13297,34 @@ void unregister_fair_sched_group(struct task_group *tg)
if (se->sched_delayed) {
guard(rq_lock_irqsave)(rq);
if (se->sched_delayed) {
+ /*
+ * We can reach here when processing RCU_SOFTIRQ on exit path from
+ * a reschedule IPI. wakeup_preempt() may have set RQCF_REQ_SKIP to
+ * skip a close clock update in schedule(), however, in presence of
+ * a delayed entity, this trips the check in rq_clock_pelt() which
+ * now believes the clock value is stale and needs updating. To
+ * prevent such situation, cancel any pending skip updates, and
+ * update the rq clock.
+ */
+ rq_clock_cancel_skipupdate(rq);
+
+ /*
+ * XXX: Will this trip WARN_DOUBLE_CLOCK? In which case, can
+ * rq_clock_cancel_skipupdate() be made to return a bool if
+ * RQCF_REQ_SKIP is set and we avoid this update?
+ */
update_rq_clock(rq);
+
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
+ /* Avoid updating the clock again if a schedule() is pending */
+ if (task_on_rq_queued(rq->curr) &&
+ test_tsk_need_resched(rq->curr))
+ rq_clock_skip_update(rq);
}
list_del_leaf_cfs_rq(cfs_rq);
+
+
}
remove_entity_load_avg(se);
}
--
I was not able to hit the SCHED_WARN_ON() reliably on my end when
launching a guest to confirm my hypothesis - I was able to hit it
approx. 1 in 10 run, but I've ran with the above changes 10s of
times and I've not hit the splat (yet!)
Note: Although the diff was updated for the latest
peterz-queue:sched/eevdf, all my testing was done on the previous
version on this branch based on v6.10.0 kernel.
--
Thanks and Regards,
Prateek
> [ 13.557638][ C0] </IRQ>
> [ 13.557920][ C0] <TASK>
> [ 13.560463][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:702)
> [ 13.560957][ C0] RIP: 0010:irqentry_exit_to_user_mode (kernel/entry/common.c:? include/linux/entry-common.h:328 kernel/entry/common.c:231)
> [ 13.561465][ C0] Code: 90 41 57 41 56 53 48 89 fb e8 63 90 1f ff 65 4c 8b 34 25 40 37 03 00 49 8b 3e f7 c7 0e 30 02 00 74 7c 49 89 ff e8 97 8f 1f ff <4c> 89 f8 a8 08 75 38 a9 00 10 00 00 75 40 a9 00 20 00 00 75 4b a9
> All code
> ========
> 0: 90 nop
> 1: 41 57 push %r15
> 3: 41 56 push %r14
> 5: 53 push %rbx
> 6: 48 89 fb mov %rdi,%rbx
> 9: e8 63 90 1f ff callq 0xffffffffff1f9071
> e: 65 4c 8b 34 25 40 37 mov %gs:0x33740,%r14
> 15: 03 00
> 17: 49 8b 3e mov (%r14),%rdi
> 1a: f7 c7 0e 30 02 00 test $0x2300e,%edi
> 20: 74 7c je 0x9e
> 22: 49 89 ff mov %rdi,%r15
> 25: e8 97 8f 1f ff callq 0xffffffffff1f8fc1
> 2a:* 4c 89 f8 mov %r15,%rax <-- trapping instruction
> 2d: a8 08 test $0x8,%al
> 2f: 75 38 jne 0x69
> 31: a9 00 10 00 00 test $0x1000,%eax
> 36: 75 40 jne 0x78
> 38: a9 00 20 00 00 test $0x2000,%eax
> 3d: 75 4b jne 0x8a
> 3f: a9 .byte 0xa9
>
> Code starting with the faulting instruction
> ===========================================
> 0: 4c 89 f8 mov %r15,%rax
> 3: a8 08 test $0x8,%al
> 5: 75 38 jne 0x3f
> 7: a9 00 10 00 00 test $0x1000,%eax
> c: 75 40 jne 0x4e
> e: a9 00 20 00 00 test $0x2000,%eax
> 13: 75 4b jne 0x60
> 15: a9 .byte 0xa9
> [ 13.562938][ C0] RSP: 0000:ffffbaef40613f38 EFLAGS: 00000202
> [ 13.563444][ C0] RAX: ffff9ef0287dd080 RBX: ffffbaef40613f58 RCX: 000000000000080b
> [ 13.564125][ C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000008
> [ 13.564796][ C0] RBP: 0000000000000000 R08: 000000000000002b R09: 000000000000002b
> [ 13.565456][ C0] R10: 0000000000000296 R11: 00000000ff8772ac R12: 0000000000000000
> [ 13.566115][ C0] R13: 0000000000000000 R14: ffff9ef0287dd080 R15: 0000000020000008
> [ 13.566782][ C0] ? irqentry_exit_to_user_mode (kernel/entry/common.c:? include/linux/entry-common.h:328 kernel/entry/common.c:231)
> [ 13.567254][ C0] asm_sysvec_reschedule_ipi (arch/x86/include/asm/idtentry.h:707)
> [ 13.567709][ C0] RIP: 0023:0x56652630
> [ 13.568072][ C0] Code: ff ff 83 c4 10 85 c0 74 05 83 c4 08 5b c3 83 ec 0c 8d 83 8d 6b ff ff 50 e8 cd 7c ff ff 8d b4 26 00 00 00 00 8d b6 00 00 00 00 <55> 57 56 53 e8 e7 61 ff ff 81 c3 13 38 01 00 83 ec 0c 8b 44 24 20
> All code
> ========
> 0: ff (bad)
> 1: ff 83 c4 10 85 c0 incl -0x3f7aef3c(%rbx)
> 7: 74 05 je 0xe
> 9: 83 c4 08 add $0x8,%esp
> c: 5b pop %rbx
> d: c3 retq
> e: 83 ec 0c sub $0xc,%esp
> 11: 8d 83 8d 6b ff ff lea -0x9473(%rbx),%eax
> 17: 50 push %rax
> 18: e8 cd 7c ff ff callq 0xffffffffffff7cea
> 1d: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
> 24: 8d b6 00 00 00 00 lea 0x0(%rsi),%esi
> 2a:* 55 push %rbp <-- trapping instruction
> 2b: 57 push %rdi
> 2c: 56 push %rsi
> 2d: 53 push %rbx
> 2e: e8 e7 61 ff ff callq 0xffffffffffff621a
> 33: 81 c3 13 38 01 00 add $0x13813,%ebx
> 39: 83 ec 0c sub $0xc,%esp
> 3c: 8b 44 24 20 mov 0x20(%rsp),%eax
>
> Code starting with the faulting instruction
> ===========================================
> 0: 55 push %rbp
> 1: 57 push %rdi
> 2: 56 push %rsi
> 3: 53 push %rbx
> 4: e8 e7 61 ff ff callq 0xffffffffffff61f0
> 9: 81 c3 13 38 01 00 add $0x13813,%ebx
> f: 83 ec 0c sub $0xc,%esp
> 12: 8b 44 24 20 mov 0x20(%rsp),%eax
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240729/202407291328.761441a7-oliver.sang@intel.com
>
>
>
Powered by blists - more mailing lists