Message-ID: <c10f6fda-aa8c-4d8e-a315-3c084af08862@amd.com>
Date: Tue, 21 Oct 2025 12:38:17 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, "Luis Claudio R. Goncalves"
<lgoncalv@...hat.com>
CC: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, "Phil
Auld" <pauld@...hat.com>, Valentin Schneider <vschneid@...hat.com>, "Steven
Rostedt" <rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Shizhao Chen <shichen@...hat.com>,
<linux-kernel@...r.kernel.org>, Omar Sandoval <osandov@...com>, Xuewen Yan
<xuewen.yan@...soc.com>
Subject: Re: sched: update_entity_lag does not handle corner case with task in
PI chain
Hello Peter, Luis,
On 10/19/2025 1:27 AM, Peter Zijlstra wrote:
>> [ 1805.450470] ------------[ cut here ]------------
>> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
>> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_th
>> ermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif in
>> tel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_
>> ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
>> nfnetlink
>> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT
>> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
>> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
>> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9
>> 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
>> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
>> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
>> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
>> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
>> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
>> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 1805.606020] FS: 0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
>> [ 1805.614107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
>> [ 1805.626985] PKRU: 55555554
>> [ 1805.629696] Call Trace:
>> [ 1805.632150] <TASK>
>> [ 1805.634258] dequeue_entity+0x90/0x4f0
>> [ 1805.638012] dequeue_entities+0xc9/0x6b0
>> [ 1805.641935] dequeue_task_fair+0x8a/0x190
>> [ 1805.645949] ? sched_clock+0x10/0x30
>> [ 1805.649527] rt_mutex_setprio+0x318/0x4b0
>
> So we have:
>
>   rt_mutex_setprio()
>
>     rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held
>
>     ...
>
>     queued = task_on_rq_queued(rq); // basically reads p->on_rq
>     running = task_current_donor()
>     if (queued)
>       dequeue_task(rq, p, queue_flags);
>         dequeue_task_fair()
>           dequeue_entities()
>             dequeue_entity()
>               update_entity_lag()
>                 WARN_ON_ONCE(se->on_rq);
>
> So the only way to get here is if: rq->on_rq is in fact !0 *and*
> se->on_rq is zero.
>
> And I'm not at all sure how one would get into such a state.
This looks like something that can happen when a delayed task is
dequeued from a throttled hierarchy. Matt reported a similar
problem with wait_task_inactive() in
https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/
    rt_mutex_setprio()
      ...
      if (prev_class != next_class && p->se.sched_delayed)
        dequeue_task(rq, p, DEQUEUE_DELAYED)
          dequeue_entities(se = &p->se)
            dequeue_entity(se)
              se->on_rq = 0; /* se->on_rq turns 0 here */
            ...
            if (cfs_rq_throttled(cfs_rq))
              return 0; /* Early return before __block_task() */
      ...
      /* __block_task() not called; task_on_rq_queued() returns true. */
      queued = task_on_rq_queued(p);
      ...
      if (queued)
        dequeue_task(rq, p, queue_flag)
          dequeue_entities(se = &p->se)
            dequeue_entity(se)
              update_entity_lag(se)
                WARN_ON_ONCE(!se->on_rq)
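To make the state mismatch easier to follow, here is a toy user-space model of the sequence above. It is only a sketch and not kernel code: toy_task, toy_dequeue(), toy_setprio() and toy_warnings are hypothetical names made up for illustration, and the model assumes the class change and the delayed state as given. It only tracks the two on_rq flags and counts how often the update_entity_lag() check would fire.

```c
#include <stdbool.h>

/* Toy model only; names are hypothetical, fields mirror the thread. */
struct toy_task {
	int  on_rq;          /* models p->on_rq, read by task_on_rq_queued() */
	int  se_on_rq;       /* models p->se.on_rq */
	bool sched_delayed;  /* models p->se.sched_delayed */
};

/* Counts how often WARN_ON_ONCE(!se->on_rq) would have fired. */
static int toy_warnings;

/*
 * Models one dequeue pass: se->on_rq is cleared first, and a throttled
 * hierarchy returns before the step standing in for __block_task(),
 * so p->on_rq stays set.
 */
static void toy_dequeue(struct toy_task *p, bool throttled)
{
	if (!p->se_on_rq)
		toy_warnings++;      /* update_entity_lag() check */
	p->se_on_rq = 0;             /* se->on_rq turns 0 here */

	if (throttled)
		return;              /* early return, p->on_rq untouched */

	p->on_rq = 0;                /* stands in for __block_task() */
	p->sched_delayed = false;
}

/* Models the two dequeue sites in rt_mutex_setprio() for a delayed task. */
int toy_setprio(bool throttled)
{
	struct toy_task p = { .on_rq = 1, .se_on_rq = 1, .sched_delayed = true };

	toy_warnings = 0;

	if (p.sched_delayed)             /* class change assumed */
		toy_dequeue(&p, throttled);  /* DEQUEUE_DELAYED pass */

	if (p.on_rq)                     /* queued = task_on_rq_queued(p) */
		toy_dequeue(&p, throttled);  /* second dequeue hits the check */

	return toy_warnings;
}
```

In this model toy_setprio(true) reports one warning, because the second pass sees se_on_rq already cleared, while toy_setprio(false) reports none, because the first pass also clears on_rq and the second dequeue is skipped.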
v6.18 kernels get rid of this issue as part of the per-task throttle
feature, and stable should soon pick up the fix for it from that thread.
--
Thanks and Regards,
Prateek