[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20251018195730.GJ3419281@noisy.programming.kicks-ass.net>
Date: Sat, 18 Oct 2025 21:57:30 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Phil Auld <pauld@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Shizhao Chen <shichen@...hat.com>, linux-kernel@...r.kernel.org,
Omar Sandoval <osandov@...com>, Xuewen Yan <xuewen.yan@...soc.com>
Subject: Re: sched: update_entity_lag does not handle corner case with task
in PI chain
On Sat, Oct 18, 2025 at 08:34:52AM -0300, Luis Claudio R. Goncalves wrote:
> Hello!
>
> While running sched_group_migration test from CKI repository[1], which
What's a CKI ?
> migrates tasks between cpusets, Shizhao Chen reports hitting the warning
> in update_entity_lag():
>
> WARN_ON_ONCE(!se->on_rq);
>
> In short, update_entity_lag() is acting on a task that is waiting on a lock,
> sleeping, with both on_rq and se->on_rq equal to zero.
You can't get to where you are with p->on_rq being zero.
> When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
> rt_mutex is involved in the process, rt_mutex_setprio() is called and may
> eventually walk down a Priority Inheritance chain, adjusting the priorities
> of the waiters in the chain. In such cases update_entity_lag() may be called.
>
> What is the expected behavior for this case, to bail out of update_entity_lag()
> or avoid calling the function entirely?
>
>
> --[ Additional Notes:
>
> Reproducing the Problem:
>
> - Install sched_group_migration[1] and run it on a loop.
> (while : ; do runtest.sh; done)
> - In my experience, running the test with 4 CPUs reproduces the problem
> within 15 minutes. Setting "nr_cpus=4 max_cpus=4" on boot does the trick.
>
>
> The scenario below is a simplification of the cases I observed while
> investigating the problem:
>
> CPUn CPUx
>
> task01 has rcu-state lock
> contends on another lock
> (goes to sleep)
> --> on_rq=0 se.on_rq=0
> rcub/x contends on rcu-state lock
> rcu_boost_kthread()
> rt_set_prio()
> update_entity_lag(task01->se)
> WARNING()
There is a whole lot wrong with this, firstly there is no rt_set_prio()
function, and update_entity_lag() isn't directly called by it.
Additionally, you should never get to update_entity_lag() if !p->on_rq,
see below:
> [ 1805.450470] ------------[ cut here ]------------
> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_th
> ermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif in
> tel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_
> ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
> nfnetlink
> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT
> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9
> 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1805.606020] FS: 0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
> [ 1805.614107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
> [ 1805.626985] PKRU: 55555554
> [ 1805.629696] Call Trace:
> [ 1805.632150] <TASK>
> [ 1805.634258] dequeue_entity+0x90/0x4f0
> [ 1805.638012] dequeue_entities+0xc9/0x6b0
> [ 1805.641935] dequeue_task_fair+0x8a/0x190
> [ 1805.645949] ? sched_clock+0x10/0x30
> [ 1805.649527] rt_mutex_setprio+0x318/0x4b0
So we have:
rt_mutex_setprio()
rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held
...
queued = task_on_rq_queued(rq); // basically reads p->on_rq
running = task_current_donor()
if (queued)
dequeue_task(rq, p, queue_flags);
dequeue_task_fair()
dequeue_entities()
dequeue_entity()
update_entity_lag()
WARN_ON_ONCE(se->on_rq);
So the only way to get here is if: rq->on_rq is in fact !0 *and*
se->on_rq is zero.
And I'm not at all sure how one would get into such a state.
Powered by blists - more mailing lists