linux-kernel - sched: update_entity_lag does not handle corner case with task in PI chain

Message-ID: <aPN7XBJbGhdWJDb2@uudg.org>
Date: Sat, 18 Oct 2025 08:34:52 -0300
From: "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Phil Auld <pauld@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Shizhao Chen <shichen@...hat.com>, linux-kernel@...r.kernel.org,
	Omar Sandoval <osandov@...com>, Xuewen Yan <xuewen.yan@...soc.com>
Subject: sched: update_entity_lag does not handle corner case with task in PI
 chain

Hello!

The underlying question here is what is the expected behavior of
update_entity_lag() in the context explained below...


--[ Short Description:

While running the sched_group_migration test from the CKI repository[1], which
migrates tasks between cpusets, Shizhao Chen reported hitting the warning
in update_entity_lag():

    WARN_ON_ONCE(!se->on_rq);

In short, update_entity_lag() is acting on a task that is sleeping while
waiting on a lock, with both on_rq and se->on_rq equal to zero.

When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
rt_mutex is involved in the process, rt_mutex_setprio() is called and may
eventually walk down a Priority Inheritance (PI) chain, adjusting the priorities
of the waiters in the chain. In such cases update_entity_lag() may be called.

What is the expected behavior in this case: should update_entity_lag() bail
out early, or should the function not be called at all?
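
To make the two alternatives in the question concrete, here is a minimal
user-space sketch. It is not kernel code: struct toy_se, toy_warnings and all
toy_* functions are hypothetical names invented for illustration, and the lag
value is simply passed in instead of being computed. Only the on_rq check
mirrors the WARN_ON_ONCE() quoted above.

```c
#include <stdbool.h>

/* Toy stand-in for a scheduler entity; NOT the kernel's struct sched_entity. */
struct toy_se {
	bool on_rq;	/* entity is currently queued on a runqueue */
	long vlag;	/* last lag value stored for this entity */
};

/* Counts how often the WARN_ON_ONCE()-like check would have fired. */
static int toy_warnings;

/* Current behavior in this model: complain, then update the lag anyway. */
static void toy_update_entity_lag(struct toy_se *se, long lag)
{
	if (!se->on_rq)
		toy_warnings++;
	se->vlag = lag;
}

/* Alternative A: bail out inside the function when the entity is not queued. */
static void toy_update_entity_lag_bail(struct toy_se *se, long lag)
{
	if (!se->on_rq)
		return;
	se->vlag = lag;
}

/* Alternative B: the caller skips the whole dequeue path for unqueued entities. */
static void toy_dequeue(struct toy_se *se, long lag)
{
	if (!se->on_rq)
		return;
	toy_update_entity_lag(se, lag);
	se->on_rq = false;
}
```

In this model, alternative A silently preserves the stale lag value, while
alternative B keeps the warning in place as a check on callers. Which of the
two matches the intended scheduler semantics is exactly the open question.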


--[ Additional Notes:

Reproducing the Problem:

  - Install sched_group_migration[1] and run it in a loop.
    (while : ; do runtest.sh; done)
  - In my experience, running the test with 4 CPUs reproduces the problem
    within 15 minutes. Setting "nr_cpus=4 max_cpus=4" on boot does the trick.


The scenario below is a simplification of the cases I observed while
investigating the problem:

    CPUn					CPUx

    task01 has rcu-state lock
    contends on another lock		
    (goes to sleep)
    --> on_rq=0 se.on_rq=0
					rcub/x contends on rcu-state lock
					  rcu_boost_kthread()
					    rt_mutex_setprio()
					      update_entity_lag(task01->se)
					        WARNING()


It could be that task01 and the task holding the lock wanted by task01 are
being migrated from one cpuset to another at that point. In any case, that
is not an error, so the problem seems to be that update_entity_lag() is called
on a task that violates one of its basic requirements (se->on_rq must be set).
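
The scenario can also be walked through with a small user-space model. This is
only a sketch under stated assumptions: toy_task, toy_lock, toy_setprio() and
toy_block_on() are hypothetical names, and the priority change does its
off-runqueue bookkeeping unconditionally on purpose, to mimic the symptom. It
does not claim to reproduce what rt_mutex_adjust_prio_chain() or
rt_mutex_setprio() actually do.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_task;

/* Toy lock: only tracks its owner. */
struct toy_lock {
	struct toy_task *owner;
};

/* Toy task: NOT the kernel's struct task_struct. */
struct toy_task {
	int prio;			/* lower value means higher priority */
	bool on_rq;			/* false while sleeping on a lock */
	struct toy_lock *blocked_on;	/* NULL when not waiting on a lock */
};

/* Counts priority changes applied to tasks that are not on a runqueue. */
static int toy_off_rq_updates;

/* ASSUMPTION: always does the bookkeeping, to mimic the reported symptom. */
static void toy_setprio(struct toy_task *t, int prio)
{
	if (!t->on_rq)
		toy_off_rq_updates++;
	t->prio = prio;
}

/*
 * The waiter goes to sleep on the lock, then the chain of lock owners is
 * walked and every lower-priority owner is boosted to the waiter's priority.
 * Returns the number of boosted tasks.
 */
static int toy_block_on(struct toy_task *waiter, struct toy_lock *lock)
{
	int boosted = 0;
	int depth = 0;
	struct toy_lock *l;

	waiter->blocked_on = lock;
	waiter->on_rq = false;

	/* The depth limit guards against cycles in this toy model. */
	for (l = lock; l && l->owner && depth < 16; l = l->owner->blocked_on, depth++) {
		struct toy_task *owner = l->owner;

		if (owner->prio <= waiter->prio)
			break;	/* owner already runs at this priority or higher */
		toy_setprio(owner, waiter->prio);
		boosted++;
	}
	return boosted;
}
```

With task01 owning one lock and sleeping on a second one, a higher-priority
waiter blocking on the first lock boosts task01 while its on_rq is false, which
is the step where the warning fires in the trace below.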


The resulting backtrace is:

[ 1805.450470] ------------[ cut here ]------------
[ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
[ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
[ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
[ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
[ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
[ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
[ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
[ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
[ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
[ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
[ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
[ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
[ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
[ 1805.626985] PKRU: 55555554
[ 1805.629696] Call Trace:
[ 1805.632150]  <TASK>
[ 1805.634258]  dequeue_entity+0x90/0x4f0
[ 1805.638012]  dequeue_entities+0xc9/0x6b0
[ 1805.641935]  dequeue_task_fair+0x8a/0x190
[ 1805.645949]  ? sched_clock+0x10/0x30
[ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
[ 1805.653541]  rt_mutex_adjust_prio_chain+0x71c/0xa40
[ 1805.658421]  task_blocks_on_rt_mutex.constprop.0+0x20c/0x4a0
[ 1805.664081]  __rt_mutex_slowlock.constprop.0+0x53/0x1d0
[ 1805.669305]  __rt_mutex_slowlock_locked.constprop.0+0x48/0x70
[ 1805.675051]  rt_mutex_slowlock.constprop.0+0x4d/0xd0
[ 1805.680016]  rcu_boost_kthread+0xd5/0x2d0
[ 1805.684030]  ? __pfx_rcu_boost_kthread+0x10/0x10
[ 1805.688646]  kthread+0x108/0x250
[ 1805.691880]  ? migrate_enable+0xd1/0xf0
[ 1805.695719]  ? __pfx_kthread+0x10/0x10
[ 1805.699473]  ret_from_fork+0x116/0x130
[ 1805.703226]  ? __pfx_kthread+0x10/0x10
[ 1805.706978]  ret_from_fork_asm+0x1a/0x30
[ 1805.710908]  </TASK>


Please let me know if what I reported above is enough to understand the problem
and design/suggest a solution. I tried to organize the scattered information
bits as well as possible.

Best regards,
Luis

[1] https://gitlab.com/cki-project/kernel-tests/-/archive/main/kernel-tests-main.zip#general/scheduler/sched_group_migration/