Message-ID: <aPN7XBJbGhdWJDb2@uudg.org>
Date: Sat, 18 Oct 2025 08:34:52 -0300
From: "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Phil Auld <pauld@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Shizhao Chen <shichen@...hat.com>, linux-kernel@...r.kernel.org,
	Omar Sandoval <osandov@...com>, Xuewen Yan <xuewen.yan@...soc.com>
Subject: sched: update_entity_lag does not handle corner case with task in PI
 chain

Hello!

The underlying question here is what is the expected behavior of
update_entity_lag() in the context explained below...


--[ Short Description:

While running the sched_group_migration test from the CKI repository[1], which
migrates tasks between cpusets, Shizhao Chen reported hitting the warning
in update_entity_lag():

    WARN_ON_ONCE(!se->on_rq);

In short, update_entity_lag() is acting on a task that is waiting on a lock,
sleeping, with both on_rq and se->on_rq equal to zero.

When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
rt_mutex is involved in the process, rt_mutex_setprio() is called and may
eventually walk down a Priority Inheritance chain, adjusting the priorities
of the waiters in the chain. In such cases update_entity_lag() may be called.
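
To illustrate the chain walk described above, here is a minimal userspace
sketch of priority propagation along a PI chain. It is not kernel code:
struct pi_task, its fields and pi_boost_chain() are hypothetical names made
up for this example, and the real rt_mutex_adjust_prio_chain() deals with
locking, requeueing and deadlock detection that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task in a PI chain; NOT the kernel's task_struct. */
struct pi_task {
    int prio;                    /* lower value = higher priority */
    int sleeping;                /* 1 if the task is blocked, off the runqueue */
    struct pi_task *lock_owner;  /* owner of the lock this task waits on, or NULL */
};

/*
 * Propagate waiter_prio down the chain, starting at the owner of the
 * contended lock. Stops at the first task that already has an equal or
 * higher priority. Returns how many *sleeping* tasks had their priority
 * changed, i.e. how often a set-priority path ran on a task that is not
 * on a runqueue.
 */
static int pi_boost_chain(struct pi_task *owner, int waiter_prio)
{
    int sleeping_adjusted = 0;

    assert(waiter_prio >= 0);

    for (struct pi_task *t = owner; t != NULL; t = t->lock_owner) {
        if (t->prio <= waiter_prio)
            break;               /* nothing to boost further down */
        t->prio = waiter_prio;
        if (t->sleeping)
            sleeping_adjusted++;
    }
    return sleeping_adjusted;
}
```

In this model, a booster contending on a lock owned by a sleeping task
adjusts that task's priority and then continues to the owner of the lock
the sleeper is waiting on, so the set-priority step is reached for a task
with on_rq == 0.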

What is the expected behavior for this case: should update_entity_lag() bail
out early, or should the caller avoid calling the function entirely?
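
To make the two options concrete, here is a small userspace model of them.
It is only a sketch and not a proposed patch: struct model_se, model_warnings
and the model_*() functions are hypothetical names, the lag computation is
reduced to a clamp of (avg_vruntime - vruntime), and it assumes that the
current function warns and then continues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for struct sched_entity. */
struct model_se {
    bool on_rq;
    int64_t vruntime;
    int64_t vlag;
};

/* Counts how often the model of WARN_ON_ONCE(!se->on_rq) would fire. */
static int model_warnings;

static int64_t model_clamp(int64_t v, int64_t limit)
{
    assert(limit >= 0);
    if (v < -limit)
        return -limit;
    if (v > limit)
        return limit;
    return v;
}

/* Model of the assumed current behavior: warn, then update the lag anyway. */
static void model_update_lag(struct model_se *se, int64_t avg_vruntime,
                             int64_t limit)
{
    if (!se->on_rq)
        model_warnings++;
    se->vlag = model_clamp(avg_vruntime - se->vruntime, limit);
}

/* Option 1: the function itself bails out when the entity is not queued. */
static bool model_update_lag_bail(struct model_se *se, int64_t avg_vruntime,
                                  int64_t limit)
{
    if (!se->on_rq)
        return false;            /* leave se->vlag untouched */
    se->vlag = model_clamp(avg_vruntime - se->vruntime, limit);
    return true;
}

/* Option 2: the caller checks on_rq and never calls the function. */
static void model_dequeue_guarded(struct model_se *se, int64_t avg_vruntime,
                                  int64_t limit)
{
    if (se->on_rq)
        model_update_lag(se, avg_vruntime, limit);
}
```

With a sleeping entity (on_rq == false), both options leave vlag unchanged
and raise no warning in this model; they differ only in whether the check
lives in the callee or in the caller.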


--[ Additional Notes:

Reproducing the Problem:

  - Install sched_group_migration[1] and run it in a loop.
    (while : ;  do runtest.sh; done)
  - In my experience, running the test with 4 CPUs reproduces the problem
    within 15 minutes. Setting "nr_cpus=4 maxcpus=4" on boot does the trick.


The scenario below is a simplification of the cases I observed while
investigating the problem:

    CPUn					CPUx

    task01 has rcu-state lock
    contends on another lock		
    (goes to sleep)
    --> on_rq=0 se.on_rq=0
					rcub/x contends on rcu-state lock
					  rcu_boost_kthread()
					    rt_mutex_setprio()
					      update_entity_lag(task01->se)
					        WARNING()


It could be that task01 and the task holding the lock wanted by task01 are
being migrated from one cpuset to another at that point. In any case, that
is not an error, so the problem seems to be update_entity_lag() being called
to work on a task that violates a basic requirement (se->on_rq must be set).


The resulting backtrace is:

[ 1805.450470] ------------[ cut here ]------------
[ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
[ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
[ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
[ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
[ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
[ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
[ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
[ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
[ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
[ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
[ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
[ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
[ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
[ 1805.626985] PKRU: 55555554
[ 1805.629696] Call Trace:
[ 1805.632150]  <TASK>
[ 1805.634258]  dequeue_entity+0x90/0x4f0
[ 1805.638012]  dequeue_entities+0xc9/0x6b0
[ 1805.641935]  dequeue_task_fair+0x8a/0x190
[ 1805.645949]  ? sched_clock+0x10/0x30
[ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
[ 1805.653541]  rt_mutex_adjust_prio_chain+0x71c/0xa40
[ 1805.658421]  task_blocks_on_rt_mutex.constprop.0+0x20c/0x4a0
[ 1805.664081]  __rt_mutex_slowlock.constprop.0+0x53/0x1d0
[ 1805.669305]  __rt_mutex_slowlock_locked.constprop.0+0x48/0x70
[ 1805.675051]  rt_mutex_slowlock.constprop.0+0x4d/0xd0
[ 1805.680016]  rcu_boost_kthread+0xd5/0x2d0
[ 1805.684030]  ? __pfx_rcu_boost_kthread+0x10/0x10
[ 1805.688646]  kthread+0x108/0x250
[ 1805.691880]  ? migrate_enable+0xd1/0xf0
[ 1805.695719]  ? __pfx_kthread+0x10/0x10
[ 1805.699473]  ret_from_fork+0x116/0x130
[ 1805.703226]  ? __pfx_kthread+0x10/0x10
[ 1805.706978]  ret_from_fork_asm+0x1a/0x30
[ 1805.710908]  </TASK>


Please let me know if what I reported above is enough to understand the problem
and design/suggest a solution. I tried to organize the scattered information
bits as well as possible.

Best regards,
Luis

[1] https://gitlab.com/cki-project/kernel-tests/-/archive/main/kernel-tests-main.zip#general/scheduler/sched_group_migration/

