linux-kernel - Re: sched: update_entity_lag does not handle corner case with task in PI chain

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251018195730.GJ3419281@noisy.programming.kicks-ass.net>
Date: Sat, 18 Oct 2025 21:57:30 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Phil Auld <pauld@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Shizhao Chen <shichen@...hat.com>, linux-kernel@...r.kernel.org,
	Omar Sandoval <osandov@...com>, Xuewen Yan <xuewen.yan@...soc.com>
Subject: Re: sched: update_entity_lag does not handle corner case with task
 in PI chain

On Sat, Oct 18, 2025 at 08:34:52AM -0300, Luis Claudio R. Goncalves wrote:
> Hello!
> 

> While running sched_group_migration test from CKI repository[1], which

What's a CKI ?

> migrates tasks between cpusets, Shizhao Chen reports hitting the warning
> in update_entity_lag():
> 
>     WARN_ON_ONCE(!se->on_rq);
> 
> In short, update_entity_lag() is acting on a task that is waiting on a lock,
> sleeping, with both on_rq and se->on_rq equal to zero.

You can't get to where you are with p->on_rq being zero.

> When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
> rt_mutex is involved in the process, rt_mutex_setprio() is called and may
> eventually walk down a Priority Inheritance chain, adjusting the priorities
> of the waiters in the chain. In such cases update_entity_lag() may be called.
> 
> What is the expected behavior for this case, to bail out of update_entity_lag()
> or avoid calling the function entirely?
> 
> 
> --[ Additional Notes:
> 
> Reproducing the Problem:
> 
>   - Install sched_group_migration[1] and run it on a loop.
>     (while : ;  do runtest.sh; done)
>   - In my experience, running the test with 4 CPUs reproduces the problem
>     within 15 minutes. Setting "nr_cpus=4 max_cpus=4" on boot does the trick.
> 
> 
> The scenario below is a simplification of the cases I observed while
> investigating the problem:
> 
>     CPUn					CPUx
> 
>     task01 has rcu-state lock
>     contends on another lock		
>     (goes to sleep)
>     --> on_rq=0 se.on_rq=0
> 					rcub/x contends on rcu-state lock
> 					  rcu_boost_kthread()
> 					    rt_set_prio()
> 					      update_entity_lag(task01->se)
> 					        WARNING()

There is a whole lot wrong with this, firstly there is no rt_set_prio()
function, and update_entity_lag() isn't directly called by it.
Additionally, you should never get to update_entity_lag() if !p->on_rq,
see below:

> [ 1805.450470] ------------[ cut here ]------------
> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_th
> ermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif in
> tel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_
> ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
>  nfnetlink
> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9
>  8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
> [ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
> [ 1805.626985] PKRU: 55555554
> [ 1805.629696] Call Trace:
> [ 1805.632150]  <TASK>
> [ 1805.634258]  dequeue_entity+0x90/0x4f0
> [ 1805.638012]  dequeue_entities+0xc9/0x6b0
> [ 1805.641935]  dequeue_task_fair+0x8a/0x190
> [ 1805.645949]  ? sched_clock+0x10/0x30
> [ 1805.649527]  rt_mutex_setprio+0x318/0x4b0

So we have:

rt_mutex_setprio()

  rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held

  ...

  queued = task_on_rq_queued(rq); // basically reads p->on_rq
  running = task_current_donor()
  if (queued)
    dequeue_task(rq, p, queue_flags);
      dequeue_task_fair()
        dequeue_entities()
	  dequeue_entity()
	    update_entity_lag()
	      WARN_ON_ONCE(se->on_rq);

So the only way to get here is if: rq->on_rq is in fact !0 *and*
se->on_rq is zero.

And I'm not at all sure how one would get into such a state.