Message-ID: <aPYWNxnVVLZ8AzXa@uudg.org>
Date: Mon, 20 Oct 2025 08:00:07 -0300
From: "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Phil Auld <pauld@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Shizhao Chen <shichen@...hat.com>, linux-kernel@...r.kernel.org,
Omar Sandoval <osandov@...com>, Xuewen Yan <xuewen.yan@...soc.com>
Subject: Re: sched: update_entity_lag does not handle corner case with task
in PI chain
On Sat, Oct 18, 2025 at 09:57:30PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 18, 2025 at 08:34:52AM -0300, Luis Claudio R. Goncalves wrote:
> > Hello!
> >
>
> > While running sched_group_migration test from CKI repository[1], which
>
> What's a CKI ?
Continuous Kernel Integration (https://cki-project.org/). That's not
relevant to the problem; I just wanted to mention the source of the test in
case there are other tests with the same name or other versions in different
repositories.
> > migrates tasks between cpusets, Shizhao Chen reports hitting the warning
> > in update_entity_lag():
> >
> > WARN_ON_ONCE(!se->on_rq);
> >
> > In short, update_entity_lag() is acting on a task that is waiting on a lock,
> > sleeping, with both on_rq and se->on_rq equal to zero.
>
> You can't get to where you are with p->on_rq being zero.
I can check the other vmcores I have, but I am certain that I saw this in
at least two of the vmcores I analyzed:
crash> task -R pi_blocked_on,prio,rt_priority,on_rq,se.on_rq 0xffff8955813c2180 0xffff895926cb4300
PID: 19 TASK: ffff8955813c2180 CPU: 2 COMMAND: "rcub/0"
pi_blocked_on = 0xffffcc9e802f7de0, <------- blocked on a lock held by the thread below
prio = 98,
rt_priority = 1,
on_rq = 1,
se.on_rq = 0 '\000',
PID: 445515 TASK: ffff895926cb4300 CPU: 0 COMMAND: "bz1738415-test"
pi_blocked_on = 0xffffcc9ea19f7b70, <------- waiting on a lock
prio = 98,
rt_priority = 0,
on_rq = 0, <-------
se.on_rq = 0 '\000', <-------
In the vmcores I collected, the thread blocking rcub/X (holding the lock) is
itself blocked on another test thread, waiting on a different lock. Since in
some of the vmcores this second lock has no owner, it looks like the lock had
just been released. The timing was perfect to hit this apparent corner case.
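For reference, the check that fires sits at the top of update_entity_lag()
in kernel/sched/fair.c; roughly this (paraphrased from the 6.17 sources,
details may differ):

    static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
            s64 vlag, limit;

            WARN_ON_ONCE(!se->on_rq);

            vlag = avg_vruntime(cfs_rq) - se->vruntime;
            limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);

            se->vlag = clamp(vlag, -limit, limit);
    }

So the warning is purely about se->on_rq being zero at dequeue time.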
> > When a stalled RCU grace period occurs, rcu_boost_kthread() is called. If an
> > rt_mutex is involved in the process, rt_mutex_setprio() is called and may
> > eventually walk down a Priority Inheritance chain, adjusting the priorities
> > of the waiters in the chain. In such cases update_entity_lag() may be called.
> >
> > What is the expected behavior for this case, to bail out of update_entity_lag()
> > or avoid calling the function entirely?
> >
> >
> > --[ Additional Notes:
> >
> > Reproducing the Problem:
> >
> > - Install sched_group_migration[1] and run it in a loop:
> >   (while : ; do runtest.sh; done)
> > - In my experience, running the test with 4 CPUs reproduces the problem
> > within 15 minutes. Setting "nr_cpus=4 max_cpus=4" on boot does the trick.
> >
> >
> > The scenario below is a simplification of the cases I observed while
> > investigating the problem:
> >
> > CPUn CPUx
> >
> > task01 has rcu-state lock
> > contends on another lock
> > (goes to sleep)
> > --> on_rq=0 se.on_rq=0
> > rcub/x contends on rcu-state lock
> > rcu_boost_kthread()
> > rt_set_prio()
> > update_entity_lag(task01->se)
> > WARNING()
>
> There is a whole lot wrong with this, firstly there is no rt_set_prio()
> function, and update_entity_lag() isn't directly called by it.
> Additionally, you should never get to update_entity_lag() if !p->on_rq,
> see below:
Sorry, I meant rt_mutex_setprio(). My mistake. And yes, I oversimplified
the note. The idea is that rt_mutex_setprio() could eventually result in
a call to update_entity_lag() down the chain.
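Concretely, the path I had in mind matches the stack trace quoted below; in
rough call-chain form (function names from 6.17, exact arguments elided):

    rt_mutex_setprio(p, pi_task)
      dequeue_task(rq, p, queue_flags)
        dequeue_task_fair(rq, p, flags)
          dequeue_entities(rq, &p->se, flags)
            dequeue_entity(cfs_rq, se, flags)
              update_entity_lag(cfs_rq, se)
                WARN_ON_ONCE(!se->on_rq);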
> > [ 1805.450470] ------------[ cut here ]------------
> > [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
> > [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency
> > intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_thermal
> > intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill
> > irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif intel_cstate
> > vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me
> > i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_ipmi ipmi_devintf
> > ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci
> > libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod nfnetlink
> > [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT
> > [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
> > [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
> > [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9 8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
> > [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
> > [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
> > [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
> > [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
> > [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
> > [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > [ 1805.606020] FS: 0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
> > [ 1805.614107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
> > [ 1805.626985] PKRU: 55555554
> > [ 1805.629696] Call Trace:
> > [ 1805.632150] <TASK>
> > [ 1805.634258] dequeue_entity+0x90/0x4f0
> > [ 1805.638012] dequeue_entities+0xc9/0x6b0
> > [ 1805.641935] dequeue_task_fair+0x8a/0x190
> > [ 1805.645949] ? sched_clock+0x10/0x30
> > [ 1805.649527] rt_mutex_setprio+0x318/0x4b0
>
> So we have:
>
> rt_mutex_setprio()
>
>     rq = __task_rq_lock(p, ..);		// this asserts p->pi_lock is held
>
>     ...
>
>     queued = task_on_rq_queued(p);	// basically reads p->on_rq
>     running = task_current_donor(rq, p);
>     if (queued)
>       dequeue_task(rq, p, queue_flags);
>         dequeue_task_fair()
>           dequeue_entities()
>             dequeue_entity()
>               update_entity_lag()
>                 WARN_ON_ONCE(!se->on_rq);
>
> So the only way to get here is if: p->on_rq is in fact !0 *and*
> se->on_rq is zero.
Assuming the vmcores I collected are not damaged, and that the simple crash
command I used earlier to display the on_rq and se.on_rq fields of the
threads is correct, is there a chance that the sequence above could be
disturbed by the thread being concurrently moved from one cpuset to another?
> And I'm not at all sure how one would get into such a state.
Sorry again for the convoluted report, but I am also trying to make sense
of the results observed. Is there anything I could do in terms of tests or
information that could help shed some light here?
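For instance, would a debug check along these lines be useful? A rough
sketch only, not a fix; the placement in rt_mutex_setprio() right after
__task_rq_lock() is my assumption, to catch the mismatch before the
dequeue path runs:

    /* Debug sketch only: flag a queued fair task whose scheduling
     * entity is not on the runqueue, i.e. the state that later trips
     * WARN_ON_ONCE(!se->on_rq) in update_entity_lag(). */
    queued = task_on_rq_queued(p);
    if (queued && p->sched_class == &fair_sched_class && !p->se.on_rq)
            WARN_ONCE(1, "%s/%d: p->on_rq=%d but se.on_rq=%d before dequeue\n",
                      p->comm, p->pid, p->on_rq, p->se.on_rq);

I can run the reproducer with something like that applied and report back.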
Best,
Luis