linux-kernel - Re: sched: update_entity_lag does not handle corner case with task in PI chain


Message-ID: <c10f6fda-aa8c-4d8e-a315-3c084af08862@amd.com>
Date: Tue, 21 Oct 2025 12:38:17 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, "Luis Claudio R. Goncalves"
	<lgoncalv@...hat.com>
CC: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, "Phil
 Auld" <pauld@...hat.com>, Valentin Schneider <vschneid@...hat.com>, "Steven
 Rostedt" <rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>, Shizhao Chen <shichen@...hat.com>,
	<linux-kernel@...r.kernel.org>, Omar Sandoval <osandov@...com>, Xuewen Yan
	<xuewen.yan@...soc.com>
Subject: Re: sched: update_entity_lag does not handle corner case with task in
 PI chain

Hello Peter, Luis,

On 10/19/2025 1:27 AM, Peter Zijlstra wrote:
>> [ 1805.450470] ------------[ cut here ]------------
>> [ 1805.450474] WARNING: CPU: 2 PID: 19 at kernel/sched/fair.c:697 update_entity_lag+0x5b/0x70
>> [ 1805.463366] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common skx_edac skx_edac_common nfit libnvdimm x86_pkg_temp_th
>> ermal intel_powerclamp coretemp kvm_intel kvm platform_profile dell_wmi sparse_keymap rfkill irqbypass iTCO_wdt video mgag200 rapl iTCO_vendor_support dell_smbios ipmi_ssif in
>> tel_cstate vfat dcdbas wmi_bmof intel_uncore dell_wmi_descriptor pcspkr fat i2c_algo_bit lpc_ich mei_me i2c_i801 i2c_smbus mei intel_pch_thermal ipmi_si acpi_power_meter acpi_
>> ipmi ipmi_devintf ipmi_msghandler sg fuse loop xfs sd_mod i40e ghash_clmulni_intel libie libie_adminq ahci libahci tg3 libata wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
>>  nfnetlink
>> [ 1805.525160] CPU: 2 UID: 0 PID: 19 Comm: rcub/0 Kdump: loaded Not tainted 6.17.1-rt5 #1 PREEMPT_RT 
>> [ 1805.534113] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 2.21.1 03/07/2024
>> [ 1805.541678] RIP: 0010:update_entity_lag+0x5b/0x70
>> [ 1805.546385] Code: 42 f8 48 81 3b 00 00 10 00 75 23 48 89 fa 48 f7 da 48 39 ea 48 0f 4c d5 48 39 fd 48 0f 4d d7 48 89 53 78 5b 5d c3 cc cc cc cc <0f> 0b eb b1 48 89 de e8 b9
>>  8c ff ff 48 89 c7 eb d0 0f 1f 40 00 90
>> [ 1805.565130] RSP: 0000:ffffcc9e802f7b90 EFLAGS: 00010046
>> [ 1805.570358] RAX: 0000000000000000 RBX: ffff8959080c0080 RCX: 0000000000000000
>> [ 1805.577488] RDX: 0000000000000000 RSI: ffff8959080c0080 RDI: ffff895592cc1c00
>> [ 1805.584622] RBP: ffff895592cc1c00 R08: 0000000000008800 R09: 0000000000000000
>> [ 1805.591756] R10: 0000000000000001 R11: 0000000000200b20 R12: 000000000000000e
>> [ 1805.598886] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 1805.606020] FS:  0000000000000000(0000) GS:ffff895947da2000(0000) knlGS:0000000000000000
>> [ 1805.614107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1805.619853] CR2: 00007f655816ed40 CR3: 00000004ab854006 CR4: 00000000007726f0
>> [ 1805.626985] PKRU: 55555554
>> [ 1805.629696] Call Trace:
>> [ 1805.632150]  <TASK>
>> [ 1805.634258]  dequeue_entity+0x90/0x4f0
>> [ 1805.638012]  dequeue_entities+0xc9/0x6b0
>> [ 1805.641935]  dequeue_task_fair+0x8a/0x190
>> [ 1805.645949]  ? sched_clock+0x10/0x30
>> [ 1805.649527]  rt_mutex_setprio+0x318/0x4b0
> 
> So we have:
> 
> rt_mutex_setprio()
> 
>   rq = __task_rq_lock(p, ..); // this asserts p->pi_lock is held
> 
>   ...
> 
>   queued = task_on_rq_queued(rq); // basically reads p->on_rq
>   running = task_current_donor()
>   if (queued)
>     dequeue_task(rq, p, queue_flags);
>       dequeue_task_fair()
>         dequeue_entities()
> 	  dequeue_entity()
> 	    update_entity_lag()
> 	      WARN_ON_ONCE(se->on_rq);
> 
> So the only way to get here is if: rq->on_rq is in fact !0 *and*
> se->on_rq is zero.
> 
> And I'm not at all sure how one would get into such a state.

This looks like something that can happen when a delayed task is
dequeued from a throttled hierarchy. Matt reported a similar
problem with wait_task_inactive() in
https://lore.kernel.org/all/20250925133310.1843863-1-matt@readmodwrite.com/

rt_mutex_setprio()
  ...
  if (prev_class != next_class && p->se.sched_delayed)
    dequeue_task(rq, p, DEQUEUE_DELAYED)
      dequeue_entities(se = &p->se)
        dequeue_entity(se)
          se->on_rq = 0; /* se->on_rq turns 0 here */
        ...
        if (cfs_rq_throttled(cfs_rq))
          return 0; /* Early return before __block_task() */
  ...

  /* __block_task() not called; task_on_rq_queued() returns true. */
  queued = task_on_rq_queued(p);
  ...

  if (queued)
    dequeue_task(rq, p, queue_flag)
      dequeue_entities(se = &p->se)
        dequeue_entity(se)
          update_entity_lag(se)
            WARN_ON_ONCE(!se->on_rq)
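
To make the state divergence above easier to follow, here is a rough
userspace toy model in C. It is not kernel code: the toy_* names and
structures are hypothetical stand-ins, it assumes the class-change
condition (prev_class != next_class) is already true, and it only
tracks the two on_rq flags and the sched_delayed bit from the trace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for sched_entity / task_struct; only the
 * fields the trace above mentions are modelled. */
struct toy_se   { int on_rq; bool sched_delayed; };
struct toy_task { int on_rq; struct toy_se se; };

/* Models dequeue_entity(): returns 1 where update_entity_lag() would
 * hit WARN_ON_ONCE(!se->on_rq), then marks the entity dequeued. */
static int toy_dequeue_entity(struct toy_se *se)
{
	int warned = !se->on_rq;

	se->on_rq = 0;
	se->sched_delayed = false;
	return warned;
}

/* Models the two dequeue attempts in the trace; returns the number of
 * simulated warnings. */
static int toy_setprio(struct toy_task *p, bool throttled)
{
	int warned = 0;

	if (p->se.sched_delayed) {
		warned += toy_dequeue_entity(&p->se);
		if (!throttled)
			p->on_rq = 0;	/* stands in for __block_task() */
		/* throttled: early return, p->on_rq stays set */
	}

	if (p->on_rq)			/* task_on_rq_queued(p) */
		warned += toy_dequeue_entity(&p->se);

	return warned;
}

/* Walks both cases: a throttled hierarchy warns once, an unthrottled
 * one does not. */
static void toy_demo(void)
{
	struct toy_task a = { .on_rq = 1, .se = { .on_rq = 1, .sched_delayed = true } };
	struct toy_task b = a;

	assert(toy_setprio(&a, true) == 1);	/* p->on_rq=1, se->on_rq=0 */
	assert(toy_setprio(&b, false) == 0);	/* both flags cleared */
}
```

Calling toy_demo() from a main() exercises both paths. In the
throttled case the task is left with p->on_rq set and se->on_rq
clear, which is the combination Peter described as the only way to
reach the warning.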


v6.18 kernels will get rid of this issue as part of the per-task throttle
feature, and stable should pick up the fix for the same on that thread soon.

-- 
Thanks and Regards,
Prateek