linux-kernel - [PATCH AUTOSEL 6.17] sched_ext: Fix possible deadlock in the deferred_irq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20251124080644.3871678-2-sashal@kernel.org>
Date: Mon, 24 Nov 2025 03:06:16 -0500
From: Sasha Levin <sashal@...nel.org>
To: patches@...ts.linux.dev,
	stable@...r.kernel.org
Cc: Zqiang <qiang.zhang@...ux.dev>,
	Tejun Heo <tj@...nel.org>,
	Sasha Levin <sashal@...nel.org>,
	mingo@...hat.com,
	peterz@...radead.org,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	bigeasy@...utronix.de,
	clrkwllms@...nel.org,
	rostedt@...dmis.org,
	sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	linux-rt-devel@...ts.linux.dev
Subject: [PATCH AUTOSEL 6.17] sched_ext: Fix possible deadlock in the deferred_irq_workfn()

From: Zqiang <qiang.zhang@...ux.dev>

[ Upstream commit a257e974210320ede524f340ffe16bf4bf0dda1e ]

For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in
the per-cpu irq_work/* task context and not disable-irq, if the rq
returned by container_of() is current CPU's rq, the following scenarios
may occur:

lock(&rq->__lock);
<Interrupt>
  lock(&rq->__lock);

This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to
initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn()
is always invoked in hard-irq context.

Signed-off-by: Zqiang <qiang.zhang@...ux.dev>
Signed-off-by: Tejun Heo <tj@...nel.org>
Signed-off-by: Sasha Levin <sashal@...nel.org>
---

LLM Generated explanations, may be completely bogus:

## Comprehensive Analysis

### 1. Commit Message and Problem Analysis
The commit addresses a **critical deadlock** scenario in the `sched_ext`
(BPF extensible scheduler) subsystem.
- **The Issue:** On `CONFIG_PREEMPT_RT` (Real-Time) kernels, `irq_work`
  items initialized with `init_irq_work()` default to running in a per-
  CPU thread context with interrupts enabled. The work function
  `deferred_irq_workfn()` acquires the runqueue lock
  (`raw_spin_rq_lock(rq)`). If an interrupt occurs while this lock is
  held, and the interrupt handler also attempts to acquire `rq->__lock`
  (a standard scheduler pattern), a deadlock occurs (A-A deadlock).
- **The Fix:** The commit changes the initialization of
  `deferred_irq_work` to use `IRQ_WORK_INIT_HARD()`. This forces the
  work function to execute in **hard interrupt context** (with
  interrupts disabled), preventing the nested interrupt scenario that
  causes the deadlock.

### 2. Deep Code Research & Verification
- **Subsystem Context:** `sched_ext` was merged in Linux v6.12. The
  buggy code exists in all stable kernels starting from v6.12.y up to
  the current v6.17.y. Older LTS kernels (6.6.y, 6.1.y) do not contain
  `sched_ext` and are unaffected.
- **Code Mechanics:**
  - **Buggy Code:** `init_irq_work(&rq->scx.deferred_irq_work,
    deferred_irq_workfn);` relies on defaults which are unsafe for this
    locking pattern on PREEMPT_RT.
  - **Corrected Code:** `rq->scx.deferred_irq_work =
    IRQ_WORK_INIT_HARD(deferred_irq_workfn);` explicitly sets the
    `IRQ_WORK_HARD_IRQ` flag.
  - **Precedent:** This pattern is well-established in the scheduler
    core (e.g., `rto_push_work` in `kernel/sched/topology.c` uses
    `IRQ_WORK_INIT_HARD` for the exact same reason).
- **Correctness:** `deferred_irq_workfn` calls `run_deferred`, which
  uses `raw_spin_rq_lock`. These locks are safe to take in hard-irq
  context. The fix is technically sound and adheres to locking rules.

### 3. Stable Kernel Rules Evaluation
- **Fixes a real bug?** **Yes.** It fixes a reproducible deadlock that
  causes system hangs.
- **Important issue?** **Yes.** Deadlocks are critical failures,
  especially on Real-Time systems where reliability is paramount.
- **Obviously correct?** **Yes.** The fix is a one-line change using a
  standard kernel macro specifically designed for this purpose.
- **Small and contained?** **Yes.** One line changed, no external
  dependencies.
- **No new features?** **Yes.** This is a pure bug fix for existing
  functionality.

### 4. Risk Assessment
- **Regression Risk:** **Very Low.** The change only affects the
  execution context of the work item. On non-RT kernels, `irq_work`
  often runs in hard-irq context anyway, so the behavior change is
  minimal. On RT kernels, this restores the required behavior to prevent
  deadlocks.
- **User Impact:** Users running `sched_ext` on Real-Time kernels are at
  risk of random system freezes without this fix.

### Conclusion
This commit is a textbook example of stable-material. It fixes a severe
bug (deadlock) in a supported feature (`sched_ext`) using a minimal,
well-understood solution. While it lacks a "Cc: stable" tag, the nature
of the bug (deadlock) and the surgical nature of the fix make it a
mandatory backport for all stable trees containing `sched_ext` (v6.12+).

**YES**

 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e1b502ef1243c..fa64fdb6e9796 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5280,7 +5280,7 @@ void __init init_sched_ext_class(void)
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
-		init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn);
+		rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
 		init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
 
 		if (cpu_online(cpu))
-- 
2.51.0