lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251006181835.1919496-22-sashal@kernel.org>
Date: Mon,  6 Oct 2025 14:17:54 -0400
From: Sasha Levin <sashal@...nel.org>
To: patches@...ts.linux.dev,
	stable@...r.kernel.org
Cc: Aaron Lu <ziqianlu@...edance.com>,
	"Peter Zijlstra (Intel)" <peterz@...radead.org>,
	Sasha Levin <sashal@...nel.org>,
	mingo@...hat.com,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH AUTOSEL 6.17] sched/fair: update_cfs_group() for throttled cfs_rqs

From: Aaron Lu <ziqianlu@...edance.com>

[ Upstream commit fcd394866e3db344cbe0bb485d7e3f741ac07245 ]

With task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, PELT clock is not stopped for these cfs_rqs in throttled
hierarchy when they still have tasks running or queued.

Since PELT clock is not stopped, whether to allow update_cfs_group()
doing its job for cfs_rqs which are in throttled hierarchy but still
have tasks running/queued is a question.

The good side is, continue to run update_cfs_group() can get these
cfs_rq entities with an up2date weight and that up2date weight can be
useful to derive an accurate load for the CPU as well as ensure fairness
if multiple tasks of different cgroups are running on the same CPU.
OTOH, as Benjamin Segall pointed: when unthrottle comes around the most
likely correct distribution is the distribution we had at the time of
throttle.

In reality, either way may not matter that much if tasks in throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, let these cfs_rq entities have an up2date weight seems a good
thing to do.

Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
Signed-off-by: Sasha Levin <sashal@...nel.org>
---

LLM Generated explanations, may be completely bogus:

## Backport Status: YES (with critical dependencies)

### Executive Summary

This commit **should be backported** to stable kernel trees, but **only
as part of the complete task-based throttle model series**. In fact,
this commit has **already been backported** to Linux 6.17 stable tree
(commit 4ee521004f8ef) by stable maintainer Sasha Levin, which is strong
evidence of its suitability for stable backporting.

### Code Change Analysis

The commit makes a minimal code change in `kernel/sched/fair.c`,
removing 3 lines from `update_cfs_group()`:

```c
- if (throttled_hierarchy(gcfs_rq))
- return;
-
```

This removal allows `update_cfs_group()` to continue updating group
entity weights even for cgroups in throttled hierarchies. Previously,
line 3960-3961 would cause an early return, preventing weight
recalculation for any throttled cfs_rq.

### Context and Dependencies

**Critical Finding**: This commit is **NOT standalone**. It is part 5 of
a 7-commit series implementing the task-based throttle model:

1. **e1fad12dcb66b** - "Switch to task based throttle model" (341 line
   change - the base)
2. **eb962f251fbba** - "Task based throttle time accounting"
3. **5b726e9bf9544** - "Get rid of throttled_lb_pair()"
4. **fe8d238e646e1** - "Propagate load for throttled cfs_rq"
5. **fcd394866e3db** - "update_cfs_group() for throttled cfs_rqs" ←
   **This commit**
6. **253b3f5872419** - "Do not special case tasks in throttled
   hierarchy" (follow-up fix)
7. **0d4eaf8caf8cd** - "Do not balance task to a throttled cfs_rq"
   (follow-up performance fix)

All 7 commits were backported together to Linux 6.17 stable tree.

### Why This Change Is Necessary

Under the **old throttle model**: When a cfs_rq was throttled, its
entity was dequeued from the CPU's runqueue, preventing all tasks from
running. The PELT clock stopped, so updating group weights was
unnecessary and prevented by the `throttled_hierarchy()` check at line
3960.

Under the **new task-based throttle model** (introduced by commit
e1fad12dcb66b):
- Tasks in throttled hierarchies **continue running if in kernel mode**
- PELT clock **remains active** while throttled tasks still run/queue
- The `throttled_hierarchy()` check at line 3960 becomes **incorrect** -
  it prevents weight updates even though PELT is still running

**The fix**: Remove lines 3960-3961 to allow `calc_group_shares()` (line
3963) and `reweight_entity()` (line 3965) to execute, giving throttled
cfs_rq entities up-to-date weights for accurate CPU load calculation and
cross-cgroup fairness.

### Benefits and Trade-offs

**Benefits** (from commit message):
- Up-to-date weights enable accurate CPU load derivation
- Ensures fairness when multiple tasks from different cgroups run on
  same CPU
- Prevents stale weight values during extended kernel-mode execution

**Trade-offs** (acknowledged in commit):
- As Benjamin Segall noted: "the most likely correct distribution is the
  distribution we had at the time of throttle"
- May not matter much if tasks don't run in kernel mode for long periods
- Performance tuning was needed (see follow-up commit 0d4eaf8caf8cd
  which addresses hackbench regression by preventing load balancing to
  throttled cfs_rqs)

### What Problems Does This Solve?

The base task-based throttle model (e1fad12dcb66b) solves a **real
bug**: With the old model, a task holding a percpu_rwsem as reader in a
throttled cgroup couldn't run until the next period, causing:
- Writers waiting longer
- Reader build-up
- **Task hung warnings**

This specific commit ensures the new model works correctly by keeping
weight calculations accurate during kernel-mode execution of throttled
tasks.

### Risk Assessment

**Low to Medium Risk** for the following reasons:

**Mitigating factors**:
- Small code change (3 lines removed)
- Already backported to 6.17 stable by experienced maintainer
- Well-tested by multiple developers (Valentin Schneider, Chen Yu,
  Matteo Martelli, K Prateek Nayak)
- Part of thoroughly reviewed patch series linked at
  https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com

**Risk factors**:
- Modifies core scheduler behavior in subtle ways
- Requires entire series (cannot be cherry-picked alone)
- Follow-up performance fixes needed (commit 0d4eaf8caf8cd mentions AMD
  Genoa performance degradation with hackbench that required additional
  checks)
- Affects PELT weight calculations during throttling edge cases

**No evidence of**:
- Reverts
- CVE assignments
- Major regression reports
- Security implications

### Backporting Requirements

If backporting to stable trees **without** the task-based throttle
model:

**DO NOT BACKPORT** - This commit will break things. The
`throttled_hierarchy()` check at line 3960 exists for a reason in the
old throttle model where PELT clocks stop on throttle.

If backporting to stable trees **with** the task-based throttle model:

**MUST BACKPORT** as part of the complete series:
1. Base commit e1fad12dcb66b (341 lines - major change)
2. Commits eb962f251fbba, 5b726e9bf9544, fe8d238e646e1
3. **This commit** (fcd394866e3db)
4. **Follow-up fixes** 253b3f5872419 and 0d4eaf8caf8cd

### Stable Tree Rules Compliance

- ✅ **Fixes important bugs**: Yes (task hung due to percpu_rwsem
  interactions)
- ✅ **Relatively small change**: Yes for this commit (3 lines), but
  series is large
- ✅ **Minimal side effects**: When backported with complete series
- ❌ **No major architectural changes**: No - this IS part of a major
  architectural change
- ✅ **Clear benefits**: Yes - prevents task hangs, improves fairness
- ⚠️ **Explicit stable tag**: No "Cc: stable" tag, but manually selected
  by stable maintainer
- ✅ **Minimal regression risk**: When backported with complete series
  and follow-ups

### Recommendation

**YES - Backport this commit**, with the following requirements:

1. **MUST include the entire task-based throttle series** (commits 1-7
   listed above)
2. **MUST include follow-up performance fixes** (especially
   0d4eaf8caf8cd)
3. **Target kernel version**: 6.17+ (already done) or newer LTS versions
   planning major scheduler updates
4. **Not suitable for**: Older stable trees without appetite for the
   341-line base architectural change

The fact that Sasha Levin already backported this entire series to 6.17
stable is the strongest indicator this is appropriate for stable
backporting.

 kernel/sched/fair.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f9..eea0b6571af5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
 	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
-	if (throttled_hierarchy(gcfs_rq))
-		return;
-
 	shares = calc_group_shares(gcfs_rq);
 	if (unlikely(se->load.weight != shares))
 		reweight_entity(cfs_rq_of(se), se, shares);
-- 
2.51.0


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ