linux-kernel - [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20260122080937.22347-4-sunlightlinux@gmail.com>
Date: Thu, 22 Jan 2026 10:09:39 +0200
From: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@...il.com>
To: rafael@...nel.org
Cc: daniel.lezcano@...aro.org,
	christian.loehle@....com,
	linux-pm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	yumpusamongus@...il.com,
	Ionut Nechita <ionut_n2001@...oo.com>,
	stable@...r.kernel.org
Subject: [PATCH v2 1/1] cpuidle: menu: Use min() to prevent deep C-states when tick is stopped

From: Ionut Nechita <ionut_n2001@...oo.com>

When the tick is already stopped and the predicted idle duration is short
(< TICK_NSEC), the original code uses next_timer_ns directly. This can
lead to selecting excessively deep C-states when the actual idle duration
is much shorter than the next timer event.

On modern Intel server platforms (Sapphire Rapids and newer), deep package
C-states can have exit latencies of 150-190us due to:
- Tile-based architecture with per-tile power gating
- DDR5 and CXL power management overhead
- Complex mesh interconnect resynchronization

When a network packet arrives after 500us but the governor selected a deep
C-state (PC6) based on a 10ms timer, the high exit latency (150us+)
dominates the response time.

Use the minimum of predicted_ns and next_timer_ns instead of using
next_timer_ns directly. This avoids selecting unnecessarily deep states
when the prediction is short but the next timer is distant, while still
being conservative enough to prevent getting stuck in shallow states for
extended periods.

Testing on Sapphire Rapids with qperf tcp_lat shows:
- Before: 151us average latency (frequent PC6 entry)
- After: ~30us average latency (avoids PC6 on short predictions)
- Improvement: 5x latency reduction

The fix is platform-agnostic and benefits other platforms with high
C-state exit latencies. Testing on systems with large C-state gaps
(e.g., C2 at 36us → C3 at 700us with 350us latency) shows similar
improvements in avoiding deep state selection for short idle periods.

Power efficiency testing shows minimal impact (<1% difference in package
power consumption during mixed workloads), well within measurement noise.

Cc: stable@...r.kernel.org
Signed-off-by: Ionut Nechita <ionut_n2001@...oo.com>
---
 drivers/cpuidle/governors/menu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 64d6f7a1c776..199eac2a1849 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,12 +287,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	/*
 	 * If the tick is already stopped, the cost of possible short idle
 	 * duration misprediction is much higher, because the CPU may be stuck
-	 * in a shallow idle state for a long time as a result of it.  In that
-	 * case, say we might mispredict and use the known time till the closest
-	 * timer event for the idle state selection.
+	 * in a shallow idle state for a long time as a result of it.
+	 *
+	 * Instead of using next_timer_ns directly (which could be very large,
+	 * e.g., 10ms), use the minimum of the prediction and the timer. This
+	 * prevents selecting excessively deep C-states when the prediction
+	 * suggests a short idle period, while still clamping to next_timer_ns
+	 * to avoid unnecessarily shallow states.
 	 */
 	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
-		predicted_ns = data->next_timer_ns;
+		predicted_ns = min(predicted_ns, data->next_timer_ns);

 	/*
 	 * Find the idle state with the lowest power while satisfying
-- 
2.52.0