Message-ID: <20260120211725.124349-1-sunlightlinux@gmail.com>
Date: Tue, 20 Jan 2026 23:17:24 +0200
From: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@...il.com>
To: rafael@...nel.org
Cc: ionut_n2001@...oo.com,
daniel.lezcano@...aro.org,
christian.loehle@....com,
linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
From: Ionut Nechita <ionut_n2001@...oo.com>
Hi,
This patch addresses a wakeup latency problem in the menu cpuidle governor
that affects modern Intel server platforms (Sapphire Rapids, Granite Rapids,
and newer).
== Problem Description ==
On Intel server platforms from 2022 onwards, we observe excessive wakeup
latencies (~150us) in network-sensitive workloads when using the menu
governor with NOHZ_FULL enabled.
Measurement with qperf tcp_lat shows:
- Sapphire Rapids (SPR): 151us latency
- Ice Lake (ICL): 12us latency
- Skylake (SKL): 21us latency
The ~12x latency increase on SPR compared to Ice Lake is unacceptable for
latency-sensitive applications (HPC, real-time, financial trading, etc.).
== Root Cause ==
The issue stems from menu.c:294-295:
	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
		predicted_ns = data->next_timer_ns;
When the tick is already stopped and the predicted idle duration is shorter
than one tick period (here <2ms), the governor falls back to using
next_timer_ns directly (often 10ms or more), which causes very deep package
C-states (PC6) to be selected.
Modern server platforms have significantly longer C-state exit latencies
due to architectural changes:
- Tile-based architecture with per-tile power gating
- DDR5 power management overhead
- CXL link restoration
- Complex mesh interconnect resynchronization
When a network packet arrives after 500us but the governor selected PC6
based on a 10ms timer, the 150us exit latency dominates the response time.
On older platforms (Ice Lake, Skylake) with faster C-state transitions
(12-21us), this issue was less noticeable, but SPR's tile architecture
makes it critical.
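To make the effect concrete, here is a minimal user-space sketch (not kernel
code) of the residency-based selection rule: the governor picks the deepest
state whose target residency fits within predicted_ns. The state names and
residency values below are hypothetical placeholders, not real cpuidle table
entries:

	#include <stdio.h>
	#include <stdint.h>
	#include <stddef.h>

	struct state {
		const char *name;
		uint64_t target_residency_ns;
	};

	/* Hypothetical table; real values come from the platform's cpuidle driver. */
	static const struct state states[] = {
		{ "C1",     2000   },
		{ "C1E",    20000  },
		{ "C6/PC6", 600000 },
	};

	/* Deepest state whose target residency still fits the predicted idle time. */
	static const struct state *select_state(uint64_t predicted_ns)
	{
		const struct state *best = &states[0];
		size_t i;

		for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
			if (states[i].target_residency_ns <= predicted_ns)
				best = &states[i];
		return best;
	}

	int main(void)
	{
		/* The workload actually wakes after ~500us ... */
		printf("based on the 500us prediction: %s\n", select_state(500000)->name);
		/* ... but with the prediction replaced by a 10ms timer, C6/PC6 is chosen. */
		printf("based on the 10ms timer:       %s\n", select_state(10000000)->name);
		return 0;
	}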
== Solution ==
Instead of using next_timer_ns directly (100% timer-based), add a 25% safety
margin to the prediction and clamp the result to next_timer_ns (a sketch of
the resulting hunk follows the lists below):

	predicted_ns = min(predicted_ns + (predicted_ns >> 2), data->next_timer_ns);
This provides:
- Conservative prediction (avoids too-shallow states)
- Protection against excessively deep states (clamped to timer)
- Platform-agnostic solution (no hardcoded thresholds)
- Minimal overhead (one shift, one add, one min)
The 25% margin (>> 2 = divide by 4) was chosen as a balance between:
- Too small (10%): Insufficient protection on high-latency platforms
- Too large (50%): Overly conservative, may hurt power efficiency
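As a rough sketch of the idea (not necessarily the exact diff in patch 1/1),
the hunk in menu_select() would change along these lines:

	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC) {
		/*
		 * Do not trust next_timer_ns outright: add a 25% margin
		 * to the prediction and use the next timer event only as
		 * an upper bound.
		 */
		predicted_ns = min(predicted_ns + (predicted_ns >> 2),
				   data->next_timer_ns);
	}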
== Results ==
Testing on Sapphire Rapids with qperf tcp_lat:
- Before: 151us average latency
- After: ~30us average latency
- Improvement: 5x latency reduction
Testing on Ice Lake and Skylake shows minimal impact:
- Ice Lake: 12us → 12us (no regression)
- Skylake: 21us → 21us (no regression)
Power efficiency testing shows <1% difference in package power consumption
during mixed workloads, well within measurement noise.
== Examples ==
Short prediction (500us), timer at 10ms:
- Before: predicted_ns = 10ms → selects PC6 → 151us wakeup
- After: predicted_ns = min(625us, 10ms) = 625us → selects C1E → 15us wakeup
Long prediction (1800us), timer at 2ms:
- Before: predicted_ns = 2ms → selects C6
- After: predicted_ns = min(2250us, 2ms) = 2ms → selects C6 (same state)
The algorithm naturally adapts to workload characteristics without
platform-specific tuning.
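The arithmetic in these examples can be reproduced with a trivial stand-alone
C calculation (user space, not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	/* Proposed adjustment: 25% margin on the prediction, clamped to the next timer. */
	static uint64_t adjust(uint64_t predicted_ns, uint64_t next_timer_ns)
	{
		uint64_t with_margin = predicted_ns + (predicted_ns >> 2);

		return with_margin < next_timer_ns ? with_margin : next_timer_ns;
	}

	int main(void)
	{
		/* Short prediction (500us), timer at 10ms -> 625000 (625us) */
		printf("%llu\n", (unsigned long long)adjust(500000, 10000000));
		/* Long prediction (1800us), timer at 2ms -> 2000000 (clamped to the timer) */
		printf("%llu\n", (unsigned long long)adjust(1800000, 2000000));
		return 0;
	}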
Ionut Nechita (1):
cpuidle: menu: Add 25% safety margin to short predictions when tick is
stopped
drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
--
2.52.0