Message-ID: <20260120211725.124349-1-sunlightlinux@gmail.com>
Date: Tue, 20 Jan 2026 23:17:24 +0200
From: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@...il.com>
To: rafael@...nel.org
Cc: ionut_n2001@...oo.com,
daniel.lezcano@...aro.org,
christian.loehle@....com,
linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern Intel server platforms
From: Ionut Nechita <ionut_n2001@...oo.com>
Hi,
This patch addresses a wakeup latency problem in the menu cpuidle governor
that affects modern Intel server platforms (Sapphire Rapids, Granite Rapids,
and newer).
== Problem Description ==
On Intel server platforms from 2022 onwards, we observe excessive wakeup
latencies (~150us) in network-sensitive workloads when using the menu
governor with NOHZ_FULL enabled.
Measurement with qperf tcp_lat shows:
- Sapphire Rapids (SPR): 151us latency
- Ice Lake (ICL): 12us latency
- Skylake (SKL): 21us latency
The ~12x latency increase on SPR compared to Ice Lake is unacceptable for
latency-sensitive applications (HPC, real-time, financial trading, etc.).
== Root Cause ==
The issue stems from menu.c:294-295:
	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
		predicted_ns = data->next_timer_ns;
When the tick is already stopped and the predicted idle duration is shorter
than one tick period (here <2ms), the governor falls back to using
next_timer_ns directly (often 10ms or more), which causes very deep package
C-states (PC6) to be selected.
Modern server platforms have significantly longer C-state exit latencies
due to architectural changes:
- Tile-based architecture with per-tile power gating
- DDR5 power management overhead
- CXL link restoration
- Complex mesh interconnect resynchronization
When a network packet arrives after 500us but the governor selected PC6
based on a 10ms timer, the 150us exit latency dominates the response time.
On older platforms (Ice Lake, Skylake) with faster C-state transitions
(12-21us), this issue was less noticeable, but SPR's tile architecture
makes it critical.
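To make the effect concrete, here is a minimal user-space sketch (not kernel
code) of the residency-based selection rule: the governor picks the deepest
state whose target residency fits within predicted_ns. The state names and
residency values below are hypothetical placeholders, not real cpuidle table
entries:

	#include <stdio.h>
	#include <stdint.h>
	#include <stddef.h>

	struct state {
		const char *name;
		uint64_t target_residency_ns;
	};

	/* Hypothetical table; real values come from the platform's cpuidle driver. */
	static const struct state states[] = {
		{ "C1",     2000   },
		{ "C1E",    20000  },
		{ "C6/PC6", 600000 },
	};

	/* Deepest state whose target residency still fits the predicted idle time. */
	static const struct state *select_state(uint64_t predicted_ns)
	{
		const struct state *best = &states[0];
		size_t i;

		for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
			if (states[i].target_residency_ns <= predicted_ns)
				best = &states[i];
		return best;
	}

	int main(void)
	{
		/* The workload actually wakes after ~500us ... */
		printf("based on the 500us prediction: %s\n", select_state(500000)->name);
		/* ... but with the prediction replaced by a 10ms timer, C6/PC6 is chosen. */
		printf("based on the 10ms timer:       %s\n", select_state(10000000)->name);
		return 0;
	}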
== Solution ==
Instead of using next_timer_ns directly (100% timer-based), add a 25% safety
margin to the prediction and clamp the result to next_timer_ns (a sketch of
the resulting hunk follows the lists below):

	predicted_ns = min(predicted_ns + (predicted_ns >> 2), data->next_timer_ns);
This provides:
- Conservative prediction (avoids too-shallow states)
- Protection against excessively deep states (clamped to timer)
- Platform-agnostic solution (no hardcoded thresholds)
- Minimal overhead (one shift, one add, one min)
The 25% margin (>> 2 = divide by 4) was chosen as a balance between:
- Too small (10%): Insufficient protection on high-latency platforms
- Too large (50%): Overly conservative, may hurt power efficiency
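As a rough sketch of the idea (not necessarily the exact diff in patch 1/1),
the hunk in menu_select() would change along these lines:

	if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC) {
		/*
		 * Do not trust next_timer_ns outright: add a 25% margin
		 * to the prediction and use the next timer event only as
		 * an upper bound.
		 */
		predicted_ns = min(predicted_ns + (predicted_ns >> 2),
				   data->next_timer_ns);
	}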
== Results ==
Testing on Sapphire Rapids with qperf tcp_lat:
- Before: 151us average latency
- After: ~30us average latency
- Improvement: 5x latency reduction
Testing on Ice Lake and Skylake shows minimal impact:
- Ice Lake: 12us → 12us (no regression)
- Skylake: 21us → 21us (no regression)
Power efficiency testing shows <1% difference in package power consumption
during mixed workloads, well within measurement noise.
== Examples ==
Short prediction (500us), timer at 10ms:
- Before: predicted_ns = 10ms → selects PC6 → 151us wakeup
- After: predicted_ns = min(625us, 10ms) = 625us → selects C1E → 15us wakeup
Long prediction (1800us), timer at 2ms:
- Before: predicted_ns = 2ms → selects C6
- After: predicted_ns = min(2250us, 2ms) = 2ms → selects C6 (same state)
The algorithm naturally adapts to workload characteristics without
platform-specific tuning.
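The arithmetic in these examples can be reproduced with a trivial stand-alone
C calculation (user space, not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	/* Proposed adjustment: 25% margin on the prediction, clamped to the next timer. */
	static uint64_t adjust(uint64_t predicted_ns, uint64_t next_timer_ns)
	{
		uint64_t with_margin = predicted_ns + (predicted_ns >> 2);

		return with_margin < next_timer_ns ? with_margin : next_timer_ns;
	}

	int main(void)
	{
		/* Short prediction (500us), timer at 10ms -> 625000 (625us) */
		printf("%llu\n", (unsigned long long)adjust(500000, 10000000));
		/* Long prediction (1800us), timer at 2ms -> 2000000 (clamped to the timer) */
		printf("%llu\n", (unsigned long long)adjust(1800000, 2000000));
		return 0;
	}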
Ionut Nechita (1):
cpuidle: menu: Add 25% safety margin to short predictions when tick is
stopped
drivers/cpuidle/governors/menu.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
--
2.52.0