Message-ID: <eac3541b-f22f-4cd9-a31e-4841e4fad5a1@arm.com>
Date: Wed, 21 Jan 2026 11:49:19 +0000
From: Christian Loehle <christian.loehle@....com>
To: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@...il.com>,
rafael@...nel.org
Cc: ionut_n2001@...oo.com, daniel.lezcano@...aro.org,
linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/1] cpuidle: menu: Fix high wakeup latency on modern
Intel server platforms

On 1/20/26 21:17, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@...oo.com>
>
> Hi,
Hi Ionut,
>
> This patch addresses a performance regression in the menu cpuidle governor
> affecting modern Intel server platforms (Sapphire Rapids, Granite Rapids,
> and newer).
I'll take a look at the patch later, but just to be clear, this isn't a
performance regression, right? There's no kernel version that behaved
better here, is there?
If there is, it needs to be stated, and a Fixes tag may be applicable.
>
> == Problem Description ==
>
> On Intel server platforms from 2022 onwards, we observe excessive wakeup
> latencies (~150us) in network-sensitive workloads when using the menu
> governor with NOHZ_FULL enabled.
>
> Measurement with qperf tcp_lat shows:
> - Sapphire Rapids (SPR): 151us latency
> - Ice Lake (ICL): 12us latency
> - Skylake (SKL): 21us latency
>
> The 12x latency regression on SPR compared to Ice Lake is unacceptable for
> latency-sensitive applications (HPC, real-time, financial trading, etc.).
So this is just a newer generation having higher latency.
TBF the applications you mentioned should really keep their latency
requirements under control themselves and not rely on menu guesstimating
what's needed here.
>
> == Root Cause ==
>
> The issue stems from menu.c:294-295:
>
>     if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
>             predicted_ns = data->next_timer_ns;
>
> When the tick is already stopped and the predicted idle duration is short
> (<2ms), the governor switches to using next_timer_ns directly (often
> 10ms+). This causes the selection of very deep package C-states (PC6).
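Just to make sure I follow the mechanism, the effect is roughly the toy model
below; this is not the real menu code, and the residency values are invented
stand-ins rather than your actual state tables:

    #include <stdio.h>

    /* Invented stand-in values, not a real SPR table. */
    struct state {
            const char *name;
            long long target_residency_ns;
    };

    static const struct state states[] = {
            { "C1",     2000 },
            { "C6",   120000 },
            { "PC6",  600000 },
    };

    /* Deepest state whose target residency still fits the predicted idle time. */
    static const char *pick(long long predicted_ns)
    {
            const char *best = states[0].name;

            for (unsigned int i = 0; i < sizeof(states) / sizeof(states[0]); i++)
                    if (states[i].target_residency_ns <= predicted_ns)
                            best = states[i].name;
            return best;
    }

    int main(void)
    {
            /* The governor's own estimate for the upcoming idle period... */
            printf("predicted 500us -> %s\n", pick(500LL * 1000));
            /* ...versus the same decision after the next_timer_ns override. */
            printf("timer at 10ms   -> %s\n", pick(10LL * 1000 * 1000));
            return 0;
    }

So the override, not the actual wakeup pattern, is what pushes the pick into
the package state here.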
>
> Modern server platforms have significantly longer C-state exit latencies
> due to architectural changes:
> - Tile-based architecture with per-tile power gating
> - DDR5 power management overhead
> - CXL link restoration
> - Complex mesh interconnect resynchronization
>
> When a network packet arrives after 500us but the governor selected PC6
> based on a 10ms timer, the 150us exit latency dominates the response time.
>
> On older platforms (Ice Lake, Skylake) with faster C-state transitions
> (12-21us), this issue was less noticeable, but SPR's tile architecture
> makes it critical.
> [snip]
Can you provide idle state tables with residencies and usage?
Ideally idle misses for both as well?
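I.e. something along the lines of the sketch below, per platform (cpu0 only
for brevity, reading the standard cpuidle sysfs attributes; above/below are
what I mean by idle misses):

    #include <stdio.h>
    #include <string.h>

    /* Print one sysfs attribute of cpu0's idle state, minus the newline. */
    static int show(int state, const char *attr)
    {
            char path[128], buf[64];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu0/cpuidle/state%d/%s", state, attr);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (fgets(buf, sizeof(buf), f)) {
                    buf[strcspn(buf, "\n")] = '\0';
                    printf("%-12s", buf);
            }
            fclose(f);
            return 0;
    }

    int main(void)
    {
            printf("%-12s%-12s%-12s%-12s%-12s%-12s\n",
                   "name", "residency", "latency", "usage", "above", "below");
            for (int s = 0; show(s, "name") == 0; s++) {
                    show(s, "residency");
                    show(s, "latency");
                    show(s, "usage");
                    show(s, "above");
                    show(s, "below");
                    printf("\n");
            }
            return 0;
    }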
Thanks!