Message-ID: <1e870bd8-0e13-4e0d-bbe9-e9f601f59a50@arm.com>
Date: Sun, 23 Mar 2025 09:18:30 +0000
From: Christian Loehle <christian.loehle@....com>
To: "King, Colin" <colin.king@...el.com>, Jens Axboe <axboe@...nel.dk>,
 "Rafael J. Wysocki" <rafael@...nel.org>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
 "linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
 fast I/O devices

On 3/17/25 10:03, King, Colin wrote:
> Hi Christian, 
> 
> Follow-up below:
> 
>> -----Original Message-----
>> From: Christian Loehle <christian.loehle@....com>
>> Sent: 03 March 2025 22:25
>> To: King, Colin <colin.king@...el.com>; Jens Axboe <axboe@...nel.dk>; Rafael
>> J. Wysocki <rafael@...nel.org>; Daniel Lezcano <daniel.lezcano@...aro.org>;
>> linux-block@...r.kernel.org; linux-pm@...r.kernel.org
>> Cc: linux-kernel@...r.kernel.org
>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
>> fast I/O devices
>>
>> On 3/3/25 16:43, Colin Ian King wrote:
>>> Modern processors can drop into deep sleep states relatively quickly
>>> to save power. However, coming out of deep sleep states takes a small
>>> amount of time and this is detrimental to performance for I/O devices
>>> such as fast PCIe NVME drives when servicing completed I/O
>>> transactions.
>>>
>>> Testing with fio with read/write RAID0 PCIe NVME devices on various
>>> modern SMP based systems (such as a 96 thread Granite Rapids Xeon 6741P)
>>> has shown that 85-90% of read/write transactions issued on a CPU
>>> are completed by the same CPU, so it makes some sense to prevent the
>>> CPU from dropping into a deep sleep state to help reduce I/O handling
>>> latency.
>>
>> For the platform you tested on that may be true, but even if we constrain
>> ourselves to pci-nvme there's a variety of queue/irq mappings where this
>> doesn't hold I'm afraid.
> 
> This code is optional; one can enable or disable it via the config option. Also,
> even when it is built in, one can disable it by writing 0 to the sysfs file
>   /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> 
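For reference, with the patch applied that is just a write to the quoted sysfs
attribute; a minimal Python sketch, assuming the attribute exists and this runs
as root:

# Hypothetical helper around the sysfs knob quoted above; the file only
# exists with the psd patch applied, and writing it requires root.
PSD_KNOB = "/sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms"

def set_psd_timeout_ms(ms: int) -> None:
    """Write a timeout in milliseconds; 0 disables the sleep-demotion prevention."""
    with open(PSD_KNOB, "w") as f:
        f.write(f"{ms}\n")

if __name__ == "__main__":
    set_psd_timeout_ms(0)   # disable, as described above
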
>>
>>>
>>> This commit introduces a simple, lightweight and fast power sleep
>>> demotion mechanism that provides the block layer a way to inform the
>>> menu governor to prevent a CPU from going into a deep sleep when an
>>> I/O operation is requested. While it is true that some I/Os may not
>>
>> s/requested/completed is the full truth, isn't it?
>>
>>> be serviced on the same CPU that issued the I/O request, and hence the
>>> mechanism is not 100% perfect, it does work well for the vast majority
>>> of I/O operations, and the sleep demotion prevention adds very little
>>> overhead.
>>>
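Just to make sure I'm reading the mechanism right, here is a rough user-space
model of the logic as described above (the block layer bumps a per-CPU "last
I/O" timestamp, and the menu governor refuses deep states while that timestamp
is fresher than the timeout knob). The names and the default value below are
illustrative, not the patch's actual code:

import time

# Illustrative model only -- not the kernel implementation. It mimics the idea
# described above: the block layer records when a CPU last handled an I/O, and
# the governor avoids deep idle states while that record is fresher than the
# psd_cpu_lat_timeout_ms knob.

psd_cpu_lat_timeout_ms = 3       # hypothetical default; the real value is the sysfs knob
last_io_ns: dict[int, int] = {}  # per-CPU timestamp of the last I/O event

def psd_note_io(cpu: int) -> None:
    """Conceptually called from the block layer when an I/O is issued/completed."""
    last_io_ns[cpu] = time.monotonic_ns()

def psd_allow_deep_idle(cpu: int) -> bool:
    """Governor side: deep states are allowed only once the timeout has expired."""
    last = last_io_ns.get(cpu)
    if last is None:
        return True
    return time.monotonic_ns() - last > psd_cpu_lat_timeout_ms * 1_000_000

if __name__ == "__main__":
    psd_note_io(0)
    print(psd_allow_deep_idle(0))   # False right after I/O -> stay shallow
    time.sleep(psd_cpu_lat_timeout_ms / 1000 + 0.001)
    print(psd_allow_deep_idle(0))   # True once the window has passed
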
>>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME md
>>> array using fio 3.35 performing random read and read-write tests on a
>>> 512GB file with 8 concurrent I/O jobs. Tested with the
>>> NHM_C1_AUTO_DEMOTE bit in MSR_PKG_CST_CONFIG_CONTROL set in
>>> the BIOS.
>>>
>>> Test case: random reads, results based on geometric mean of results from
>>> 5 test runs:
>>>            Bandwidth         IO-ops   Latency   Bandwidth
>>>            read (bytes/sec)  per sec    (ns)    % Std.Deviation
>>> Baseline:  21365755610       20377     390105   1.86%
>>> Patched:   25950107558       24748     322905   0.16%
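For readers skimming the numbers, the figures quoted above work out to roughly
+21% bandwidth/IOPS and -17% latency; a quick check in plain Python on those
values:

# Relative change computed from the random-read figures quoted above
# (geometric means over 5 runs, as stated in the patch mail).
baseline = {"bandwidth": 21365755610, "iops": 20377, "latency_ns": 390105}
patched  = {"bandwidth": 25950107558, "iops": 24748, "latency_ns": 322905}

for key in baseline:
    delta = (patched[key] - baseline[key]) / baseline[key] * 100
    print(f"{key}: {delta:+.1f}%")
# -> bandwidth: +21.5%, iops: +21.4%, latency_ns: -17.2%
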
>>
>> What is the baseline?
>> Do you mind trying with Rafael's recently posted series?
>> Given the IOPS I'd expect good results from that alone already.
>> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/
>>
>> (Happy to see teo as comparison too, which you don't modify).
> 
> OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is 6.14-rc5 without
> Rafael's patch. I also tested the Baseline with C6 and C6P disabled, as this stops deeper
> C-state sleeps and in theory should provide "best performance". I also benchmarked with
> just Rafael's patch and with just my patch, and finally with Rafael's patch AND my patch:
> 
> Reads
>                         Bandwidth       Bandwidth       Latency         Latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                25689182477     0.15            326177          0.15
> C6, C6P disabled        25839580014     0.19            324349          0.19
> Rafael's patch:         25695523747     0.06            326150          0.06
> My patch:               25782011833     0.07            324999          0.07
> Rafael + My patch:      25792551514     0.10            324924          0.10

So these are mostly equal, right?

> 
> Writes
>                         Bandwidth       Bandwidth       Latency         Latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                15220468898     3.33            550290          3.36
> C6, C6P disabled        13405624707     0.66            625424          0.66
> Rafael's patch:         14017625200     1.15            598049          1.16
> My patch:               15444417488     3.73            467818          29.10
> Rafael + My patch:      14037711032     1.13            597143          1.13

These don't make sense to me. Why would disabling C6/C6P be this bad?
Why would Rafael's patch make it worse?
Are these just noise?

> 
> Combined Read+Writes, Reads
> 
>                         Bandwidth       Bandwidth       Latency         Latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                10132229433     0.41            484919          0.25
> C6, C6P disabled        10567536346     0.60            515260          1.16
> Rafael's patch:         10171044817     0.37            486937          0.20
> My patch:               10468953527     0.07            504797          0.07
> Rafael + My patch:      10174707546     1.26            488263          1.13
> 
> Combined Read+Writes, Writes
> 
>                         Bandwidth       Bandwidth       Latency         Latency
>                         Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
> Baseline                10139393169     0.44            342132          1.23
> C6, C6P disabled        10583264662     0.63            277052          3.87
> Rafael's patch:         10178275035     0.39            336989          0.94
> My patch:               10482766569     1.28            294803          6.87
> Rafael + My patch:      10183837235     0.38            330657          3.39

The above two tables indicate about a +3% gain going from (now mainline) Rafael's patch to yours.
There's no reason why Rafael's patch plus your patch should be worse than just your patch.
If this is statistically significant, we would need to look into why.
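To put numbers on that, a back-of-the-envelope check using the combined
read+write write-bandwidth rows above, taking the reported %Std.Dev. as a rough
noise estimate (not a proper significance test on the raw runs):

# Rough noise check on the combined read+write (writes) bandwidth figures above.
rafael       = (10178275035, 0.39)   # (bytes/sec, % std. deviation over 5 runs)
colin        = (10482766569, 1.28)
rafael_colin = (10183837235, 0.38)

def rel_diff_pct(a, b):
    """Percentage change from configuration a to configuration b."""
    return (b[0] - a[0]) / a[0] * 100

print(f"Rafael -> Colin:       {rel_diff_pct(rafael, colin):+.2f}%")        # ~ +3.0%
print(f"Colin -> Rafael+Colin: {rel_diff_pct(colin, rafael_colin):+.2f}%")  # ~ -2.9%
# The ~3% swing is somewhat larger than the reported per-config spread
# (~0.4-1.3%), so it may be real, but more runs (or the raw per-run samples)
# would be needed to say for sure.
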

FWIW, I think the energy cost of essentially remaining in polling even for quite
sporadic IO shouldn't be underestimated.
A comparison with teo, which might do slightly better than menu while not outright
disabling idle, would still be interesting.



