Message-ID: <SJ2PR11MB7670F63C7052C88637D305DF8DDF2@SJ2PR11MB7670.namprd11.prod.outlook.com>
Date: Mon, 17 Mar 2025 10:03:51 +0000
From: "King, Colin" <colin.king@...el.com>
To: Christian Loehle <christian.loehle@....com>, Jens Axboe <axboe@...nel.dk>,
	"Rafael J. Wysocki" <rafael@...nel.org>, Daniel Lezcano
	<daniel.lezcano@...aro.org>, "linux-block@...r.kernel.org"
	<linux-block@...r.kernel.org>, "linux-pm@...r.kernel.org"
	<linux-pm@...r.kernel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for
 fast I/O devices

Hi Christian, 

Follow-up below:

> -----Original Message-----
> From: Christian Loehle <christian.loehle@....com>
> Sent: 03 March 2025 22:25
> To: King, Colin <colin.king@...el.com>; Jens Axboe <axboe@...nel.dk>; Rafael
> J. Wysocki <rafael@...nel.org>; Daniel Lezcano <daniel.lezcano@...aro.org>;
> linux-block@...r.kernel.org; linux-pm@...r.kernel.org
> Cc: linux-kernel@...r.kernel.org
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
> 
> On 3/3/25 16:43, Colin Ian King wrote:
> > Modern processors can drop into deep sleep states relatively quickly
> > to save power. However, coming out of deep sleep states takes a small
> > amount of time and this is detrimental to performance for I/O devices
> > such as fast PCIe NVME drives when servicing completed I/O
> > transactions.
> >
> > Testing with fio with read/write RAID0 PCIe NVME devices on various
> > modern SMP based systems (such as a 96 thread Granite Rapids Xeon 6741P)
> > has shown that 85-90% of read/write transactions issued on a CPU
> > are completed by the same CPU, so it makes some sense to prevent the
> > CPU from dropping into a deep sleep state to help reduce I/O handling
> > latency.
> 
> For the platform you tested on that may be true, but even if we constrain
> ourselves to pci-nvme there's a variety of queue/irq mappings where this
> doesn't hold I'm afraid.

This code is optional: one can enable or disable it via the config option. Also,
even when it is built in, one can disable it at run time by writing 0 to the sysfs file
  /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
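
FWIW, a simplified sketch of what such a sysfs store handler can look like is below; the
attribute name matches the sysfs path above, but the handler body and surrounding
declarations are illustrative rather than the exact code in the patch:

#include <linux/device.h>
#include <linux/kernel.h>

/* sketch only: ms to block deep sleep after an I/O; default 3 ms, 0 disables it */
static unsigned int psd_cpu_lat_timeout_ms = 3;

static ssize_t psd_cpu_lat_timeout_ms_store(struct device *dev,
                                            struct device_attribute *attr,
                                            const char *buf, size_t count)
{
        unsigned int val;

        if (kstrtouint(buf, 10, &val))
                return -EINVAL;
        WRITE_ONCE(psd_cpu_lat_timeout_ms, val);   /* 0 = mechanism disabled */
        return count;
}

So the mechanism can be turned off at run time without a rebuild or reboot.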

> 
> >
> > This commit introduces a simple, lightweight and fast power sleep
> > demotion mechanism that provides the block layer a way to inform the
> > menu governor to prevent a CPU from going into a deep sleep when an
> > I/O operation is requested. While it is true that some I/Os may not
> 
> s/requested/completed is the full truth, isn't it?
> 
> > be serviced on the same CPU that issued the I/O request, and hence it is
> > not 100% perfect, the mechanism does work well in the vast majority of
> > I/O operations and there is very small overhead with the sleep
> > demotion prevention.
> >
> > Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME md
> > array using fio 3.35 performing random read and read-write test on a
> > 512GB file with 8 concurrent I/O jobs. Tested with the
> > NHM_C1_AUTO_DEMOTE bit in MSR_PKG_CST_CONFIG_CONTROL set via
> > the BIOS.
> >
> > Test case: random reads, results based on geometric mean of results
> > from
> > 5 test runs:
> >            Bandwidth         IO-ops   Latency   Bandwidth
> >            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> > Baseline:  21365755610	     20377     390105   1.86%
> > Patched:   25950107558       24748     322905   0.16%
> 
> What is the baseline?
> Do you mind trying with Rafael's recently posted series?
> Given the IOPS I'd expect good results from that alone already.
> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/
> 
> (Happy to see teo as comparison too, which you don't modify).

OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is 6.14-rc5 without
Rafael's patch. I also tested the Baseline with C6 and C6P disabled, as this stops deeper
C-state sleeps and in theory should provide the "best performance". I also benchmarked with
Rafael's patch alone, with just my patch, and finally with Rafael's patch AND my patch:

Reads
                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                25689182477     0.15            326177          0.15
C6, C6P disabled        25839580014     0.19            324349          0.19
Rafael's patch:         25695523747     0.06            326150          0.06
My patch:               25782011833     0.07            324999          0.07
Rafael's + my patch:    25792551514     0.10            324924          0.10

Writes
                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                15220468898     3.33            550290          3.36
C6, C6P disabled        13405624707     0.66            625424          0.66
Rafael's patch:         14017625200     1.15            598049          1.16
My patch:               15444417488     3.73            467818          29.10
Rafael's + my patch:    14037711032     1.13            597143          1.13

Combined Read+Writes, Reads

                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                10132229433     0.41            484919          0.25
C6, C6P disabled        10567536346     0.60            515260          1.16
Rafael's patch:         10171044817     0.37            486937          0.20
My patch:               10468953527     0.07            504797          0.07
Rafael's + my patch:    10174707546     1.26            488263          1.13

Combined Read+Writes, Writes

                        Bandwidth       Bandwidth       latency         latency
                        Bytes/sec       %Std.Dev.       nanosecs        %Std.Dev.
Baseline                10139393169     0.44            342132          1.23
C6, C6P disabled        10583264662     0.63            277052          3.87
Rafael's patch:         10178275035     0.39            336989          0.94
My patch:               10482766569     1.28            294803          6.87
Rafael's + my patch:    10183837235     0.38            330657          3.39
> 
> >
> > Read rate improvement of ~21%.
> >
> > Test case: random read+writes, results based on geometric mean of
> > results from 5 test runs:
> >
> >            Bandwidth         IO-ops   Latency   Bandwidth
> >            read (bytes/sec)  per sec    (ns)    % Std.Deviation
> > Baseline:   9937848224        9477     550094   1.04%
> > Patched:   10502592508       10016     509315   1.85%
> >
> > Read rate improvement of ~5.7%
> >
> >            Bandwidth         IO-ops   Latency   Bandwidth
> >            write (bytes/sec) per sec    (ns)    % Std.Deviation
> > Baseline:   9945197656        9484     288933   1.02%
> > Patched:   10517268400       10030     287026   1.85%
> >
> > Write rate improvement of ~5.7%
> >
> > For kernel builds, where all CPUs are fully loaded, no performance
> > improvement or regressions were observed based on the results of
> > 5 kernel build test runs.
> >
> > By default, CPU power sleep demotion blocking is set to run for 3 ms
> > on I/O requests, but this can be modified using the new sysfs
> > interface:
> >
> >   /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> 
> rounding up a jiffie sure is a heavy price to pay then.

Jiffies were used because the code is designed to be as lightweight as possible. Rounding up to a
jiffy isn't problematic; using higher-precision timing is more CPU expensive and doesn't really
gain us any extra win at this kind of time resolution. We need the code to be lightweight and
"good enough" on the I/O path rather than overly expensive and "100% perfect".
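
To make the trade-off concrete, below is a minimal sketch of the kind of jiffies-based
stamp/check pair involved (the function and variable names are illustrative, not necessarily
those used in the patch): the I/O path records a per-CPU deadline and the menu governor
avoids deep idle states until that deadline has passed.

#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/types.h>

extern unsigned int psd_cpu_lat_timeout_ms;   /* sysfs-controlled timeout (ms), 0 = disabled */

static DEFINE_PER_CPU(unsigned long, psd_deadline);

/* called on the I/O path: stamp a "stay out of deep sleep" deadline on this CPU */
void psd_io_stamp(void)
{
        unsigned int timeout_ms = READ_ONCE(psd_cpu_lat_timeout_ms);

        if (timeout_ms)
                this_cpu_write(psd_deadline,
                               jiffies + msecs_to_jiffies(timeout_ms));
}

/* checked by the menu governor: true while deep sleep states should be avoided */
bool psd_sleep_demotion_blocked(void)
{
        return READ_ONCE(psd_cpu_lat_timeout_ms) &&
               time_before(jiffies, this_cpu_read(psd_deadline));
}

Everything on the hot path is a per-CPU write and some jiffies arithmetic, which is why the
overhead stays negligible compared with higher-precision timing.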

> 
> >
> > Setting this to zero will disable the mechanism.
> >
> > Signed-off-by: Colin Ian King <colin.king@...el.com>
> > ---
> >  block/blk-mq.c                   |   2 +
> >  drivers/cpuidle/Kconfig          |  10 +++
> >  drivers/cpuidle/Makefile         |   1 +
> >  drivers/cpuidle/governors/menu.c |   4 +
> >  drivers/cpuidle/psd.c            | 123 +++++++++++++++++++++++++++++++
> >  include/linux/cpuidle_psd.h      |  32 ++++++++
> >  6 files changed, 172 insertions(+)
> >  create mode 100644 drivers/cpuidle/psd.c
> >  create mode 100644 include/linux/cpuidle_psd.h
> >
