Message-ID: <SJ2PR11MB76701E015E411063C35938D98DA62@SJ2PR11MB7670.namprd11.prod.outlook.com>
Date: Wed, 26 Mar 2025 15:14:09 +0000
From: "King, Colin" <colin.king@...el.com>
To: Christian Loehle <christian.loehle@....com>, Jens Axboe <axboe@...nel.dk>,
"Rafael J. Wysocki" <rafael@...nel.org>, Daniel Lezcano
<daniel.lezcano@...aro.org>, "linux-block@...r.kernel.org"
<linux-block@...r.kernel.org>, "linux-pm@...r.kernel.org"
<linux-pm@...r.kernel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for
fast I/O devices
Hi,
> -----Original Message-----
> From: Christian Loehle <christian.loehle@....com>
> Sent: 23 March 2025 09:19
> To: King, Colin <colin.king@...el.com>; Jens Axboe <axboe@...nel.dk>; Rafael
> J. Wysocki <rafael@...nel.org>; Daniel Lezcano <daniel.lezcano@...aro.org>;
> linux-block@...r.kernel.org; linux-pm@...r.kernel.org
> Cc: linux-kernel@...r.kernel.org
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
>
> On 3/17/25 10:03, King, Colin wrote:
> > Hi Christian,
> >
> > Follow-up below:
> >
> >> -----Original Message-----
> >> From: Christian Loehle <christian.loehle@....com>
> >> Sent: 03 March 2025 22:25
> >> To: King, Colin <colin.king@...el.com>; Jens Axboe <axboe@...nel.dk>;
> >> Rafael J. Wysocki <rafael@...nel.org>; Daniel Lezcano
> >> <daniel.lezcano@...aro.org>; linux-block@...r.kernel.org;
> >> linux-pm@...r.kernel.org
> >> Cc: linux-kernel@...r.kernel.org
> >> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion
> >> prevention for fast I/O devices
> >>
> >> On 3/3/25 16:43, Colin Ian King wrote:
> >>> Modern processors can drop into deep sleep states relatively quickly
> >>> to save power. However, coming out of deep sleep states takes a
> >>> small amount of time, and this is detrimental to performance for I/O
> >>> devices such as fast PCIe NVME drives when servicing completed I/O
> >>> transactions.
> >>>
> >>> Testing with fio on read/write RAID0 PCIe NVME devices on various
> >>> modern SMP based systems (such as a 96 thread Granite Rapids Xeon
> >>> 6741P) has shown that 85-90% of read/write transactions issued on
> >>> a CPU are completed by the same CPU, so it makes some sense to
> >>> prevent the CPU from dropping into a deep sleep state to help reduce
> >>> I/O handling latency.
> >>
> >> For the platform you tested on that may be true, but even if we
> >> constrain ourselves to pci-nvme there's a variety of queue/irq
> >> mappings where this doesn't hold I'm afraid.
> >
> > This code is optional: one can enable or disable it via the config
> > option. Also, even when it is built in, one can disable it at runtime
> > by writing 0 to the sysfs file
> > /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
> >
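To make the run-time control concrete: a timeout of 0 means the governor-side
check is skipped entirely, so behaviour is identical to an unpatched kernel.
Below is a simplified sketch of that check; the identifiers and the default
value are illustrative, not the literal code from the patch.

#include <linux/percpu.h>
#include <linux/sched/clock.h>	/* local_clock() */
#include <linux/time64.h>	/* NSEC_PER_MSEC */

static unsigned int psd_cpu_lat_timeout_ms = 3;	/* sysfs tunable, example default */

static DEFINE_PER_CPU(u64, psd_last_io_ns);	/* stamped by the block layer */

static bool psd_keep_cpu_shallow(void)
{
	u64 timeout_ns;

	if (!psd_cpu_lat_timeout_ms)	/* 0 => feature disabled */
		return false;

	timeout_ns = (u64)psd_cpu_lat_timeout_ms * NSEC_PER_MSEC;

	/* Recent I/O on this CPU: don't demote into a deep C-state */
	return local_clock() - this_cpu_read(psd_last_io_ns) < timeout_ns;
}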
> >>
> >>>
> >>> This commit introduces a simple, lightweight and fast power sleep
> >>> demotion mechanism that provides the block layer with a way to inform
> >>> the menu governor to prevent a CPU from going into a deep sleep when
> >>> an I/O operation is requested. While it is true that some I/Os may not
> >>
> >> s/requested/completed is the full truth, isn't it?
> >>
> >>> be serviced on the same CPU that issued the I/O request, and hence
> >>> the mechanism is not 100% perfect, it does work well for the vast
> >>> majority of I/O operations and the sleep demotion prevention adds
> >>> very little overhead.
> >>>
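The block layer side of the mechanism is deliberately tiny: on I/O submission
(or completion, per Christian's point above) the current CPU's timestamp is
refreshed, and that is what the governor-side check above reads. Again a
simplified sketch, not the exact hook from the patch:

#include <linux/percpu.h>
#include <linux/sched/clock.h>

/* Same per-CPU timestamp that the governor-side check reads. */
static DEFINE_PER_CPU(u64, psd_last_io_ns);

/*
 * Called from the block layer I/O path; the fast-path cost is a single
 * per-CPU write, which is where the "very little overhead" claim comes from.
 */
static inline void cpuidle_psd_note_io(void)
{
	this_cpu_write(psd_last_io_ns, local_clock());
}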
> >>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME
> >>> md array using fio 3.35 performing random read and read-write tests
> >>> on a 512GB file with 8 concurrent I/O jobs. Tested with the
> >>> NHM_C1_AUTO_DEMOTE bit in MSR_PKG_CST_CONFIG_CONTROL set in
> >>> the BIOS.
> >>>
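For anyone who wants to verify the auto-demotion setting without rebooting
into the BIOS: intel_idle.c defines NHM_C1_AUTO_DEMOTE as bit 26 of
MSR_PKG_CST_CONFIG_CONTROL (MSR 0xe2); worth double-checking the bit layout
for your CPU generation. A small userspace sketch using the msr driver
(modprobe msr, run as root):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_PKG_CST_CONFIG_CONTROL	0xe2
#define NHM_C1_AUTO_DEMOTE		(1ULL << 26)	/* per intel_idle.c; double-check */

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 ||
	    pread(fd, &val, sizeof(val), MSR_PKG_CST_CONFIG_CONTROL) != sizeof(val)) {
		perror("read msr");
		return 1;
	}
	printf("C1 auto-demotion: %s\n",
	       (val & NHM_C1_AUTO_DEMOTE) ? "enabled" : "disabled");
	close(fd);
	return 0;
}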
> >>> Test case: random reads, results are the geometric mean of 5 test
> >>> runs:
> >>>              Bandwidth         IO-ops   Latency  Bandwidth
> >>>              read (bytes/sec)  per sec  (ns)     % Std.Deviation
> >>> Baseline:    21365755610       20377    390105   1.86%
> >>> Patched:     25950107558       24748    322905   0.16%
> >>
> >> What is the baseline?
> >> Do you mind trying with Rafael's recently posted series?
> >> Given the IOPS I'd expect good results from that alone already.
> >> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@rjwysocki.net/
> >>
> >> (Happy to see teo as comparison too, which you don't modify).
> >
> > OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is
> > 6.14-rc5 without Rafael's patch. I also tested the Baseline with C6
> > and C6P disabled, as this stops deeper C-state sleeps and in theory
> > should provide "best performance". I also benchmarked with just Rafael's
> > patch, with just my patch, and finally with Rafael's patch AND my patch:
> >
> > Reads
> >                       Bandwidth    Bandwidth  Latency   Latency
> >                       Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline              25689182477  0.15       326177    0.15
> > C6, C6P disabled      25839580014  0.19       324349    0.19
> > Rafael's patch        25695523747  0.06       326150    0.06
> > My patch              25782011833  0.07       324999    0.07
> > Rafael's + my patch   25792551514  0.10       324924    0.10
>
> So these are mostly equal, right?
Not really. In this case we have a system under a lot of random read I/O.
Disabling C6 increases the read rate since we remove the exit latency of the
deeper sleep states. With my patch we approach the C6-disabled rate. The
results are an average of 5 runs and the metrics have fairly low jitter, so I
believe the read rates are reliable. We're getting a better read rate with my
patch than with the baseline, without the power penalty of C6 being
permanently disabled.
>
> >
> > Writes
> >                       Bandwidth    Bandwidth  Latency   Latency
> >                       Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline              15220468898  3.33       550290    3.36
> > C6, C6P disabled      13405624707  0.66       625424    0.66
> > Rafael's patch        14017625200  1.15       598049    1.16
> > My patch              15444417488  3.73       467818    29.10
> > Rafael's + my patch   14037711032  1.13       597143    1.13
>
> These don't make sense to me; why would C6 / C6P be this bad?
I double checked; it's because I got the row labels mixed up when pasting the
results into the email. Doh! It should be:
Baseline              13405624707  0.66       625424    0.66
C6, C6P disabled      15220468898  3.33       550290    3.36
> Why would Rafael's patch make it worse?
> Are these just noise?
>
> >
> > Combined Read+Writes, Reads
> >
> >                       Bandwidth    Bandwidth  Latency   Latency
> >                       Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline              10132229433  0.41       484919    0.25
> > C6, C6P disabled      10567536346  0.60       515260    1.16
> > Rafael's patch        10171044817  0.37       486937    0.20
> > My patch              10468953527  0.07       504797    0.07
> > Rafael's + my patch   10174707546  1.26       488263    1.13
> >
> > Combined Read+Writes, Writes
> >
> >                       Bandwidth    Bandwidth  Latency   Latency
> >                       Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> > Baseline              10139393169  0.44       342132    1.23
> > C6, C6P disabled      10583264662  0.63       277052    3.87
> > Rafael's patch        10178275035  0.39       336989    0.94
> > My patch              10482766569  1.28       294803    6.87
> > Rafael's + my patch   10183837235  0.38       330657    3.39
>
> The above two indicate a +3% from (now mainline) Rafael's patch to yours.
> There's no reason why Rafael + your patch should be worse than just your
> patch.
> If this is statistically significant we would need to look into why.
>
> FWIW I think the energy cost of essentially remaining in polling even for
> quite sporadic IO shouldn't be underestimated IMO.
> A comparison with teo, which might do slightly better than menu while not
> outright disabling idle, would still be interesting.
>
>