Message-ID: <CAGETcx97SEHP5MspzBHsMkmSExnY870DQ-F5L077vzOGnPx0UA@mail.gmail.com>
Date: Tue, 5 Nov 2024 17:07:44 -0800
From: Saravana Kannan <saravanak@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org, 
	kprateek.nayak@....com, wuyun.abel@...edance.com, youssefesmat@...omium.org, 
	tglx@...utronix.de, efault@....de, 
	Android Kernel Team <kernel-team@...roid.com>, Qais Yousef <qyousef@...gle.com>, 
	Vincent Palomares <paillon@...gle.com>, Samuel Wu <wusamuel@...gle.com>, David Dai <davidai@...gle.com>, 
	John Stultz <jstultz@...gle.com>
Subject: Re: [PATCH 00/24] Complete EEVDF

On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spent the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
>  - split up the huge delay-dequeue patch
>  - tested/fixed cfs-bandwidth
>  - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>  - SCHED_BATCH is equivalent to RESPECT_SLICE
>  - propagate min_slice up cgroups
>  - CLOCK_THREAD_DVFS_ID
>

Hi Peter,

TL;DR:
We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
change we accept; some of these changes are merges from Linus's tree.
We see a very clear change in behavior with this patch series, and
based on what we are seeing, we'd expect it to cause a pretty serious
power regression (7-25%) depending on what the actual bug is and the
use case.

Intro:
We run these tests 20 times for every build (each build contains a
bunch of changes). All the data below is from the 20+ builds before
this series and the 20 builds after this series (inclusive). So all
the "before" numbers come from 400+ runs (20 runs x 20+ builds) and
all the "after" numbers come from another 400+ runs.

Test:
We create a synthetic "tiny" thread that runs for 3ms and sleeps for
10ms at Fmin. We let it run like this for several seconds to make sure
the util is low and all the "new thread" boost stuff isn't kicking in.
So, at this point, the CPU frequency is at Fmin. Then we let this
thread run continuously without sleeping and measure (using ftrace)
the time it takes for the CPU to get to Fmax.

We do this separately (fresh run) on the Pixel 6 with the CPU
affinity set to each cluster, and once without any CPU affinity (the
thread starts on the little cluster). A rough sketch of the workload
is included below.
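
For reference, here is roughly what the workload looks like. This is
a minimal sketch reconstructed from the description above, not our
actual harness; the CPU list handling, the iteration counts and the
ftrace measurement side are assumptions:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Busy-loop for roughly @ms milliseconds. */
static void busy_for_ms(int ms)
{
	uint64_t end = now_ns() + (uint64_t)ms * 1000000ULL;

	while (now_ns() < end)
		;
}

int main(int argc, char **argv)
{
	int i;

	/* Optionally pin to one cluster, e.g. "./tiny 0 1 2 3". */
	if (argc > 1) {
		cpu_set_t set;

		CPU_ZERO(&set);
		for (i = 1; i < argc; i++)
			CPU_SET(atoi(argv[i]), &set);
		sched_setaffinity(0, sizeof(set), &set);
	}

	/* Phase 1: 3 ms run / 10 ms sleep, long enough for util to settle. */
	for (i = 0; i < 400; i++) {
		busy_for_ms(3);
		usleep(10000);
	}

	/* Phase 2: run continuously; time to Fmax is measured via ftrace. */
	for (;;)
		busy_for_ms(1000);
}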

Data:
All the values below are in milliseconds.

When the thread is not affined to any CPU, it starts on the little
cluster, ramps up to Fmax, migrates to the middle cluster, ramps up
to Fmax, migrates to the big cluster, and ramps up to Fmax.
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  | 169    | 151   |
| Median          | 221    | 177   |
| Mean            | 221    | 177   |
| 95th percentile | 249    | 200   |
+-----------------+--------+-------+

When the thread is affined to the little cluster:
The average time to reach Fmax is 104 ms without this series and 66 ms
after this series. We didn't collect the individual per-run data; we
can if you really need it. We also noticed that the little cluster
wouldn't go down to Fmin (300 MHz) after this series even when the
CPUs are mostly idle. It was instead hovering at 738 MHz (Fmax is
~1800 MHz).

When the thread is affined to the middle cluster:
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  | 99     | 84    |
| Median          | 111    | 104   |
| Mean            | 111    | 104   |
| 95th percentile | 120    | 119   |
+-----------------+--------+-------+

When the thread is affined to the big cluster:
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  | 138    | 96    |
| Median          | 147    | 144   |
| Mean            | 145    | 134   |
| 95th percentile | 151    | 150   |
+-----------------+--------+-------+

As you can see, the ramp-up time has decreased noticeably. Also, as
the 5th percentile numbers show, the spread of the ramp-up times has
widened a lot (most noticeable in the middle and big clusters). So in
general this looks like it's going to increase the usage of the middle
and big CPU clusters and also shift the CPU frequency residency
towards frequencies that are 5 to 25% higher.

We already checked the rate_limit_us value: it is the same for both
the before/after cases and is set to 7.5 ms (a jiffy is 4 ms in our
case). Also, based on my limited understanding, the DELAYED_DEQUEUE
stuff is only relevant if there are multiple contending threads on a
CPU. In this case it's just one continuously running thread, plus a
kworker that runs sporadically, less than 1% of the time.
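
For completeness, this is roughly how we read that value (assuming
the schedutil governor and the standard sysfs layout; policy0 is just
an example and stands for whichever policy the cluster maps to):

#include <stdio.h>

int main(void)
{
	char buf[32];
	FILE *f = fopen("/sys/devices/system/cpu/cpufreq/policy0/"
			"schedutil/rate_limit_us", "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("rate_limit_us: %s", buf);	/* 7500 here */
	if (f)
		fclose(f);
	return 0;
}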

So, without a deeper understanding of this patch series, it's
behaving as if the PELT signal is accumulating faster than expected.
That is a bit surprising to me because, AFAIK (which is not much),
the EEVDF series isn't supposed to change the PELT behavior.
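
To spell out our rough mental model (which this series isn't supposed
to touch, as far as we know): PELT decays contributions per 1024 us
segment by a factor y with y^32 = 1/2 (a 32 ms half-life), so a
periodic task with duty cycle d should settle at roughly

	util ~= d * (f / f_max) * (C / C_max) * 1024

For the 3 ms run / 10 ms sleep pattern at Fmin that is
d = 3 / (3 + 10) ~= 0.23, scaled further down by the frequency and
capacity invariance factors, i.e. a low util that should keep the CPU
at Fmin. A faster ramp suggests something on this path has changed.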

If you want to get a visual idea of what the system is doing, here are
some perfetto links that visualize the traces. Hopefully you have
access permissions to these. You can use the W, S, A, D keys to pan
and zoom around the timeline.

Big Before:
https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
Big After:
https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9

Thanks,
Saravana, Sam and David
