linux-kernel - Re: [PATCH 00/24] Complete EEVDF

Message-ID: <4a3dac53-69e5-d3cd-8bc0-3549af4932b3@amd.com>
Date: Wed, 6 Nov 2024 11:49:00 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Saravana Kannan <saravanak@...gle.com>, Samuel Wu <wusamuel@...gle.com>,
	David Dai <davidai@...gle.com>, Peter Zijlstra <peterz@...radead.org>
CC: <mingo@...hat.com>, <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<wuyun.abel@...edance.com>, <youssefesmat@...omium.org>,
	<tglx@...utronix.de>, <efault@....de>, Android Kernel Team
	<kernel-team@...roid.com>, Qais Yousef <qyousef@...gle.com>, Vincent
 Palomares <paillon@...gle.com>, John Stultz <jstultz@...gle.com>, Mike
 Galbraith <efault@....de>, Luis Machado <luis.machado@....com>
Subject: Re: [PATCH 00/24] Complete EEVDF

(+ Mike, Luis)

Hello Saravana, Sam, David,

On 11/6/2024 6:37 AM, Saravana Kannan wrote:
> On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@...radead.org> wrote:
>>
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree forever, it seems, and people have been
>> testing it and sending fixes.
>>
>> I've spent the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>>   - split up the huge delay-dequeue patch
>>   - tested/fixed cfs-bandwidth
>>   - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>>   - SCHED_BATCH is equivalent to RESPECT_SLICE
>>   - propagate min_slice up cgroups
>>   - CLOCK_THREAD_DVFS_ID
>>
> 
> Hi Peter,
> 
> TL;DR:
> We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
> change we accept. Some of these changes are merges from Linus's tree.
> We can see a very clear change in behavior with this patch series.
> Based on what we are seeing, we'd expect this change in behavior to
> cause pretty serious power regression (7-25%) depending on what the
> actual bug is and the use case.

Do the regressions persist with NO_DELAY_DEQUEUE? You can disable the
DELAY_DEQUEUE feature added by the Complete EEVDF series via debugfs:

     # echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features
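
To confirm the toggle took effect across test runs, the features file can
be read back. Below is a minimal Python sketch, assuming the file is a
whitespace-separated list of feature names in which disabled features
carry a NO_ prefix. The helper names parse_sched_features() and
delay_dequeue_enabled() are hypothetical, and the sample string is
illustrative only, not output from a real device.

```python
from pathlib import Path

# Path used in the command above; reading it needs root and a mounted debugfs.
FEATURES_PATH = Path("/sys/kernel/debug/sched/features")

def parse_sched_features(text):
    """Map each feature name to True (enabled) or False (NO_ prefix)."""
    features = {}
    for token in text.split():
        if token.startswith("NO_"):
            features[token[3:]] = False
        else:
            features[token] = True
    return features

def delay_dequeue_enabled(path=FEATURES_PATH):
    """Return True/False for DELAY_DEQUEUE, or None if it is not listed."""
    return parse_sched_features(Path(path).read_text()).get("DELAY_DEQUEUE")

# Illustrative input only; the real feature list depends on the kernel build.
sample = "PLACE_LAG NO_DELAY_DEQUEUE DELAY_ZERO"
print(parse_sched_features(sample))
```

Calling delay_dequeue_enabled() before each run and logging the result
alongside the ramp-up numbers would make it clear which runs had the
feature disabled.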

Since delayed entities are still on the runqueue, they can affect the PELT
calculation. Vincent and Dietmar have both noted this, and Peter posted
https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
in response, but it was pulled out since Luis reported observing negative
values for h_nr_delayed on his setup. A lot has been fixed around
delayed dequeue since then, and I wonder if now would be the right time to
re-attempt h_nr_delayed tracking.

There is also the fact that delayed entities don't update the tg
loadavg, since the delayed path calls update_load_avg() without the
UPDATE_TG flag set. This can again cause some changes in the PELT
calculation, since the averages are used to estimate the entity
shares when running with cgroups.
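
To illustrate why an entity that lingers on the runqueue could shorten
the ramp-up, here is a toy Python model of a PELT-style geometric
average. It is a sketch under loud assumptions, not the kernel code: it
uses 1 ms periods (the kernel uses 1024 us), a 32-period half-life, and
a hypothetical lingering_ms knob that keeps the entity counted for the
first few periods of each sleep as a stand-in for a delayed dequeue. It
models one generic average and does not distinguish util_avg from
runnable_avg.

```python
# Toy PELT-style model; NOT the kernel implementation.
HALF_LIFE_MS = 32                  # a contribution halves after ~32 periods
Y = 0.5 ** (1 / HALF_LIFE_MS)      # per-period decay factor
MAX_UTIL = 1024

def step(avg, contributing):
    """Advance the average by one 1 ms period."""
    return avg * Y + (MAX_UTIL * (1 - Y) if contributing else 0.0)

def steady_state(run_ms, sleep_ms, lingering_ms=0, cycles=200):
    """Average at the end of a sleep after many run/sleep cycles.

    lingering_ms is a hypothetical knob: the number of sleep periods
    during which the entity is still counted as contributing.
    """
    avg = 0.0
    for _ in range(cycles):
        for _ in range(run_ms):
            avg = step(avg, True)
        for i in range(sleep_ms):
            avg = step(avg, i < lingering_ms)
    return avg

def ramp_ms(start, target):
    """Milliseconds of continuous running needed to go from start to target."""
    avg, ms = start, 0
    while avg < target:
        avg = step(avg, True)
        ms += 1
    return ms

# The test pattern from the report: run 3 ms, sleep 10 ms, then run flat out.
baseline = steady_state(3, 10)
lingering = steady_state(3, 10, lingering_ms=2)
target = 0.8 * MAX_UTIL            # arbitrary threshold chosen for illustration
print(round(baseline), round(lingering))
print(ramp_ms(baseline, target), ramp_ms(lingering, target))
```

In this model, any extra time counted during the sleep raises the
starting point of the average, so the continuous-run phase crosses a
given threshold sooner. Whether that is what happens on the Pixel 6 is
exactly what the NO_DELAY_DEQUEUE experiment would tell us.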

> 
> Intro:
> We run these tests 20 times for every build (a bunch of changes). All
> the data below is from the 20+ builds before this series and 20 builds
> after this series (inclusive). So, all the "before numbers" are from
> (20 x 20) 400+ runs and all the after numbers are from another 400+
> runs.
> 
> Test:
> We create a synthetic "tiny" thread that runs for 3ms and sleeps for
> 10ms at Fmin. We let it run like this for several seconds to make sure
> the util is low and all the "new thread" boost stuff isn't kicking in.
> So, at this point, the CPU frequency is at Fmin. Then we let this
> thread run continuously without sleeping and measure (using ftrace)
> the time it takes for the CPU to get to Fmax.
> 
> We do this separately (fresh run) on the Pixel 6 with the cpu affinity
> set to each cluster and once without any cpu affinity (thread starts
> at little).
> 
> Data:
> All the values below are in milliseconds.
> 
> When the thread is not affined to any CPU: So thread starts on little,
> ramps up to fmax, migrates to middle, ramps up to fmax, migrates to
> big, ramps up to fmax.
> +-----------------+--------+-------+
> | Data            | Before | After |
> |-----------------+--------+-------|
> | 5th percentile  | 169    | 151   |
> |-----------------+--------+-------|
> | Median          | 221    | 177   |
> |-----------------+--------+-------|
> | Mean            | 221    | 177   |
> |-----------------+--------+-------|
> | 95th percentile | 249    | 200   |
> +-----------------+--------+-------+
> 
> When thread is affined to the little cluster:
> The average time to reach Fmax is 104 ms without this series and 66 ms
> after this series. We didn't collect the individual per run data. We
> can if you really need it. We also noticed that the little cluster
> wouldn't go to Fmin (300 MHz) after this series even when the CPUs are
> mostly idle. It was instead hovering at 738 MHz (the Fmax is ~1800
> MHz).
> 
> When thread is affined to the middle cluster:
> +-----------------+--------+-------+
> | Data            | Before | After |
> |-----------------+--------+-------|
> | 5th percentile  | 99     | 84    |
> |-----------------+--------+-------|
> | Median          | 111    | 104   |
> |-----------------+--------+-------|
> | Mean            | 111    | 104   |
> |-----------------+--------+-------|
> | 95th percentile | 120    | 119   |
> +-----------------+--------+-------+
> 
> When thread is affined to the big cluster:
> +-----------------+--------+-------+
> | Data            | Before | After |
> |-----------------+--------+-------|
> | 5th percentile  | 138    | 96    |
> |-----------------+--------+-------|
> | Median          | 147    | 144   |
> |-----------------+--------+-------|
> | Mean            | 145    | 134   |
> |-----------------+--------+-------|
> | 95th percentile | 151    | 150   |
> +-----------------+--------+-------+
> 
> As you can see, the ramp up time has decreased noticeably. Also, as
> you can tell from the 5th percentile numbers, the standard deviation
> has increased a lot, causing a wider spread of the ramp up
> time (more noticeable in the middle and big clusters). So in general
> this looks like it's going to increase the usage of the middle and big
> CPU clusters and also going to shift the CPU frequency residency to
> frequencies that are 5 to 25% higher.
> 
> We already checked the rate_limit_us value and it is the same for both
> the before/after cases and it's set to 7.5 ms (a jiffy is 4 ms in our
> case). Also, based on my limited understanding, the DELAY_DEQUEUE
> stuff is only relevant if there are multiple contending threads on a
> CPU. In this case it's just 1 continuously running thread with a
> kworker that runs sporadically less than 1% of the time.
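
For reference, the relative reduction in time-to-Fmax implied by the
numbers quoted above can be worked out with a few lines of Python. This
is only arithmetic on the reported values; the pct_reduction() helper
and the dictionary labels are made up for this illustration.

```python
def pct_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0

# Time-to-Fmax values in ms, copied from the quoted report.
time_to_fmax_ms = {
    "unaffined (median)": (221, 177),
    "little (mean)": (104, 66),
    "middle (median)": (111, 104),
    "big (median)": (147, 144),
}

for label, (before, after) in time_to_fmax_ms.items():
    print(f"{label}: {pct_reduction(before, after):.1f}% shorter ramp")
```

The unaffined and little cases show the largest drop, while the big
cluster median barely moves even though its 5th percentile does.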

There is an ongoing investigation into delayed entities possibly not
migrating if they are woken up before they are fully dequeued. Since you
mention there is only one task, this should not matter but could you
also try out Mike's suggestion from
https://lore.kernel.org/lkml/1bffa5f2ca0fec8a00f84ffab86dc6e8408af31c.camel@gmx.de/
and see if it makes a difference on your test suite?

-- 
Thanks and Regards,
Prateek

> 
> So, without a deeper understanding of this patch series, it's behaving
> as if the PELT signal is accumulating faster than expected. Which is a
> bit surprising to me because AFAIK (which is not much) the EEVDF
> series isn't supposed to change the PELT behavior.
> 
> If you want to get a visual idea of what the system is doing, here are
> some perfetto links that visualize the traces. Hopefully you have
> access permissions to these. You can use the W, S, A, D keys to pan
> and zoom around the timeline.
> 
> Big Before:
> https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
> Big After:
> https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9

P.S. I only gave it a quick glance, but I do see the frequency ramp up with
larger deltas and reach Fmax much more quickly in the case of "Big After".

> 
> Thanks,
> Saravana, Sam and David