linux-kernel - Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

Message-ID: <1461277e-af68-41e7-947c-9178b55810b1@arm.com>
Date: Wed, 24 Apr 2024 16:15:42 +0100
From: Luis Machado <luis.machado@....com>
To: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 bristot@...hat.com, vschneid@...hat.com, linux-kernel@...r.kernel.org
Cc: kprateek.nayak@....com, wuyun.abel@...edance.com, tglx@...utronix.de,
 efault@....de, nd <nd@....com>, John Stultz <jstultz@...gle.com>,
 Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

Hi,

On 4/15/24 18:07, Luis Machado wrote:
> Hi Peter,
> 
> On 4/5/24 11:28, Peter Zijlstra wrote:
>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>> noting that lag is fundamentally a temporal measure. It should not be
>> carried around indefinitely.
>>
>> OTOH it should also not be instantly discarded, doing so will allow a
>> task to game the system by purposefully (micro) sleeping at the end of
>> its time quantum.
>>
>> Since lag is intimately tied to the virtual time base, a wall-time
>> based decay is also insufficient, notably competition is required for
>> any of this to make sense.
>>
>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>> competing until they are eligible.
>>
>> Strictly speaking, we only care about keeping them until the 0-lag
>> point, but that is a difficult proposition, instead carry them around
>> until they get picked again, and dequeue them at that point.
>>
>> Since we should have dequeued them at the 0-lag point, truncate lag
>> (eg. don't let them earn positive lag).
>>
>> XXX test the cfs-throttle stuff
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> 
> Playing around with a Pixel 6 running a 6.6-based kernel with this
> series backported on top, I spotted a noticeable performance improvement
> on the speedometer benchmark:
> 
> - m6.6-stock-* is the 6.6 mainline Android kernel unmodified.
> 
> - m6.6-eevdf-complete-* is the 6.6 mainline Android kernel with
> this series applied on top (along with a few required backported
> patches).
> 
> +-------------------+-----------------------+-----------+
> |      metric       |          tag          | perc_diff |
> +-------------------+-----------------------+-----------+
> | Speedometer Score |     m6.6-stock-1      |   0.0%    |
> | Speedometer Score |     m6.6-stock-2      |   1.23%   |
> | Speedometer Score |     m6.6-stock-3      |  -0.22%   |
> | Speedometer Score | m6.6-eevdf-complete-1 |   4.54%   |
> | Speedometer Score | m6.6-eevdf-complete-2 |   4.95%   |
> | Speedometer Score | m6.6-eevdf-complete-3 |   6.07%   |
> +-------------------+-----------------------+-----------+
> 
> Also some interesting improvements in terms of frame timing for the
> uibenchjanktests benchmark. In particular the metrics of missed
> deadlines and jank (late) frames, which seems to indicate better
> latencies.
> 
> +-----------------------+-----------------------+-----------+
> |        metric         |          tag          | perc_diff |
> +-----------------------+-----------------------+-----------+
> | gfx-avg-frame-time-50 |     m6.6-stock-1      |    0.0    |
> | gfx-avg-frame-time-90 |     m6.6-stock-1      |    0.0    |
> | gfx-avg-frame-time-95 |     m6.6-stock-1      |    0.0    |
> | gfx-avg-frame-time-99 |     m6.6-stock-1      |    0.0    |
> | gfx-avg-frame-time-50 |     m6.6-stock-2      |   3.46    |
> | gfx-avg-frame-time-90 |     m6.6-stock-2      |   1.19    |
> | gfx-avg-frame-time-95 |     m6.6-stock-2      |   0.24    |
> | gfx-avg-frame-time-99 |     m6.6-stock-2      |   0.48    |
> | gfx-avg-frame-time-50 | m6.6-eevdf-complete-1 |  -30.45   |
> | gfx-avg-frame-time-90 | m6.6-eevdf-complete-1 |  -48.44   |
> | gfx-avg-frame-time-95 | m6.6-eevdf-complete-1 |  -51.32   |
> | gfx-avg-frame-time-99 | m6.6-eevdf-complete-1 |  -52.48   |
> | gfx-avg-frame-time-50 | m6.6-eevdf-complete-2 |  -30.32   |
> | gfx-avg-frame-time-90 | m6.6-eevdf-complete-2 |  -48.16   |
> | gfx-avg-frame-time-95 | m6.6-eevdf-complete-2 |  -51.08   |
> | gfx-avg-frame-time-99 | m6.6-eevdf-complete-2 |   -51.7   |
> +-----------------------+-----------------------+-----------+
> 
> +-----------------------------------+-----------------------+-----------+
> |              metric               |          tag          | perc_diff |
> +-----------------------------------+-----------------------+-----------+
> | gfx-avg-num-frame-deadline-missed |     m6.6-stock-1      |    0.0    |
> | gfx-max-num-frame-deadline-missed |     m6.6-stock-1      |    0.0    |
> | gfx-avg-num-frame-deadline-missed |     m6.6-stock-2      |   -3.21   |
> | gfx-max-num-frame-deadline-missed |     m6.6-stock-2      |   -3.21   |
> | gfx-avg-num-frame-deadline-missed | m6.6-eevdf-complete-1 |  -85.29   |
> | gfx-max-num-frame-deadline-missed | m6.6-eevdf-complete-1 |  -85.29   |
> | gfx-avg-num-frame-deadline-missed | m6.6-eevdf-complete-2 |   -84.8   |
> | gfx-max-num-frame-deadline-missed | m6.6-eevdf-complete-2 |   -84.8   |
> +-----------------------------------+-----------------------+-----------+
> 
> +----------------------------+-----------------------+-----------+
> |           metric           |          tag          | perc_diff |
> +----------------------------+-----------------------+-----------+
> | gfx-avg-high-input-latency |     m6.6-stock-1      |    0.0    |
> | gfx-max-high-input-latency |     m6.6-stock-1      |    0.0    |
> | gfx-avg-high-input-latency |     m6.6-stock-2      |   0.93    |
> | gfx-max-high-input-latency |     m6.6-stock-2      |   0.93    |
> | gfx-avg-high-input-latency | m6.6-eevdf-complete-1 |  -18.35   |
> | gfx-max-high-input-latency | m6.6-eevdf-complete-1 |  -18.35   |
> | gfx-avg-high-input-latency | m6.6-eevdf-complete-2 |  -18.05   |
> | gfx-max-high-input-latency | m6.6-eevdf-complete-2 |  -18.05   |
> +----------------------------+-----------------------+-----------+
> 
> +--------------+-----------------------+-----------+
> |    metric    |          tag          | perc_diff |
> +--------------+-----------------------+-----------+
> | gfx-avg-jank |     m6.6-stock-1      |    0.0    |
> | gfx-max-jank |     m6.6-stock-1      |    0.0    |
> | gfx-avg-jank |     m6.6-stock-2      |   1.56    |
> | gfx-max-jank |     m6.6-stock-2      |   1.56    |
> | gfx-avg-jank | m6.6-eevdf-complete-1 |  -82.81   |
> | gfx-max-jank | m6.6-eevdf-complete-1 |  -82.81   |
> | gfx-avg-jank | m6.6-eevdf-complete-2 |  -78.12   |
> | gfx-max-jank | m6.6-eevdf-complete-2 |  -78.12   |
> +--------------+-----------------------+-----------+
> 
> Bisecting through the patches in this series, I ended up with patch 08/10
> as the one that improved things overall for these benchmarks.
> 
> I'd like to investigate this further to understand the reason behind some of
> these dramatic improvements.
> 

Investigating these improvements a bit more, I noticed they came with significantly
higher power usage on the Pixel 6 (where EAS is enabled). I bisected it down to the delayed
dequeue patch. Disabling DELAY_DEQUEUE and DELAY_ZERO at runtime doesn't bring the power
usage back down.
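
For anyone wanting to repeat the runtime toggle, here is a minimal sketch. It assumes
debugfs is mounted at /sys/kernel/debug and that the kernel exposes the scheduler
feature file at sched/features there; the helper names are hypothetical and only for
illustration. A feature is set by writing its name and cleared by writing the name
with a NO_ prefix.

```python
# Assumed location of the scheduler feature file (debugfs at /sys/kernel/debug).
SCHED_FEATURES = "/sys/kernel/debug/sched/features"


def feature_token(name, enable):
    """Return the token the features file expects: NAME sets, NO_NAME clears."""
    return name if enable else "NO_" + name


def set_sched_feature(name, enable, path=SCHED_FEATURES):
    """Write a single feature token to the features file (root needed on a device)."""
    with open(path, "w") as f:
        f.write(feature_token(name, enable))


def enabled_features(path=SCHED_FEATURES):
    """Return the set of features currently listed without the NO_ prefix."""
    with open(path) as f:
        return {tok for tok in f.read().split() if not tok.startswith("NO_")}
```

On a device this would be called as set_sched_feature("DELAY_DEQUEUE", False) followed by
set_sched_feature("DELAY_ZERO", False), one token per write, and enabled_features() can be
used to confirm that both are cleared before starting a benchmark run.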

Though I don't fully understand the reason behind this change in behavior yet, I did spot
the benchmark processes running almost entirely on the big core cluster, with little
to no use of the little core and mid core clusters.

That would explain both the higher power usage and the significant jump in performance.

I wonder if the delayed dequeue logic is having an unwanted effect on the calculation of
utilization/load of the runqueue and, as a consequence, we're scheduling things to run at
higher OPPs on the big cores, leading to poor decisions for energy efficiency.