lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230125160935.7qms26b4yx3l2rbg@airbuntu>
Date:   Wed, 25 Jan 2023 16:09:35 +0000
From:   Qais Yousef <qyousef@...alina.io>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Kajetan Puchalski <kajetan.puchalski@....com>, mingo@...nel.org,
        peterz@...radead.org, dietmar.eggemann@....com, rafael@...nel.org,
        viresh.kumar@...aro.org, vschneid@...hat.com,
        linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
        lukasz.luba@....com, wvw@...gle.com, xuewen.yan94@...il.com,
        han.lin@...iatek.com, Jonathan.JMChen@...iatek.com
Subject: Re: [PATCH v4] sched/fair: unlink misfit task from cpu overutilized

On 01/25/23 09:37, Vincent Guittot wrote:
> On Wed, 25 Jan 2023 at 02:46, Kajetan Puchalski
> <kajetan.puchalski@....com> wrote:
> >
> > Hi,
> >
> > > By taking into account uclamp_min, the 1:1 relation between task misfit
> > > and cpu overutilized is no more true as a task with a small util_avg may
> > > not fit a high capacity cpu because of uclamp_min constraint.
> > >
> > > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > > a CPU except for the uclamp_min hint which is a performance requirement.
> > >
> > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > > can use this new value to take additional action to select the best CPU
> > > that doesn't match uclamp_min hint.
> > >
> >
> > Just wanted to include some more test data here to flag potential issues
> > with how all these changes use thermal pressure in the scheduler.
> >
> > For the tables below, 'baseline' is pre the already merged "uclamp fits
> > capacity" patchset.
> > 'baseline_ufc' is the current mainline with said patchset applied.
> > The 'no_thermal' variants are variants with thermal handling taken out
> > of util_fits_cpu akin to the v1 variant of the patchset.
> > The 'patched' variants are the above but with the v4 of the 'unlink misfit
> > task' patch applied as well.
> >
> > Geekbench 5:
> >
> > +-----------------+-------------------------+--------+-----------+
> > |     metric      |         kernel          | value  | perc_diff |
> > +-----------------+-------------------------+--------+-----------+
> > | multicore_score |        baseline         | 2765.4 |   0.0%    |
> > | multicore_score |      baseline_ufc       | 2704.3 |  -2.21%   | <-- mainline score regression
> > | multicore_score | baseline_ufc_no_thermal | 2870.8 |   3.81%   | <-- ~170 pts better without thermal
> > | multicore_score |     ufc_patched_v4      | 2839.8 |   2.69%   | <-- no more regression but worse than above
> > | multicore_score | ufc_patched_no_thermal  | 2891.0 |   4.54%   | <-- best score
> > +-----------------+-------------------------+--------+-----------+
> >
> > +--------------+--------+-------------------------+--------+-----------+
> > |  chan_name   | metric |         kernel          | value  | perc_diff |
> > +--------------+--------+-------------------------+--------+-----------+
> > | total_power  | gmean  |        baseline         | 2664.0 |   0.0%    |
> > | total_power  | gmean  |      baseline_ufc       | 2621.5 |   -1.6%   |
> > | total_power  | gmean  | baseline_ufc_no_thermal | 2710.0 |   1.73%   |
> > | total_power  | gmean  |     ufc_patched_v4      | 2729.0 |   2.44%   |
> > | total_power  | gmean  | ufc_patched_no_thermal  | 2778.1 |   4.28%   |
> > +--------------+--------+-------------------------+--------+-----------+
> >
> > (Jankbench scores show more or less no change, Jankbench power below)
> >
> > +--------------+--------+------------------------------+-------+-----------+
> > |  chan_name   | metric |            kernel            | value | perc_diff |
> > +--------------+--------+------------------------------+-------+-----------+
> > | total_power  | gmean  |        baseline_60hz         | 135.9 |   0.0%    |
> > | total_power  | gmean  |      baseline_ufc_60hz       | 155.7 |  14.61%   | <-- mainline power usage regression
> > | total_power  | gmean  | baseline_ufc_no_thermal_60hz | 134.5 |  -1.01%   | <-- no more regression without the thermal bit
> > | total_power  | gmean  |     ufc_patched_v4_60hz      | 131.4 |  -3.26%   | <-- no more regression with the patch either
> > | total_power  | gmean  | ufc_patched_no_thermal_60hz  | 140.4 |   3.37%   | <-- both combined increase power usage
> > +--------------+--------+------------------------------+-------+-----------+
> >
> > Specifically the swing of +15% power to -1% power by taking out thermal
> > handling from util_fits_cpu on the original patchset is noteworthy. It
> > shows that there's some subtle thermal-related interaction there taking
> > place that can have adverse effects on power usage.
> >
> > Even more interestingly, the 'unlink misfit task' patch appears to be
> > preventing said interaction from happening down the line as it has a
> > similar effect to that of just taking out the thermal bits.
> >
> > My only concern here is that without taking a closer look at the thermal
> > parts this power issue shown above could easily accidentally be
> > reintroduced at some point in the future.
> 
> Yes, the handling of thermal pressure needs some closer look. It has
> been agreed to keep the current behavior for now and have a closer
> look on thermal pressure as a next step. And your results for
> ufc_patched_v4_* provide some reasonably good perf and power figures.

Note that the GB perf improvements can be pure accidental because now some UI
tasks that are boosted end up spending more time on medium and big cores and
increasing the higher frequency residency there.

For example looking at my old result against v1 of my patches I can see for the
big cores the top frequency residency moves from 11.74% (mainline-sh 5.10
kernel) to 18.98%. I see tasks with uclamp_min value of 512 in the system.
On medium it improves by ~2% (30% to 32%).

Note that userspace now can dynamically set uclamp_min values and we don't
understand how this interaction goes.

I luckily had some old data of the results I collected when I ran v1 to
hopefully help demonstrate how uclamp_min behaves:

	Kernel B here is: Stock android12-5.10 without out of tree sched changes.
	Kernel C: Kernel B + v1 of my patches.

(Kernel A was the stock kernel in my experiments if you're wondering)

The first two tables show uclamp_min values per cpu. Note how with my fixes now
512 values are more dominant on cpus 4-7 (75% is 512).

The second set of tables shows the non zero uclamp_min values per task. Note
how on Kernel B the max is 398 for many tasks. On Pixel 6 the capacities are:

	L: 160
	M: 498
	B: 1024

498 * 0.8 = 398. So they get capped to 80% which is the migration margin
I resolved the relationship with.

Also note how AyncTask get boosted to 512 on Kernel C while they don't seem to
be boosted on Kernel B. Not sure why. I assume something with the cgroup
settings changes for some reason. Userspace configuration is not tuned to this
behavior anyway so I'm not sure what we see is optimal. So I'd take results
generally with a grain of salt. Note the 512 boost will result in these tasks
preferring big cores more to honour uclamp_min.

Also note how RenderThread and hwuiTask have big standard deviation as they are
dynamically adjusted. RenderThread 50% goes from 442 to 496; which is near the
top boundary of the medium core. Last set of tables show the count for how many
each value was seen for RenderThread for each kernel.

Overall this demonstrates there's a complex set of interactions going on. And
generally geekbench is not a workload that I'd expect to benefit from uclamp
and saying that we improved/regressed can be misleading. I can't say for sure,
but it seems to benefit from few happy accidents. We need to look at this in
the light of how uclamp_min is being used by userspace.

FWIW I'll be backporting these patches to GKI and will run another detailed
analysis. For now I think what we have makes sense and is a good improvement.

(I'm still testing v4 with my unit test - need a bit more time as I'm trying to
improve my unit test too)

--->8---


Kernel B uclamp CFS
       count      mean        std  min  25%  50%  75%    max
cpu
0    87542.0  0.172397   7.836326  0.0  0.0  0.0  0.0  512.0
1    87829.0  0.383404  11.910648  0.0  0.0  0.0  0.0  512.0
2    81713.0  0.371801  11.621681  0.0  0.0  0.0  0.0  512.0
3    85040.0  0.334113  11.160260  0.0  0.0  0.0  0.0  512.0
4    93270.0  1.032915  20.262032  0.0  0.0  0.0  0.0  512.0
5    93820.0  1.057003  20.036322  0.0  0.0  0.0  0.0  512.0
6    71159.0  4.901868  46.765630  0.0  0.0  0.0  0.0  512.0
7    78797.0  4.414445  43.265857  0.0  0.0  0.0  0.0  512.0

Kernel C uclamp CFS
       count        mean         std  min  25%    50%    75%    max
cpu
0    78514.0   55.845161  159.535262  0.0  0.0    0.0    0.0  512.0
1    78231.0   73.549296  179.461818  0.0  0.0    0.0    0.0  512.0
2    73310.0   80.742873  186.373848  0.0  0.0    0.0    0.0  512.0
3    69998.0   86.269151  191.512815  0.0  0.0    0.0    0.0  512.0
4    86887.0  169.776802  240.719169  0.0  0.0    0.0  512.0  512.0
5    87763.0  176.961875  243.162097  0.0  0.0    0.0  512.0  512.0
6    83600.0  257.773505  255.694248  0.0  0.0  506.0  512.0  512.0
7    74770.0  217.606380  252.568571  0.0  0.0    0.0  512.0  512.0

Kernel B uclamp SE
                  count        mean         std    min    25%    50%    75%    max
comm
AudioThread         8.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
Chrome_ChildIOT    35.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
Chrome_DevTools     2.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
Chrome_IOThread   132.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
Chrome_InProcGp    74.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
Chrome_ProcessL    22.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
CleanupReferenc     3.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
CookieMonsterBa     2.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
CookieMonsterCl     4.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
CrAsyncTask #1      3.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
MemoryInfra         5.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
NetworkService     20.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
RenderThread      773.0  350.175938  181.779995    6.0  225.0  448.0  512.0  512.0
SharedPreferenc     1.0  398.000000         NaN  398.0  398.0  398.0  398.0  398.0
Thread-2            1.0  398.000000         NaN  398.0  398.0  398.0  398.0  398.0
Thread-3            5.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
ThreadPoolForeg   156.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
ThreadPoolServi    47.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
ThreadPoolSingl    13.0  415.538462   42.810854  398.0  398.0  398.0  398.0  512.0
VizWebView        128.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
droid.launcher3    10.0  512.000000    0.000000  512.0  512.0  512.0  512.0  512.0
getprop            21.0  512.000000    0.000000  512.0  512.0  512.0  512.0  512.0
hwuiTask0         219.0  310.917808  190.832514    6.0   90.0  307.0  512.0  512.0
hwuiTask1         230.0  331.786957  186.994297    6.0  225.0  307.0  512.0  512.0
labs.geekbench5  1037.0  344.688525  180.838188    6.0  175.0  398.0  512.0  512.0
ll.splashscreen    99.0  457.060606  106.420692  253.0  512.0  512.0  512.0  512.0
mali-cmar-backe    17.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
mali-mem-purge     40.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
mali-utility-wo    18.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
ndroid.systemui    88.0  197.022727  197.388407   31.0   52.0   52.0  512.0  512.0
queued-work-loo     5.0  398.000000    0.000000  398.0  398.0  398.0  398.0  398.0
sh                  7.0  512.000000    0.000000  512.0  512.0  512.0  512.0  512.0

Kernel C uclamp SE
                    count        mean         std    min     25%    50%    75%    max
comm
AsyncTask #1     128468.0  512.000000    0.000000  512.0  512.00  512.0  512.0  512.0
AudioThread           8.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
Chrome_ChildIOT      35.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
Chrome_DevTools       2.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
Chrome_IOThread     154.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
Chrome_InProcGp      91.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
Chrome_ProcessL      18.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
CleanupReferenc       3.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
CookieMonsterBa       2.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
CookieMonsterCl       5.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
CrAsyncTask #1        3.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
MemoryInfra           4.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
NetworkService       18.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
RenderThread        965.0  380.451813  172.243131    6.0  235.00  496.0  512.0  512.0
SharedPreferenc       1.0  512.000000         NaN  512.0  512.00  512.0  512.0  512.0
Thread-2              1.0  436.000000         NaN  436.0  436.00  436.0  436.0  436.0
Thread-3              5.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
ThreadPoolForeg     161.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
ThreadPoolServi      38.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
ThreadPoolSingl      13.0  447.692308   28.540569  436.0  436.00  436.0  436.0  512.0
VizWebView          115.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
droid.launcher3      12.0  512.000000    0.000000  512.0  512.00  512.0  512.0  512.0
getprop              31.0  512.000000    0.000000  512.0  512.00  512.0  512.0  512.0
hwuiTask0           249.0  368.489960  174.171868    9.0  169.00  496.0  512.0  512.0
hwuiTask1           254.0  354.704724  179.083049    6.0  169.00  458.0  512.0  512.0
labs.geekbench5    1008.0  361.363095  181.320417    6.0  201.25  436.0  512.0  512.0
ll.splashscreen      73.0  422.493151   83.830206  283.0  446.00  446.0  458.0  512.0
mali-cmar-backe      21.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
mali-mem-purge       35.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
mali-utility-wo      17.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
ndroid.systemui     201.0  290.731343  195.719308   51.0  109.00  140.0  512.0  512.0
queued-work-loo       7.0  436.000000    0.000000  436.0  436.00  436.0  436.0  436.0
sh                   21.0  512.000000    0.000000  512.0  512.00  512.0  512.0  512.0

Kernel B RenderThread
512    361
253     59
307     40
84      39
264     37
52      36
225     30
16      28
398     21
472     19
448     13
43       5
216      5
375      5
138      4
87       4
143      4
112      4
501      4
21       4
190      4
6        4
67       4
25       4
175      4
347      3
13       3
126      3
93       3
103      3
129      3
100      2
78       2
413      2
97       2
242      1
75       1
31       1
115      1
81       1

Kernel C RenderThread
512    393
506     75
496     75
169     52
321     46
104     33
109     31
283     30
9       21
436     21
446     17
92      17
231     16
458     15
325     13
140     11
13       8
19       8
358      6
276      6
50       5
237      5
70       5
224      4
47       4
374      4
361      4
415      3
99       3
82       3
235      3
44       3
249      3
6        3
190      3
507      3
93       3
22       2
31       2
347      2
75       2
463      1
51       1

--->8---

Cheers

--
Qais Yousef

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ