Message-ID: <7f1f7dd0-e3b5-4e16-a44e-c08fca567f97@arm.com>
Date: Mon, 4 Dec 2023 01:48:38 +0000
From: Hongyan Xia <hongyan.xia2@....com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <morten.rasmussen@....com>,
Lukasz Luba <lukasz.luba@....com>,
Christian Loehle <christian.loehle@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/6] sched: uclamp sum aggregation
Hi Qais,
On 03/12/2023 00:25, Qais Yousef wrote:
> Hi Hongyan
>
> On 10/04/23 10:04, Hongyan Xia wrote:
>> Current implementation of uclamp leaves many things to be desired.
>> There are several problems:
>>
>> 1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
>> with the default value, which is 1024) can ruin the previous
>> settings the moment it is enqueued, as shown in the uclamp frequency
>> spike problem in Section 5.2 of
>> Documentation/scheduler/sched-util-clamp.rst. Constantly running at
>> 1024 utilization implies that the CPU is driven at its max capacity.
>> However, with UCLAMP_MAX, this assumption is no longer true. To
>> mitigate this, one idea is to filter out uclamp values for
>> short-running tasks. However, the effectiveness of this idea remains
>> to be seen.
>
> This was proved when I was at arm and it is used in products now too. It is
> effective.
>
>>
>> 2. No way to differentiate between a UCLAMP_MAX throttled CPU and a CPU
>> running at its peak, as both show utilization of 1024. However, in
>> certain cases it's important to tell the difference, as we still want
>> to take the opportunity to enqueue tasks in the former case.
>>
>> It's also debatable whether uclamp should be a frequency hint. An
>
> It is not debatable. Uclamp is a performance hint which translates into task
> placement (for HMP) and frequency selection. On SMP the latter is the only
> meaning for uclamp. For the time being at least until we get a new technology
> that can impact performance ;-)
I agree. In this new series, uclamp is still a performance hint which
affects task placement and frequency, so we don't have a disagreement here.
>> example is a system with a mid CPU of capacity 500, and a little CPU
>> with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
>> util_fits_cpu() will schedule them on the mid CPU because feec()
>> thinks they don't fit on the little, even though we should use the
>> little core at some point instead of making the mid even more crowded.
>
> This is not a problem. One of the major design points for uclamp is to allow
> small tasks to request to run faster. So if you have 100 tasks that run for
> 100us and to meet that they need to run at mid CPU that is fine. That doesn't
> necessarily mean they'll overutilize the cluster.
I understand. This design point is what sum aggregation does as well.
This alone is not a problem, but moving a task to another CPU to fit its
uclamp_min requirement can be a problem when the destination CPU is
crowded and mixed with uclamp_max tasks. I will elaborate on this below.
>
> If the cluster gets overutilized, then EAS will be disabled and
> find_idle_capacity() will spread into highest spare capacity idle CPU, which
> includes little CPUs. And this behavior is no different from having tasks with util
> greater than 100 wanting to run on mid cores. The only difference is that we
> can pack more on mid core before overutilized can be triggered. Which is the
> desired effect.
>
> So this is a no issue and we should spread once the cluster is busy.
>> Of course, this in reality doesn't happen because of other mechanisms
>> like load balancing, but it's also not good when other mechanisms can
>> cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
>> 1ms or for 1000ms gives completely different performance levels, so
>> trying to fit only the frequency does not give us any performance
>> guarantees. If we then talk about not only running at some frequency but
>> also for some amount of time, then what we really mean is a capacity
>> hint, not a frequency hint.
>
> I'm not sure I parsed that. But again, it's a performance hint which implies
> both.
>
>>
>> It's even worse when the tasks scheduled on the mid CPU also have
>> UCLAMP_MAX values. In the previous example, instead of running at 500, a
>> single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
>> lower frequency than 500, so it is then a really bad decision to honor
>> the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
>> running it on the little CPU which is less crowded, and has pretty much
>> the same capacity (100 vs 110) anyway.
>
> I can't follow this reasoning.
>
> It seems you're assuming the user has picked 110 by mistake and it is okay to
> ignore it.
>
> FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1
> because they're so crappy to handle their workload. It is still better than
> using affinity as when the system gets busy we can still use the little core.
> But only when the system is busy and crowded already.
>
> I don't know what the reference to the mid core having a task with
> uclamp_max is, but it seems like a bug in feec() if it's not predicting the frequency
> correctly.
The example in the cover letter might be complex. I can simplify it as an
"N tasks" problem.
Say we have 2 CPUs, one with capacity 1024 and the other 512. Assume
there's a task with a uclamp_min of 513 and a uclamp_max of 800. The big
CPU can then, in theory, fit an infinite number of such tasks without
triggering overutilization. What does uclamp_min even mean when this
happens? uclamp_min should be a hint for the minimum performance a task
should get, but it breaks down when there are N such tasks, because the
effect of uclamp_min is divided by N. Under max aggregation, uclamp_min
doesn't give you the "performance" of this task. It gives the performance
of the CPU, and that is not the performance of this task when N > 1. In
fact, N can be infinity. This is the same problem shared by multiple
uclamp users at LPC, where max aggregation hints the performance of the
CPU, but not of the task itself.
As to "FYI, one of the use cases is to set uclamp_min to
capacity_of(little_core)+1 because they're so crappy to handle their
workload", I'm afraid I disagree with this. I think uclamp is a simple
hinting mechanism to the scheduler to say: please try to give me
[uclamp_min, uclamp_max] performance. It's then the job of EAS to pick
the most efficient CPU. We know the little CPU is crappy because we have
deep knowledge of the device, but uclamp values could just come from
game developers (using ADPF) who have no knowledge of the SoC. From the
game developer's point of view, asking for a uclamp of little CPU
capacity + 1 simply means they want that level of performance, not that
the little CPU should be avoided. I don't think a soft-affinity-like
property belongs here. We should improve EM and let EAS do its job
instead of teaching it to interpret uclamp as affinity hints.
>>
>> This series attempts to solve these problems by tracking a
>> util_avg_uclamp signal in tasks and cfs_rq. At task level,
>
> This was discussed offline and brought several times on the list already.
> uclamp is a performance not a bandwidth hint. I remember Dietmar was suggesting
> this concept in the past and I objected that this converts uclamp into
> a bandwidth hint. This overlaps with CFS bandwidth controller. We can use that
> if somebody wants to limit the bandwidth the task can consume. uclamp is about
> limiting the performance level it can reach. ie: it is not about cycles
> consumed, but Performance Point reached.
I'd argue the current max aggregation implementation is further away
from being a performance hint. The moment there are multiple tasks
interacting with each other on the same rq, the performance hinting
ability of max aggregation tends to degrade: uclamp_min gets divided by
N, and we see frequency spikes and uneven task distribution (both shown
in the evaluation section of my cover letter), whereas sum aggregation
still performs well.
Other shortcomings are not that critical, but the fact that uclamp_min's
effectiveness is divided by N under max aggregation is, I think, not
acceptable.
>> p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
>> clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
>> must always be the sum of all util_avg_uclamp of all the entities on
>> this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
>> aggregation of all the clamped values, which indicates the frequency
>> this rq should run at and what the utilization is.
>
> This is a no go for me. A task's performance requirements cannot be added up to
> resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of
> 1024 should allow us to pack them on the big core. Which is an important use
> case. A bunch of small tasks that need to run faster are okay to pack on
> a bigger core as long as this doesn't cause
> overutilized to trigger.
I see. I now understand where we start to diverge. It seems we want
completely different things from uclamp.
Max aggregation: please try to place me on a CPU that can give CPU
capacity [uclamp_min, uclamp_max] and run that CPU within that range.
Sum aggregation: please try to treat me like a task with utilization
within [uclamp_min, uclamp_max].
Like I mentioned in the previous comment, I'm afraid I disagree with
using uclamp as a CPU affinity hint, even if it's just a soft affinity.
Its job should just be a performance hint and EM should tell EAS which
CPU to pick under that hint. If the little CPU is crappy, then the
crappiness should be reflected in the EM, and maybe dynamic EM can help.
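To spell out the difference at the rq level, here is a rough sketch of
the two aggregation rules (the struct, the values and the formulas are
made-up simplifications for illustration, not the actual mainline or
patched kernel code):

#include <stdio.h>

#define CLAMP(v, lo, hi) ((v) < (lo) ? (lo) : ((v) > (hi) ? (hi) : (v)))
#define MAX(a, b)	 ((a) > (b) ? (a) : (b))

struct tsk { int util, umin, umax; };	/* hypothetical mini task struct */

int main(void)
{
	/* three runnable tasks on one rq, values picked arbitrarily */
	struct tsk t[] = {
		{ 100, 512, 1024 },	/* small task boosted by uclamp_min */
		{ 300,   0,  200 },	/* medium task capped by uclamp_max */
		{  50,   0, 1024 },	/* small unclamped task */
	};
	int n = 3, sum = 0, umin = 0, umax = 0, sum_clamped = 0;

	for (int i = 0; i < n; i++) {
		/* max aggregation: clamp the total once, by the max clamps */
		sum  += t[i].util;
		umin  = MAX(umin, t[i].umin);
		umax  = MAX(umax, t[i].umax);
		/* sum aggregation: clamp each contribution, then add */
		sum_clamped += CLAMP(t[i].util, t[i].umin, t[i].umax);
	}

	printf("max aggregation: %d\n", CLAMP(sum, umin, umax));	/* 512 */
	printf("sum aggregation: %d\n", sum_clamped);			/* 762 */
	return 0;
}

Under max aggregation the whole rq inherits the clamps of the "loudest"
task; under sum aggregation each task carries its own clamped
contribution, which is what makes the signal behave like a per-task
capacity rather than a per-CPU one.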
>>
>> This idea solves Problem 1 by capping the utilization of an
>> always-running task throttled by UCLAMP_MAX. Although the task (denoted
>> by Task A) has no idle time, the util_avg_uclamp signal gives its
>> UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
>> a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
>> UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
>> which means we no longer have the frequency spike problem caused by B.
>> This should mean that we might completely avoid the need for uclamp
>> filtering.
>>
>> It also solves Problem 2 by tracking the capped utilization of a rq.
>> Using util_avg_uclamp, we are able to differentiate between a CPU
>> throttled by UCLAMP_MAX and one that is actually running at its peak
>> capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
>> report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
>> placement sees spare capacity on this CPU and will allocate some tasks
>> to it, instead of treating it the same way as a CPU actually running at
>
> We don't need to know the actual spare capacity, though that would be nice to
> have.
>
> But generally in my view this CPU should be available for a non-capped task to
> run on. I view this as a load balancer problem where the wake up path can consider
> this CPU as a candidate and rely on load balancer to move this task away to
> another (smaller) CPU ASAP. If this capped task is already on the smaller CPU
> and there's nowhere else to place the uncapped task in the system then we must
> be quite busy already.
I'm not quite sure relying on the load balancer is a good idea, because
it means we need uclamp code in the load balancer and the complexity
keeps growing. An example: after applying the patch
'sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0',
the scheduler now knows that CPUs with 0 spare capacity are candidates,
but this creates a very unbalanced task placement and CPU0 tends to
receive a lot of tasks in my tests. This would certainly need some
uclamp-aware code in the load balancer, whereas sum aggregation has no
such problem and balance is achieved during task placement in the first
place.
> If userspace ends up capping too many tasks too hard that we end up with no
> idle time in the system (where otherwise we should) I consider this a bad usage
> in the userspace and will ask them not to cap that hard. This is a case of one
> shooting themselves in the foot. Maybe we need to care in the future, but so
> far this doesn't seem a valid problem to address. I'd like to see why userspace
> has failed to do proper management first before jumping to fix this.
>
> What I'd expect is more use cases of tasks that are truly 1024 but being capped.
> In this case there's truly no idle time on the CPU and your proposal won't
> address this. I don't think we have a way to distinguish between the two cases
> at the moment.
I don't quite get what you mean. In the examples in my cover letter the
capped tasks are truly 1024. Take two CPUs, one with an uncapped 1024
task and the other with a 1024 task capped by a uclamp_max of 300. The
scheduler sees a cfs_rq->avg.util_avg_uclamp of 1024 on the first CPU and
300 on the 2nd, so my proposal indeed sees the difference and has the
opportunity to place more tasks on the 2nd CPU without any corner-case
code.
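As a back-of-the-envelope version of that example (same CLAMP
simplification as before; the spare-capacity arithmetic is an
illustration, not the actual util_fits_cpu() logic):

#include <stdio.h>

#define CLAMP(v, lo, hi) ((v) < (lo) ? (lo) : ((v) > (hi) ? (hi) : (v)))

int main(void)
{
	const int capacity = 1024;

	/* CPU0: always-running uncapped task; CPU1: always-running task
	 * capped by uclamp_max = 300 */
	int cpu0 = CLAMP(1024, 0, 1024);
	int cpu1 = CLAMP(1024, 0, 300);

	/*
	 * With sum aggregation the rq signal is the sum of the clamped
	 * task contributions, so the two CPUs are distinguishable and
	 * CPU1 shows room for placement; with max aggregation both
	 * would report 1024.
	 */
	printf("CPU0: util_avg_uclamp=%d, spare=%d\n", cpu0, capacity - cpu0);
	printf("CPU1: util_avg_uclamp=%d, spare=%d\n", cpu1, capacity - cpu1);
	return 0;
}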
>> the peak. This brings us closer to the benefits of Android's sum
>> aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
>> https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
>> which shows energy benefits because we are able to schedule tasks on
>> smaller cores which are UCLAMP_MAX capped, instead of finding a fresh
>
> You're jumping to the wrong conclusion about why this is being used. This was
> to overcome issues when using cpu.shares. This has nothing to do with
> UCLAMP_MAX. uclamp_max filter is what is being used. Based on the proposal
> I posted long ago
>
> https://android-review.googlesource.com/c/kernel/common/+/1914696
>
Maybe I need to rephrase. I said that GROUP_THROTTLE can give us some
energy benefits. I didn't draw any conclusion that this was why it was
invented. I was also trying to suggest that, under sum aggregation, the
same code can work for both uclamp and GROUP_THROTTLE. However, I
probably need to better understand the issues around cpu.shares before
saying this concretely.
Also, uclamp filtering might be another example of why I think max
aggregation is over-complicated. Sum aggregation has only around 300
lines of code so far without the need for filtering at all.
>> big CPU. However, a major difference is that this series has an
>> amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
>> kernels.
>>
>> It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum
>
> This is not a shortcoming. Performance implies frequency. We should not
> confuse how schedutil works to estimate perf requirements based on historical
> run time (util_avg) and the fact that a task with short historical runtime
> needs to run faster. Running faster means increasing frequency.
>
>> aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
>> will avoid the problem of enqueueing too many tasks just to fit the
>> frequency, and will place tasks on little CPUs in the previous example.
>
> I think the issue of uclamp being a performance but not a bandwidth hint was
> raised several times. Did I miss something that makes this not convert it into
> a bandwidth hint?
>
I wonder what a performance hint even means if it doesn't at least
correlate to bandwidth in some way. Yes, specifying uclamp_min under max
aggregation makes the CPU run at least at that frequency, but if you have
N tasks on the same rq, then each task only gets a share of uclamp_min/N.
I highly doubt that when a mobile phone developer increases the
uclamp_min of an application, he or she is okay with the CPU frequency
doubling while the task runs for only 1/10 of the time because it got
moved to a crowded big CPU, which essentially gives 0.2x the
"performance" in my opinion. So I'm not sure fully disconnecting
bandwidth from performance is meaningful. Of course, a (not yet existing)
uclamp-aware load balancer could come in and migrate those tasks, but why
not avoid this problem in the first place via sum aggregation, without
involving the load balancer, especially when today we don't have any
indication of how well a uclamp-aware load balancer can perform anyway?
In the end I may still not fully grasp what the issues of sum
aggregation are. So far I gather that it's closer to "bandwidth" than
"performance" (though I think "performance" doesn't mean much if it
doesn't correlate to bandwidth), and that it doesn't work as a CPU
affinity hint (it actually does to some degree, but the uclamp values
need to be different). But I think focusing on these ignores the whole
picture, which is the benefits of sum aggregation (already presented in
the evaluation section of the cover letter of this series):
1. It's fewer than 350 lines of code (max aggregation is 750+ and still
growing).
2. It doesn't suffer from frequency spikes or load balancing issues.
3. It needs no code to consider uclamp as a corner case in the
scheduler. No need to carry around uclamp_min and uclamp_max in so many
functions. util_fits_cpu() is just one line of code.
4. It's effective. When you specify a uclamp_min of X, the scheduler
really tries to give X to your task, not to the CPU you are on, as seen
in the 54% reduction of jank in jankbench, without increasing energy
consumption.
5. It looks like something multiple developers asked for at LPC 2023. The
use cases of "additive uclamp" and hinting the host from the VM via
util_guest can be achieved exactly by uclamp sum aggregation.
Given these benefits, I think it's probably a good idea to give sum
aggregation a serious look. On my Pixel 6, sum aggregation either gives
better benchmark scores at the same energy as max aggregation, or
consumes less power at the same benchmark scores. I'll keep benchmarking
it but so far so good.
Hongyan