Message-ID: <27e1b3c6-7e5b-4e52-9ba4-4e08fe7a11fc@arm.com>
Date: Fri, 6 Sep 2024 16:32:20 +0100
From: Hongyan Xia <hongyan.xia2@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-kernel@...r.kernel.org, qyousef@...alina.io, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, vschneid@...hat.com,
 lukasz.luba@....com, mgorman@...e.de, rafael.j.wysocki@...el.com
Subject: Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of
 spare capacity

On 06/09/2024 08:08, Vincent Guittot wrote:
> On Mon, 2 Sept 2024 at 13:03, Hongyan Xia <hongyan.xia2@....com> wrote:
>>
>> On 30/08/2024 14:03, Vincent Guittot wrote:
>>> feec() looks for the CPU with the highest spare capacity in a PD, assuming
>>> that it will be the best CPU from an energy efficiency PoV because it will
>>> require the smallest increase of OPP. Although this is generally true,
>>> this policy also filters out some other CPUs which would be just as
>>> efficient because they use the same OPP.
>>> In fact, we really care about the cost of the new OPP that will be
>>> selected to handle the waking task. In many cases, several CPUs will end
>>> up selecting the same OPP and as a result using the same energy cost. In
>>> these cases, we can use other metrics to select the best CPU for the same
>>> energy cost.
>>>
>>> Rework feec() to look first for the lowest cost in a PD and then for the
>>> most performant CPU among the CPUs sharing that cost.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
>>> ---
>>>    kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
>>>    1 file changed, 244 insertions(+), 222 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e67d6029b269..2273eecf6086 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> [...]
>>>
>>> -     energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
>>> +/* For the same cost, select the CPU that will provide the best performance for the task */
>>> +static bool select_best_cpu(struct energy_cpu_stat *target,
>>> +                         struct energy_cpu_stat *min,
>>> +                         int prev, struct sched_domain *sd)
>>> +{
>>> +     /*  Select the one with the least number of running tasks */
>>> +     if (target->nr_running < min->nr_running)
>>> +             return true;
>>> +     if (target->nr_running > min->nr_running)
>>> +             return false;
>>>
>> This makes me a bit worried about systems with coarse-grained OPPs. All
>> my dev boards and one of my old phones have <= 3 OPPs. On my Juno board,
>> the lowest OPP on the big core spans across 512 utilization, half of the
>> full capacity. Assuming a scenario where there are 4 tasks, each with
>> 300, 100, 100, 100 utilization, the placement should be 300 on one core
>> and 3 tasks with 100 on another, but the nr_running check here would
>> give 2 tasks (300 + 100) on one CPU and 2 tasks (100 + 100) on another
>> because they are still under the lowest OPP on Juno. The second CPU will
>> also finish faster and idle more than the first one.
> 
> By balancing the number of tasks on each CPU, I try to minimize the
> scheduling latency. In your case above, a task will wait for no more
> than one slice before running, whereas it might have to wait up to 2
> slices if I put all the (100 utilization) tasks on the same CPU.

Viewed from another angle, we are now asking the 300 task (which
potentially has a heavier workload to finish) to compete with a 100
task, so one core finishes faster while the other takes longer, making
the overall execution time longer.
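
To put rough numbers on it, here is a user-space toy sketch (utilization
units only, not kernel code, assuming both placements stay within the
same OPP):

/* Toy arithmetic: per-CPU busy time for the two placements of the
 * 300/100/100/100 tasks above. */
#include <stdio.h>

int main(void)
{
        /* nr_running-balanced: (300 + 100) vs (100 + 100) */
        int nr_based[2]   = { 300 + 100, 100 + 100 };
        /* utilization-balanced: 300 alone vs (100 + 100 + 100) */
        int util_based[2] = { 300, 100 + 100 + 100 };

        printf("nr_running-based:  %d vs %d\n", nr_based[0], nr_based[1]);
        printf("utilization-based: %d vs %d\n", util_based[0], util_based[1]);
        return 0;
}

The nr_running-balanced split leaves one core busy roughly twice as long
as the other, whereas the utilization-balanced split keeps both equally
busy.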

>>
>> To give an extreme example, assuming the system has only one OPP (such a
>> system is dumb to begin with, but just to make a point), before this
>> patch EAS would still work okay in task placement, but after this patch,
> 
> Not sure what you mean by "would still work okay". Do you have an
> example in mind that would not work correctly?

With only one OPP, this patch will balance task placement purely on the
number of tasks without considering utilization, and I don't think
that's entirely acceptable (I actually have to deal with such a
single-OPP device in real life, although that is the device's fault).
Before this patch, we were still balancing on total utilization, which
results in the lowest execution time.
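
Here is a minimal mock of what the decision degenerates into in that
case (toy user-space code with hypothetical names; the real path goes
through the eenv/em_cpu_energy() machinery). With a single OPP every
candidate resolves to the same energy cost, so the first tie-break,
nr_running, decides and utilization is never consulted:

/* All candidates have identical cost, so pick by nr_running only. */
#include <stdio.h>

struct cand { int cpu; int nr_running; unsigned long util_sum; };

static int pick_fewest_tasks(const struct cand *c, int n)
{
        int i, best = 0;

        for (i = 1; i < n; i++)
                if (c[i].nr_running < c[best].nr_running)
                        best = i;
        return best;
}

int main(void)
{
        struct cand c[2] = {
                { .cpu = 0, .nr_running = 1, .util_sum = 300 },
                { .cpu = 1, .nr_running = 2, .util_sum = 100 },
        };

        /* picks cpu0 (fewer tasks) even though it already carries
         * three times the utilization of cpu1 */
        printf("picked cpu%d\n", c[pick_fewest_tasks(c, 2)].cpu);
        return 0;
}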

> 
>> EAS would just balance on the number of tasks, regardless of utilization
>> of tasks on wake-up.
> 
> You have to keep in mind that utilization is already taken into
> account to check if the task fits the CPU and by selecting the OPP
> (which is a no-op in the case of one OPP). So we know that there is
> enough capacity for the waking task.

Still, take my Juno board as an example, where the 1st OPP is at
utilization 512. Assuming no 25% margin, with four tasks of utilization
200, 200, 50, 50 and two CPUs, I would strongly favor 200 + 50 on one
CPU and the same on the other, over 200 + 200 on one and 50 + 50 on the
other. However, with this patch these two placements look identical, as
long as both stay under the 512 OPP.
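
For completeness, the same toy arithmetic for this case (both placements
keep two tasks per CPU and the same lowest-OPP cost, so the new
tie-break cannot tell them apart, yet per-CPU busy time differs a lot):

/* Toy comparison for the 200/200/50/50 example, not kernel code. */
#include <stdio.h>

int main(void)
{
        int favored[2]  = { 200 + 50,  200 + 50 };      /* 250 / 250 */
        int accepted[2] = { 200 + 200, 50 + 50 };       /* 400 / 100 */

        printf("favored placement:  %d vs %d util, 2 vs 2 tasks\n",
               favored[0], favored[1]);
        printf("accepted placement: %d vs %d util, 2 vs 2 tasks\n",
               accepted[0], accepted[1]);
        return 0;
}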

Of course, this becomes less of a problem with fine-grained OPPs. On my 
Pixel 6 with 18 OPPs on one CPU, I don't have such concerns.

>>
>> I wonder if there is a way to still take total utilization as a factor.
> 
> Utilization is still used to check that the task's utilization fits with
> the current CPU utilization and then to select the OPP. At this step we
> know that there is enough capacity for everybody.
> 
>> It used to be 100% of the decision making, but maybe now it is only 60%,
>> and the other 40% are things like number of tasks and contention.
>>
>>> -     trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
>>> +     /* Favor previous CPU otherwise */
>>> +     if (target->cpu == prev)
>>> +             return true;
>>> +     if (min->cpu == prev)
>>> +             return false;
>>>
>>> -     return energy;
>>> +     /*
>>> +      * Choose CPU with lowest contention. One might want to consider load instead of
>>> +      * runnable but we are supposed to not be overutilized so there is enough compute
>>> +      * capacity for everybody.
>>> +      */
>>> +     if ((target->runnable * min->capa * sd->imbalance_pct) >=
>>> +                     (min->runnable * target->capa * 100))
>>> +             return false;
>>> +
>>> +     return true;
>>>    }
>>> [...]
>>
