Message-ID: <5ccc3a77-64de-48fd-b118-f08aef9e1d2c@arm.com>
Date: Fri, 13 Sep 2024 11:09:52 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com,
 rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org,
 qyousef@...alina.io, hongyan.xia2@....com
Subject: Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS

Hello Vincent,

On 9/12/24 14:30, Vincent Guittot wrote:
> Hello Pierre,
> 
> On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@....com> wrote:
>>
>> Hello Vincent,
>>
>> On 8/30/24 15:03, Vincent Guittot wrote:
>>> EAS is based on wakeup events to efficiently place tasks on the system, but
>>> there are cases where a task no longer has wakeup events, or has them at
>>> far too low a pace. For such situations, we can take advantage of the task
>>> being put back in the enqueued list to check whether it should be migrated
>>> to another CPU. When the task is the only one running on the CPU, the tick
>>> will check if the task is stuck on this CPU and should migrate to another
>>> one.
>>>
>>> Wakeup events remain the main way to migrate tasks, but we now detect
>>> situations where a task is stuck on a CPU by checking that its utilization
>>> is larger than the max available compute capacity (max CPU capacity or
>>> uclamp max setting).
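
[For illustration: a minimal sketch of the stuck-task test described in the
changelog above. The patch's task_misfit_cpu() helper is not visible in this
hunk, so the helper below is hypothetical; arch_scale_cpu_capacity(),
task_util_est() and uclamp_eff_value() are existing kernel APIs.]

	/*
	 * Hypothetical sketch: the task is considered stuck when its
	 * utilization exceeds the max compute capacity it can obtain on
	 * this CPU (CPU capacity clamped by its effective uclamp max).
	 * The real task_misfit_cpu() from the patch may differ.
	 */
	static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
	{
		unsigned long max_cap = arch_scale_cpu_capacity(cpu);
		unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);

		return task_util_est(p) > min(max_cap, uclamp_max);
	}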
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
>>> ---
>>>    kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
>>>    kernel/sched/sched.h |   2 +
>>>    2 files changed, 213 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e46af2416159..41fb18ac118b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>
>> [snip]
>>
>>> +
>>> +static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
>>> +{
>>> +     int new_cpu, cpu = cpu_of(rq);
>>> +
>>> +     if (!sched_energy_enabled())
>>> +             return;
>>> +
>>> +     if (WARN_ON(!p))
>>> +             return;
>>> +
>>> +     if (WARN_ON(p != rq->curr))
>>> +             return;
>>> +
>>> +     if (is_migration_disabled(p))
>>> +             return;
>>> +
>>> +     if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
>>> +             return;
>>
>> I tried the code on a Pixel6 with the following setup:
>> - without the above (rq->nr_running > 1) condition
>> - without the push task mechanism
>> i.e. tasks without regular wakeups only have the opportunity to
>> run feec() via the sched_tick. This seemed sufficient to avoid
>> the problems you mentioned:
>> - having unbalanced UCLAMP_MAX tasks in a pd, e.g. 1 UCLAMP_MAX task
>>     per little CPU, except one little CPU with N UCLAMP_MAX tasks
>> - downgrading UCLAMP_MAX tasks that could run on smaller CPUs
>>     but have no wakeups and thus don't run feec()
> 
> The main problem with your test is that you always call feec() for the
> running task, so you always have to wake up the migration thread to
> migrate the currently running thread, which is quite inefficient. The
> push mechanism only takes a task which is not the currently running one,
> so we don't need to wake up the migration thread, which is simpler and
> more efficient. We check only one task at a time and will not loop over
> an unbounded number of tasks after a task switch or a tick.
> 
>>
>> Thus I was wondering if it would not be better to integrate EAS
>> into the load balancer instead (not my idea, but I don't remember
>> who suggested it).
> 
> My first thought was also to use load balancing to pull tasks that were
> stuck on the wrong CPU (as mentioned in [1]), but this solution is not
> scalable, as we don't want to test all runnable tasks on a CPU, and it's
> not easy to know which CPU and which tasks should be checked.
> 
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
> 
>> Or otherwise whether just running feec() through the sched_tick path
>> would be sufficient (i.e. this patch minus the push mechanism).
> 
> As mentioned above, the push mechanism is more efficient than active migration.
> 
> 
>>
>>> +
>>> +     if (!task_misfit_cpu(p, cpu))
>>> +             return;
>>> +
>>> +     new_cpu = find_energy_efficient_cpu(p, cpu);
>>> +
>>> +     if (new_cpu == cpu)
>>> +             return;
>>> +
>>> +     /*
>>> +      * ->active_balance synchronizes accesses to
>>> +      * ->active_balance_work.  Once set, it's cleared
>>> +      * only after active load balance is finished.
>>> +      */
>>> +     if (!rq->active_balance) {
>>> +             rq->active_balance = 1;
>>> +             rq->push_cpu = new_cpu;
>>> +     } else
>>> +             return;
>>> +
>>> +     raw_spin_rq_unlock(rq);
>>> +     stop_one_cpu_nowait(cpu,
>>> +             active_load_balance_cpu_stop, rq,
>>> +             &rq->active_balance_work);
>>> +     raw_spin_rq_lock(rq);
>>
>> I didn't hit any error, but isn't this susceptible to the issue fixed by:
>>     commit f0498d2a54e7 ("sched: Fix stop_one_cpu_nowait() vs hotplug")
>>
> 
> I will recheck, but being called from the tick, for the local CPU, and
> with the running thread not being cpu_stopper_thread should protect us
> from the case described in that commit.
> 
>>
>>> +}
>>> +
>>> +static inline int has_pushable_tasks(struct rq *rq)
>>> +{
>>> +     return !plist_head_empty(&rq->cfs.pushable_tasks);
>>> +}
>>> +
>>> +static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
>>> +{
>>> +     struct task_struct *p;
>>> +
>>> +     if (!has_pushable_tasks(rq))
>>> +             return NULL;
>>> +
>>> +     p = plist_first_entry(&rq->cfs.pushable_tasks,
>>> +                           struct task_struct, pushable_tasks);
>>> +
>>> +     WARN_ON_ONCE(rq->cpu != task_cpu(p));
>>> +     WARN_ON_ONCE(task_current(rq, p));
>>> +     WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
>>> +
>>> +     WARN_ON_ONCE(!task_on_rq_queued(p));
>>> +
>>> +     /*
>>> +      * Remove task from the pushable list as we try only once after
>>> +      * task has been put back in enqueued list.
>>> +      */
>>> +     plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>>> +
>>> +     return p;
>>> +}
>>> +
>>> +/*
>>> + * See if the non running fair tasks on this rq
>>> + * can be sent to some other CPU that fits better with
>>> + * their profile.
>>> + */
>>> +static int push_fair_task(struct rq *rq)
>>> +{
>>> +     struct task_struct *next_task;
>>> +     struct rq *new_rq;
>>> +     int prev_cpu, new_cpu;
>>> +     int ret = 0;
>>> +
>>> +     next_task = pick_next_pushable_fair_task(rq);
>>> +     if (!next_task)
>>> +             return 0;
>>> +
>>> +     if (is_migration_disabled(next_task))
>>> +             return 0;
>>> +
>>> +     if (WARN_ON(next_task == rq->curr))
>>> +             return 0;
>>> +
>>> +     /* We might release rq lock */
>>> +     get_task_struct(next_task);
>>> +
>>> +     prev_cpu = rq->cpu;
>>> +
>>> +     new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
>>> +
>>> +     if (new_cpu == prev_cpu)
>>> +             goto out;
>>> +
>>> +     new_rq = cpu_rq(new_cpu);
>>> +
>>> +     if (double_lock_balance(rq, new_rq)) {
>>
>>
>> I think it might be necessary to check the following:
>>     if (task_cpu(next_task) != rq->cpu) {
>>       double_unlock_balance(rq, new_rq);
> 
> Yes good point
> 
>>       goto out;
>>     }
>>
>> Indeed I've been hitting the following warnings:
>> - uclamp_rq_dec_id()::SCHED_WARN_ON(!bucket->tasks)
>> - set_task_cpu()::WARN_ON_ONCE(state == TASK_RUNNING &&
>>                       p->sched_class == &fair_sched_class &&
>>                       (p->on_rq && !task_on_rq_migrating(p)))
>> - update_entity_lag()::SCHED_WARN_ON(!se->on_rq)
>>
>> and it seemed to be caused by the task not being on the initial rq anymore.
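
[For archive readers: a consolidated sketch of the suggested recheck in
context, using the variables from the patch's push_fair_task() above; the
migration step itself is abbreviated rather than quoted from the patch.]

	if (double_lock_balance(rq, new_rq)) {
		/*
		 * double_lock_balance() may drop rq's lock, so next_task
		 * can have been migrated or dequeued in the meantime;
		 * re-validate before touching it.
		 */
		if (task_cpu(next_task) != rq->cpu) {
			double_unlock_balance(rq, new_rq);
			goto out;
		}

		/* ... migrate next_task to new_cpu as in the patch ... */
	}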
> 
> Do you have a particular use case to trigger this? I haven't faced
> this in the various stress tests that I did.

It was triggered by this workload, but it was on a Pixel6 device,
so there might be more background activity:
- 8 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle:100%, period:16ms]
- 4 tasks each bound to a little CPU with: [UCLAMP_MIN:0, UCLAMP_MAX:1024, duty_cycle:4%, period:4ms]
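
(For reference, a minimal sketch of one of the [UCLAMP_MAX:1,
duty_cycle:100%] tasks above, setting the clamp via sched_setattr(). It
assumes a kernel with CONFIG_UCLAMP_TASK (v5.3+); this busy loop is
illustrative only, not the actual test harness used.)

	#define _GNU_SOURCE
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/sched.h>          /* SCHED_NORMAL, SCHED_FLAG_UTIL_CLAMP_* */
	#include <linux/sched/types.h>    /* struct sched_attr */

	int main(void)
	{
		struct sched_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_policy = SCHED_NORMAL;
		attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN |
				   SCHED_FLAG_UTIL_CLAMP_MAX;
		attr.sched_util_min = 0;  /* UCLAMP_MIN:0 */
		attr.sched_util_max = 1;  /* UCLAMP_MAX:1 */

		/* No glibc wrapper for sched_setattr() on most systems. */
		if (syscall(SYS_sched_setattr, 0, &attr, 0))
			return 1;

		for (;;)
			; /* 100% duty cycle: never sleeps, so no wakeups and no feec() */
	}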

Regards,
Pierre
