Message-ID: <CALOAHbAHQ0UBn2GqRNWQwH32UPOuFo0b550oi6WCKr8+wFgdsw@mail.gmail.com>
Date:   Sat, 6 Nov 2021 15:40:02 +0800
From:   Yafang Shao <laoar.shao@...il.com>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Benjamin Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 2/4] sched/fair: Introduce cfs_migration

On Sat, Nov 6, 2021 at 1:01 AM Valentin Schneider
<valentin.schneider@....com> wrote:
>
> On 04/11/21 14:57, Yafang Shao wrote:
> > A new per-cpu kthread named "cfs_migration/N" is introduced to do
> > CFS-specific balance work. It is a FIFO task with priority FIFO-1,
> > which means it can preempt any CFS task but can't preempt other FIFO
> > tasks. The kthreads look as follows:
> >
> >     PID     COMMAND
> >     13      [cfs_migration/0]
> >     20      [cfs_migration/1]
> >     25      [cfs_migration/2]
> >     32      [cfs_migration/3]
> >     38      [cfs_migration/4]
> >     ...
> >
> >     $ cat /proc/13/sched
> >     ...
> >     policy                                       :                    1
> >     prio                                         :                   98
> >     ...
> >
> >     $ cat /proc/13/status
> >     ...
> >     Cpus_allowed:     0001
> >     Cpus_allowed_list:        0
> >     ...
> >
> > All pending work items are queued onto a singly linked list and are
> > serviced in FIFO order: the first item queued is the first one serviced.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@...il.com>
> > Cc: Valentin Schneider <valentin.schneider@....com>
> > Cc: Vincent Guittot <vincent.guittot@...aro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> > ---
> >  kernel/sched/fair.c | 93 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 93 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 87db481e8a56..56b3fa91828b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -20,6 +20,8 @@
> >   *  Adaptive scheduling granularity, math enhancements by Peter Zijlstra
> >   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
> >   */
> > +#include <linux/smpboot.h>
> > +#include <linux/stop_machine.h>
> >  #include "sched.h"
> >
> >  /*
> > @@ -11915,3 +11917,94 @@ int sched_trace_rq_nr_running(struct rq *rq)
> >          return rq ? rq->nr_running : -1;
> >  }
> >  EXPORT_SYMBOL_GPL(sched_trace_rq_nr_running);
> > +
> > +#ifdef CONFIG_SMP
> > +struct cfs_migrater {
> > +     struct task_struct *thread;
> > +     struct list_head works;
> > +     raw_spinlock_t lock;
>
> Hm so the handler of that work queue will now be a SCHED_FIFO task (which
> can block) rather than a CPU stopper (which can't), but AFAICT the
> callsites that would enqueue an item can't block, so having this as a
> raw_spinlock_t should still make sense.
>
> > +};
> > +
> > +DEFINE_PER_CPU(struct cfs_migrater, cfs_migrater);
> > +
> > +static int cfs_migration_should_run(unsigned int cpu)
> > +{
> > +     struct cfs_migrater *migrater = &per_cpu(cfs_migrater, cpu);
> > +     unsigned long flags;
> > +     int run;
> > +
> > +     raw_spin_lock_irqsave(&migrater->lock, flags);
> > +     run = !list_empty(&migrater->works);
> > +     raw_spin_unlock_irqrestore(&migrater->lock, flags);
> > +
> > +     return run;
> > +}
> > +
> > +static void cfs_migration_setup(unsigned int cpu)
> > +{
> > +     /* cfs_migration should have a higher priority than normal tasks,
> > +      * but a lower priority than other FIFO tasks.
> > +      */
> > +     sched_set_fifo_low(current);
> > +}
> > +
> > +static void cfs_migrater_thread(unsigned int cpu)
> > +{
> > +     struct cfs_migrater *migrater = &per_cpu(cfs_migrater, cpu);
> > +     struct cpu_stop_work *work;
> > +
> > +repeat:
> > +     work = NULL;
> > +     raw_spin_lock_irq(&migrater->lock);
> > +     if (!list_empty(&migrater->works)) {
> > +             work = list_first_entry(&migrater->works,
> > +                                     struct cpu_stop_work, list);
> > +             list_del_init(&work->list);
> > +     }
> > +     raw_spin_unlock_irq(&migrater->lock);
> > +
> > +     if (work) {
> > +             struct cpu_stop_done *done = work->done;
> > +             cpu_stop_fn_t fn = work->fn;
> > +             void *arg = work->arg;
> > +             int ret;
> > +
> > +             preempt_count_inc();
> > +             ret = fn(arg);
> > +             if (done) {
> > +                     if (ret)
> > +                             done->ret = ret;
> > +                     cpu_stop_signal_done(done);
> > +             }
> > +             preempt_count_dec();
> > +             goto repeat;
> > +     }
> > +}
>
> You're pretty much copying the CPU stopper setup, but that seems overkill
> for the functionality we're after: migrate a CFS task from one CPU to
> another. This shouldn't need to be able to run any arbitrary callback
> function.
>
> Unfortunately you are tackling both CFS active balancing and NUMA balancing
> at the same time, and right now they're plumbed a bit differently, which
> probably drove you to use something a bit more polymorphic. Ideally we
> should be making them use the same paths, but IMO it would be acceptable as
> a first step to just cater to CFS active balancing - folks that really care
> about their RT tasks can disable CONFIG_NUMA_BALANCING, but there is
> nothing to disable CFS active balancing.
>

Right. The code will be much simpler if we only handle CFS active
balancing in this patchset.
We have already disabled NUMA balancing through
/proc/sys/kernel/numa_balancing, so it is not a critical issue for us now.

>
> Now, I'm thinking the bare information we need is:
>
> - a task to migrate
> - a CPU to move it to
>
> And then you can do something like...
>
> trigger_migration(task_struct *p, unsigned int dst_cpu)
> {
>         work = { p, dst_cpu };
>         get_task_struct(p);
>         /* queue work + wake migrater + wait for completion */
> }
>
> cfs_migrater_thread()
> {
>         /* ... */
>         p = work->p;
>
>         if (task_rq(p) != this_rq())
>                 goto out;
>
>         /* migrate task to work->dst_cpu */
> out:
>         complete(<some completion struct>);
>         put_task_struct(p);
> }
>

Agreed.
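
To make sure I understand the suggested shape, below is a minimal sketch
of how I think it could look on top of the existing cfs_migrater plumbing.
The names (struct cfs_migration_work, trigger_cfs_migration()) are made up
here, and the actual migration step is only hinted at in a comment, so
please treat it as an illustration rather than final code:

struct cfs_migration_work {
	struct list_head	list;
	struct task_struct	*p;
	unsigned int		dst_cpu;
	struct completion	done;
};

/* Queue a migration request on p's current CPU and wait for it. */
static void trigger_cfs_migration(struct task_struct *p, unsigned int dst_cpu)
{
	struct cfs_migrater *migrater = &per_cpu(cfs_migrater, task_cpu(p));
	struct cfs_migration_work work = {
		.p	 = p,
		.dst_cpu = dst_cpu,
	};
	unsigned long flags;

	init_completion(&work.done);
	get_task_struct(p);

	raw_spin_lock_irqsave(&migrater->lock, flags);
	list_add_tail(&work.list, &migrater->works);
	raw_spin_unlock_irqrestore(&migrater->lock, flags);

	wake_up_process(migrater->thread);
	wait_for_completion(&work.done);
}

/* Per-CPU thread body: act on one request, but only if p is still here. */
static void cfs_migrater_thread(unsigned int cpu)
{
	struct cfs_migrater *migrater = &per_cpu(cfs_migrater, cpu);
	struct cfs_migration_work *work = NULL;

	raw_spin_lock_irq(&migrater->lock);
	if (!list_empty(&migrater->works)) {
		work = list_first_entry(&migrater->works,
					struct cfs_migration_work, list);
		list_del_init(&work->list);
	}
	raw_spin_unlock_irq(&migrater->lock);

	if (!work)
		return;

	if (task_rq(work->p) == cpu_rq(cpu)) {
		/*
		 * Migrate work->p to work->dst_cpu here, e.g. via the
		 * detach_task()/attach_one_task() pair used by active
		 * load balancing.
		 */
	}

	complete(&work->done);
	put_task_struct(work->p);
}

The completion keeps the work item valid on the caller's stack, and the
get_task_struct()/put_task_struct() pair keeps p alive until the request
has been handled, as in your sketch.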

>
> We should also probably add something to prevent the migration from
> happening after it is no longer relevant. Say if we have something like:
>
>   <queue work to migrate p from CPU0 to CPU1>
>   <FIFO-2 task runs for 42 seconds on CPU0>
>   <cfs_migration/0 now gets to run>
>
> p could have moved elsewhere while cfs_migration/0 was waiting to run. I'm
> thinking the source CPU could be useful information, but that doesn't tell
> you whether the task moved around in the meantime...
>
> WDYT?

Agreed.
It seems we'd better take the patch [1] I sent several weeks back.

[1]. https://lore.kernel.org/lkml/20210615121551.31138-1-laoar.shao@gmail.com/
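
To make the "no longer relevant" check concrete, something like the below
could be folded into the work item (the extra src_cpu field and the
idle_cpu() test are illustrative only; they are not taken verbatim from [1]):

/*
 * Revalidate a queued migration right before acting on it, so a request
 * that sat behind higher-priority FIFO tasks for a long time is dropped
 * once it no longer makes sense.
 */
static bool cfs_migration_still_valid(struct cfs_migration_work *work,
				      unsigned int this_cpu)
{
	struct task_struct *p = work->p;

	/* The task must still be where it was when the request was queued. */
	if (work->src_cpu != this_cpu || task_cpu(p) != this_cpu)
		return false;

	/* The destination must still be allowed for the task... */
	if (!cpumask_test_cpu(work->dst_cpu, p->cpus_ptr))
		return false;

	/* ...and still worth migrating to (e.g. still idle). */
	if (!idle_cpu(work->dst_cpu))
		return false;

	return true;
}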

-- 
Thanks
Yafang
