linux-kernel - Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Message-ID: <CAKfTPtAMd_KNKhXXGk5MEibzzQUX3BFkWgxtEW2o8FFTX99DKw@mail.gmail.com>
Date:   Wed, 15 Nov 2023 21:01:59 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     "Zhang, Rui" <rui.zhang@...el.com>
Cc:     "Lu, Aaron" <aaron.lu@...el.com>,
        "pierre.gondois@....com" <pierre.gondois@....com>,
        "tj@...nel.org" <tj@...nel.org>,
        "dietmar.eggemann@....com" <dietmar.eggemann@....com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "longman@...hat.com" <longman@...hat.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
        Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
 during NOHZ idle balance

Hi Rui,

On Wed, 20 Sept 2023 at 09:24, Zhang, Rui <rui.zhang@...el.com> wrote:
>
> Hi, Pierre,
>
> Sorry for the late response. I'm still ramping up on the related code.
>
> On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
> >
> >
> > On 9/14/23 11:23, Zhang, Rui wrote:
> > > Hi, Pierre,
> > >
> > > >
> > > > Yes right indeed,
> > > > > This happens when putting a CPU offline (as you mentioned earlier,
> > > > > putting a CPU offline clears the CPU in the idle_cpus_mask).
> > > >
> > > > The load balancing related variables
> > >
> > > including?
> >
> > I meant the nohz idle variables in the load balancing, so I was
> > referring to:
> > (struct sched_domain_shared).nr_busy_cpus
> > (struct sched_domain).nohz_idle
> > nohz.idle_cpus_mask
> > nohz.nr_cpus
> > (struct rq).nohz_tick_stopped
>
> IMO, the problem is that, for an isolated CPU,
> 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared)
> 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased)
>
> But current code does not have a third state to describe this, so we
> need to either
> 1. add extra logic, like on_null_domain() checks
> or
> 2. rely on current logic, but update all related variables correctly,
> like you proposed.

Isn't the housekeeping cpumask there to handle such a case? I was
expecting your isolated CPU to be cleared from the housekeeping
cpumask used by the scheduler and the ILB.

I think that your solution is in the comment of the find_new_ilb() function:
"
 * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED is not set
 *   anywhere yet.
"

IMO, you should look at enabling and using HK_TYPE_SCHED for isolated CPUs.
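
To make the idea concrete, here is a minimal userspace sketch (a toy
model, not kernel code) of picking an idle load balancer only among
CPUs that are both idle and in a housekeeping mask. The names
toy_cpumask_t and toy_find_new_ilb() are hypothetical and exist only
for this illustration; the real find_new_ilb() operates on struct
cpumask and may perform additional checks.

```c
#include <stdint.h>

/* Toy stand-in for a kernel cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t toy_cpumask_t;

/*
 * Hypothetical model of ILB selection: return the lowest-numbered CPU,
 * other than this_cpu, that is set in both idle_mask and hk_mask.
 * Return -1 if no such CPU exists.
 */
int toy_find_new_ilb(toy_cpumask_t idle_mask, toy_cpumask_t hk_mask,
		     int this_cpu)
{
	/* Only CPUs that are idle AND housekeeping are candidates. */
	toy_cpumask_t candidates = idle_mask & hk_mask;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (cpu == this_cpu)
			continue;
		if (candidates & ((toy_cpumask_t)1 << cpu))
			return cpu;
	}

	return -1;
}
```

In this model, an isolated CPU that is cleared from hk_mask is never
selected, even if its bit is still set in idle_mask, so no extra
on_null_domain()-style check is needed at selection time.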

CCed Frederic to get his opinion

>
> But in any case, we should stick with one direction.
>
> If we follow the first one, the original patch should be used, which
> IMO is simple and straightforward.
> If we follow the latter one, we'd better audit and remove the current
> on_null_domain() usage at the same time. TBH, I'm not confident enough
> to make such a change. But if you want to propose something, I'd be glad
> to test it.
>
> thanks,
> rui
>
> >
> > >
> > > >   are unused if a CPU has a NULL
> > > > rq as it cannot pull any task. Ideally we should clear them once,
> > > > when attaching a NULL sd to the CPU.
> > >
> > > > This sounds good to me. But TBH, I don't have enough confidence to do
> > > > so because I'm not crystal clear about how these variables are used.
> > >
> > > Some questions about the code below.
> > > >
> > > > > The following snippet should do that and solve the issue you mentioned:
> > > > --- snip ---
> > > > --- a/include/linux/sched/nohz.h
> > > > +++ b/include/linux/sched/nohz.h
> > > > @@ -9,8 +9,10 @@
> > > >    #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > > >    extern void nohz_balance_enter_idle(int cpu);
> > > >    extern int get_nohz_timer_target(void);
> > > > +extern void nohz_clean_sd_state(int cpu);
> > > >    #else
> > > >    static inline void nohz_balance_enter_idle(int cpu) { }
> > > > +static inline void nohz_clean_sd_state(int cpu) { }
> > > >    #endif
> > > >
> > > >    #ifdef CONFIG_NO_HZ_COMMON
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq
> > > > *rq)
> > > >    {
> > > >           SCHED_WARN_ON(rq != this_rq());
> > > >
> > > > +       if (on_null_domain(rq))
> > > > +               return;
> > > > +
> > > >           if (likely(!rq->nohz_tick_stopped))
> > > >                   return;
> > > >
> > > > if we force clearing rq->nohz_tick_stopped when detaching domain,
> > > > why bother adding the first check?
> >
> > Yes you're right. I added this check for safety, but this is not
> > mandatory.
> >
> > >
> > > >
> > > > @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int
> > > > cpu)
> > > >           rcu_read_unlock();
> > > >    }
> > > >
> > > > +void nohz_clean_sd_state(int cpu) {
> > > > +       struct rq *rq = cpu_rq(cpu);
> > > > +
> > > > +       rq->nohz_tick_stopped = 0;
> > > > +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > > +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > > +               atomic_dec(&nohz.nr_cpus);
> > > > +       }
> > > > +       set_cpu_sd_state_idle(cpu);
> > > > +}
> > > > +
> > >
> > > detach_destroy_domains
> > >         cpu_attach_domain
> > >                 update_top_cache_domain
> > >
> > > > as we clear per_cpu(sd_llc, cpu) for the isolated cpu in
> > > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
> > > > no?
> >
> > Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls
> > have to be inverted to avoid what you just described.
> >
> > > It also seems that the current kernel doesn't decrease nr_busy_cpus
> > > when putting CPUs in an isolated partition. Indeed, if a CPU is counted
> > > in nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger
> > > any call to set_cpu_sd_state_idle().
> > > So it might be an additional argument.
> >
> > Thanks for reading the patch,
> > Regards,
> > Pierre
> >
> > >
> > > thanks,
> > > rui
> > > >    /*
> > > >     * This routine will record that the CPU is going idle with
> > > > tick
> > > > stopped.
> > > >     * This info will be used in performing idle load balancing in
> > > > the
> > > > future.
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
> > > > struct cpumask *cpu_map)
> > > >
> > > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > > >
> > > >           rcu_read_lock();
> > > > -       for_each_cpu(i, cpu_map)
> > > > +       for_each_cpu(i, cpu_map) {
> > > >                   cpu_attach_domain(NULL, &def_root_domain, i);
> > > > +               nohz_clean_sd_state(i);
> > > > +       }
> > > >           rcu_read_unlock();
> > > >    }
> > > >
> > > > --- snip ---
> > > >
> > > > Regards,
> > > > Pierre
> > > >
> > > > >
> > > > > >
> > > > > > > +       }
> > > > > > > +
> > > > > > >            /*
> > > > > > >             * The tick is still stopped but load could have
> > > > > > > been
> > > > > > > added in the
> > > > > > >             * meantime. We set the nohz.has_blocked flag to
> > > > > > > trig
> > > > > > > a
> > > > > > > check of the
> > > > > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int
> > > > > > > cpu)
> > > > > > >            if (rq->nohz_tick_stopped)
> > > > > > >                    goto out;
> > > > > > > -       /* If we're a completely isolated CPU, we don't
> > > > > > > play:
> > > > > > > */
> > > > > > > -       if (on_null_domain(rq))
> > > > > > > -               return;
> > > > > > > -
> > > > > > >            rq->nohz_tick_stopped = 1;
> > > > > > >            cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > > >
> > > > > > > > Otherwise I could reproduce the issue and the patch was solving it, so:
> > > > > > > Tested-by: Pierre Gondois <pierre.gondois@....com>
> > > > >
> > > > > Thanks for testing, really appreciated!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Also, your patch doesn't aim to solve that, but I think there is an issue
> > > > > > > > when updating cpuset.cpus when an isolated partition was already created:
> > > > > > >
> > > > > > > // Create an isolated partition containing CPU0
> > > > > > > # mkdir cgroup
> > > > > > > # mount -t cgroup2 none cgroup/
> > > > > > > # mkdir cgroup/Testing
> > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > > >
> > > > > > > // CPU0's sched domain is detached:
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > domain0  domain1
> > > > > > >
> > > > > > > // Change the isolated partition to be CPU1
> > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > > >
> > > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > domain0  domain1
> > > > > > >
> > > > > Interesting. Let me check and get back to you later on this. :)
> > > > >
> > > > > thanks,
> > > > > rui
> > >
>