linux-kernel - Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Message-ID: <a2a16c0e198a6d722b8923b0eec15dd2b32e4320.camel@intel.com>
Date:   Thu, 14 Sep 2023 09:23:18 +0000
From:   "Zhang, Rui" <rui.zhang@...el.com>
To:     "Lu, Aaron" <aaron.lu@...el.com>,
        "pierre.gondois@....com" <pierre.gondois@....com>
CC:     "peterz@...radead.org" <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "dietmar.eggemann@....com" <dietmar.eggemann@....com>,
        "tj@...nel.org" <tj@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
 during NOHZ idle balance

Hi, Pierre,

> 
> Yes right indeed,
> This happens when putting a CPU offline (as you mentioned earlier,
> putting a CPU offline clears the CPU in the idle_cpus_mask).
> 
> The load balancing related variables

including?

>  are unused if a CPU has a NULL
> sd as it cannot pull any task. Ideally we should clear them once,
> when attaching a NULL sd to the CPU.

This sounds good to me. But TBH, I don't have enough confidence to do
so because I'm not crystal clear about how these variables are used.

Some questions about the code below.
> 
> The following snippet should do that and solve the issue you
> mentioned:
> --- snip ---
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -9,8 +9,10 @@
>   #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>   extern void nohz_balance_enter_idle(int cpu);
>   extern int get_nohz_timer_target(void);
> +extern void nohz_clean_sd_state(int cpu);
>   #else
>   static inline void nohz_balance_enter_idle(int cpu) { }
> +static inline void nohz_clean_sd_state(int cpu) { }
>   #endif
>   
>   #ifdef CONFIG_NO_HZ_COMMON
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3e25be58e2b..6fcabe5d08f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
>   {
>          SCHED_WARN_ON(rq != this_rq());
>   
> +       if (on_null_domain(rq))
> +               return;
> +
>          if (likely(!rq->nohz_tick_stopped))
>                  return;
> 
If we force-clear rq->nohz_tick_stopped when detaching the domain, why
bother adding the first check?
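
To make the question concrete, here is a rough user-space sketch (not
kernel code). Every toy_* name is made up for illustration, and it
assumes nohz_balance_enter_idle() keeps refusing CPUs on a NULL domain.
Under that assumption, once the detach path has cleared
nohz_tick_stopped, the existing check in the exit path already filters
the isolated CPU:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Toy stand-ins for the state discussed above; hypothetical names. */
struct toy_rq {
	bool null_domain;       /* models on_null_domain(rq) */
	int nohz_tick_stopped;  /* models rq->nohz_tick_stopped */
};

static struct toy_rq toy_rqs[TOY_NR_CPUS];
static unsigned int toy_idle_cpus_mask; /* models nohz.idle_cpus_mask */
static int toy_nr_cpus;                 /* models nohz.nr_cpus */

/* Models nohz_balance_enter_idle(): isolated CPUs do not take part. */
static void toy_enter_idle(int cpu)
{
	struct toy_rq *rq = &toy_rqs[cpu];

	if (rq->nohz_tick_stopped || rq->null_domain)
		return;
	rq->nohz_tick_stopped = 1;
	toy_idle_cpus_mask |= 1u << cpu;
	toy_nr_cpus++;
}

/* Models attaching a NULL domain followed by the proposed cleanup. */
static void toy_detach_and_clean(int cpu)
{
	toy_rqs[cpu].null_domain = true;
	toy_rqs[cpu].nohz_tick_stopped = 0;
	if (toy_idle_cpus_mask & (1u << cpu)) {
		toy_idle_cpus_mask &= ~(1u << cpu);
		toy_nr_cpus--;
	}
}

/*
 * Models nohz_balance_exit_idle() WITHOUT the extra null-domain check.
 * Returns true if it modified the shared nohz state.
 */
static bool toy_exit_idle(int cpu)
{
	struct toy_rq *rq = &toy_rqs[cpu];

	if (!rq->nohz_tick_stopped)
		return false;
	rq->nohz_tick_stopped = 0;
	toy_idle_cpus_mask &= ~(1u << cpu);
	toy_nr_cpus--;
	return true;
}

/* Returns 0 if the isolated CPU never touches the shared state again. */
static int toy_nohz_demo(void)
{
	toy_enter_idle(0);
	toy_enter_idle(1);
	toy_detach_and_clean(0);   /* CPU0 becomes isolated */

	if (toy_exit_idle(0))      /* filtered by nohz_tick_stopped == 0 */
		return -1;
	toy_enter_idle(0);         /* refused: NULL domain */
	if (toy_exit_idle(0))
		return -1;
	if (toy_idle_cpus_mask != 0x2u || toy_nr_cpus != 1)
		return -1;
	return 0;
}
```

Calling toy_nohz_demo() from a main() should return 0 in this model. It
says nothing about races or other paths in the real scheduler that may
set nohz_tick_stopped.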

>   
> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
>          rcu_read_unlock();
>   }
>   
> +void nohz_clean_sd_state(int cpu) {
> +       struct rq *rq = cpu_rq(cpu);
> +
> +       rq->nohz_tick_stopped = 0;
> +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> +               atomic_dec(&nohz.nr_cpus);
> +       }
> +       set_cpu_sd_state_idle(cpu);
> +}
> +

detach_destroy_domains
	cpu_attach_domain
		update_top_cache_domain

As we clear per_cpu(sd_llc, cpu) for the isolated CPU in
cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
no?
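
A small user-space sketch of why I think so (again not kernel code; the
toy_* names are hypothetical, and the early return on a NULL sd_llc
reflects my reading of set_cpu_sd_state_idle(), which may be
incomplete):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 2

/* Toy stand-ins for sched_domain_shared and sched_domain. */
struct toy_sd_shared {
	int nr_busy_cpus;
};

struct toy_sd {
	int nohz_idle;
	struct toy_sd_shared *shared;
};

/* Models per_cpu(sd_llc, cpu). */
static struct toy_sd *toy_sd_llc[TOY_NR_CPUS];

/*
 * Follows the shape of set_cpu_sd_state_idle(): do nothing when the CPU
 * has no LLC domain or is already marked idle.
 * Returns 1 if any state was changed, 0 otherwise.
 */
static int toy_set_cpu_sd_state_idle(int cpu)
{
	struct toy_sd *sd = toy_sd_llc[cpu];

	if (!sd || sd->nohz_idle)
		return 0;
	sd->nohz_idle = 1;
	sd->shared->nr_busy_cpus--;
	return 1;
}

/* Models cpu_attach_domain(NULL, ...) clearing sd_llc for the CPU. */
static void toy_detach(int cpu)
{
	toy_sd_llc[cpu] = NULL;
}

/* Returns 0 if the call is a no-op for the detached CPU only. */
static int toy_sd_demo(void)
{
	struct toy_sd_shared shared = { .nr_busy_cpus = 2 };
	struct toy_sd sd0 = { .nohz_idle = 0, .shared = &shared };
	struct toy_sd sd1 = { .nohz_idle = 0, .shared = &shared };
	int changed0, changed1;

	toy_sd_llc[0] = &sd0;
	toy_sd_llc[1] = &sd1;

	toy_detach(0);
	changed0 = toy_set_cpu_sd_state_idle(0); /* detached CPU */
	changed1 = toy_set_cpu_sd_state_idle(1); /* still attached */

	toy_sd_llc[1] = NULL; /* do not keep pointers to locals */

	if (changed0 != 0 || changed1 != 1 || shared.nr_busy_cpus != 1)
		return -1;
	return 0;
}
```

In this model the call changes nothing for the detached CPU, while the
attached CPU still decrements nr_busy_cpus.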

thanks,
rui
>   /*
>    * This routine will record that the CPU is going idle with tick stopped.
>    * This info will be used in performing idle load balancing in the future.
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..d31137b5f0ce 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
>                  static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
>   
>          rcu_read_lock();
> -       for_each_cpu(i, cpu_map)
> +       for_each_cpu(i, cpu_map) {
>                  cpu_attach_domain(NULL, &def_root_domain, i);
> +               nohz_clean_sd_state(i);
> +       }
>          rcu_read_unlock();
>   }
> 
> --- snip ---
> 
> Regards,
> Pierre
> 
> > 
> > > 
> > > > +       }
> > > > +
> > > >           /*
> > > > >            * The tick is still stopped but load could have been added in the
> > > > >            * meantime. We set the nohz.has_blocked flag to trig a check of the
> > > > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> > > >           if (rq->nohz_tick_stopped)
> > > >                   goto out;
> > > > > -       /* If we're a completely isolated CPU, we don't play: */
> > > > -       if (on_null_domain(rq))
> > > > -               return;
> > > > -
> > > >           rq->nohz_tick_stopped = 1;
> > > >           cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > 
> > > > > Otherwise I could reproduce the issue and the patch was solving it, so:
> > > > Tested-by: Pierre Gondois <pierre.gondois@....com>
> > 
> > Thanks for testing, really appreciated!
> > > > 
> > > > > Also, your patch doesn't aim to solve that, but I think there is an issue
> > > > > when updating cpuset.cpus when an isolated partition was already created:
> > > > 
> > > > // Create an isolated partition containing CPU0
> > > > # mkdir cgroup
> > > > # mount -t cgroup2 none cgroup/
> > > > # mkdir cgroup/Testing
> > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > 
> > > > // CPU0's sched domain is detached:
> > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > domain0  domain1
> > > > 
> > > > // Change the isolated partition to be CPU1
> > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > 
> > > > // CPU[0-1] sched domains are not updated:
> > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > domain0  domain1
> > > > 
> > Interesting. Let me check and get back to you later on this. :)
> > 
> > thanks,
> > rui