linux-kernel - Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ea8da512-73df-59ed-6c47-912dd9ef9dba@arm.com>
Date:   Thu, 14 Sep 2023 16:53:12 +0200
From:   Pierre Gondois <pierre.gondois@....com>
To:     "Zhang, Rui" <rui.zhang@...el.com>,
        "Lu, Aaron" <aaron.lu@...el.com>
Cc:     "peterz@...radead.org" <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "dietmar.eggemann@....com" <dietmar.eggemann@....com>,
        "tj@...nel.org" <tj@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
 during NOHZ idle balance



On 9/14/23 11:23, Zhang, Rui wrote:
> Hi, Pierre,
> 
>>
>> Yes right indeed,
>> This happens when putting a CPU offline (as you mentioned earlier,
>> putting a CPU offline clears the CPU in the idle_cpus_mask).
>>
>> The load balancing related variables
> 
> including?

I meant the nohz idle variables in the load balancing, so I was referring to:
(struct sched_domain_shared).nr_busy_cpus
(struct sched_domain).nohz_idle
nohz.idle_cpus_mask
nohz.nr_cpus
(struct rq).nohz_tick_stopped

> 
>>   are unused if a CPU has a NULL
>> rq as it cannot pull any task. Ideally we should clear them once,
>> when attaching a NULL sd to the CPU.
> 
> This sounds good to me. But TBH, I don't have enough confidence to do
> so because I'm not crystal clear about how these variables are used.
> 
> Some questions about the code below.
>>
>> The following snipped should do that and solve the issue you
>> mentioned:
>> --- snip ---
>> --- a/include/linux/sched/nohz.h
>> +++ b/include/linux/sched/nohz.h
>> @@ -9,8 +9,10 @@
>>    #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>>    extern void nohz_balance_enter_idle(int cpu);
>>    extern int get_nohz_timer_target(void);
>> +extern void nohz_clean_sd_state(int cpu);
>>    #else
>>    static inline void nohz_balance_enter_idle(int cpu) { }
>> +static inline void nohz_clean_sd_state(int cpu) { }
>>    #endif
>>    
>>    #ifdef CONFIG_NO_HZ_COMMON
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b3e25be58e2b..6fcabe5d08f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
>>    {
>>           SCHED_WARN_ON(rq != this_rq());
>>    
>> +       if (on_null_domain(rq))
>> +               return;
>> +
>>           if (likely(!rq->nohz_tick_stopped))
>>                   return;
>>
> if we force clearing rq->nohz_tick_stopped when detaching domain, why
> bother adding the first check?

Yes you're right. I added this check for safety, but this is not
mandatory.

> 
>>    
>> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
>>           rcu_read_unlock();
>>    }
>>    
>> +void nohz_clean_sd_state(int cpu) {
>> +       struct rq *rq = cpu_rq(cpu);
>> +
>> +       rq->nohz_tick_stopped = 0;
>> +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
>> +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
>> +               atomic_dec(&nohz.nr_cpus);
>> +       }
>> +       set_cpu_sd_state_idle(cpu);
>> +}
>> +
> 
> detach_destroy_domains
> 	cpu_attach_domain
> 		update_top_cache_domain
> 
> as we clears per_cpu(sd_llc, cpu) for the isolated cpu in
> cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
> no?

Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls
have to be inverted to avoid what you just described.

It also seems that the current kernel doesn't decrease nr_busy_cpus
when putting CPUs in an isolated partition. Indeed if a CPU is counted
in nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger
any call to set_cpu_sd_state_idle().
So it might an additional argument.

Thanks for reading the patch,
Regards,
Pierre

> 
> thanks,
> rui
>>    /*
>>     * This routine will record that the CPU is going idle with tick
>> stopped.
>>     * This info will be used in performing idle load balancing in the
>> future.
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d3a3b2646ec4..d31137b5f0ce 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
>> struct cpumask *cpu_map)
>>                  
>> static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
>>    
>>           rcu_read_lock();
>> -       for_each_cpu(i, cpu_map)
>> +       for_each_cpu(i, cpu_map) {
>>                   cpu_attach_domain(NULL, &def_root_domain, i);
>> +               nohz_clean_sd_state(i);
>> +       }
>>           rcu_read_unlock();
>>    }
>>
>> --- snip ---
>>
>> Regards,
>> Pierre
>>
>>>
>>>>
>>>>> +       }
>>>>> +
>>>>>            /*
>>>>>             * The tick is still stopped but load could have been
>>>>> added in the
>>>>>             * meantime. We set the nohz.has_blocked flag to trig
>>>>> a
>>>>> check of the
>>>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>>>>            if (rq->nohz_tick_stopped)
>>>>>                    goto out;
>>>>> -       /* If we're a completely isolated CPU, we don't play:
>>>>> */
>>>>> -       if (on_null_domain(rq))
>>>>> -               return;
>>>>> -
>>>>>            rq->nohz_tick_stopped = 1;
>>>>>            cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>>>
>>>>> Otherwise I could reproduce the issue and the patch was solving
>>>>> it,
>>>>> so:
>>>>> Tested-by: Pierre Gondois <pierre.gondois@....com>
>>>
>>> Thanks for testing, really appreciated!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Also, your patch doesn't aim to solve that, but I think there
>>>>> is an
>>>>> issue
>>>>> when updating cpuset.cpus when an isolated partition was
>>>>> already
>>>>> created:
>>>>>
>>>>> // Create an isolated partition containing CPU0
>>>>> # mkdir cgroup
>>>>> # mount -t cgroup2 none cgroup/
>>>>> # mkdir cgroup/Testing
>>>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>>>
>>>>> // CPU0's sched domain is detached:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0  domain1
>>>>>
>>>>> // Change the isolated partition to be CPU1
>>>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>>>
>>>>> // CPU[0-1] sched domains are not updated:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0  domain1
>>>>>
>>> Interesting. Let me check and get back to you later on this. :)
>>>
>>> thanks,
>>> rui
>