[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ea8da512-73df-59ed-6c47-912dd9ef9dba@arm.com>
Date: Thu, 14 Sep 2023 16:53:12 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: "Zhang, Rui" <rui.zhang@...el.com>,
"Lu, Aaron" <aaron.lu@...el.com>
Cc: "peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>,
"tj@...nel.org" <tj@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
during NOHZ idle balance
On 9/14/23 11:23, Zhang, Rui wrote:
> Hi, Pierre,
>
>>
>> Yes right indeed,
>> This happens when putting a CPU offline (as you mentioned earlier,
>> putting a CPU offline clears the CPU in the idle_cpus_mask).
>>
>> The load balancing related variables
>
> including?
I meant the nohz idle variables in the load balancing, so I was referring to:
(struct sched_domain_shared).nr_busy_cpus
(struct sched_domain).nohz_idle
nohz.idle_cpus_mask
nohz.nr_cpus
(struct rq).nohz_tick_stopped
>
>> are unused if a CPU has a NULL
>> rq as it cannot pull any task. Ideally we should clear them once,
>> when attaching a NULL sd to the CPU.
>
> This sounds good to me. But TBH, I don't have enough confidence to do
> so because I'm not crystal clear about how these variables are used.
>
> Some questions about the code below.
>>
>> The following snipped should do that and solve the issue you
>> mentioned:
>> --- snip ---
>> --- a/include/linux/sched/nohz.h
>> +++ b/include/linux/sched/nohz.h
>> @@ -9,8 +9,10 @@
>> #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>> extern void nohz_balance_enter_idle(int cpu);
>> extern int get_nohz_timer_target(void);
>> +extern void nohz_clean_sd_state(int cpu);
>> #else
>> static inline void nohz_balance_enter_idle(int cpu) { }
>> +static inline void nohz_clean_sd_state(int cpu) { }
>> #endif
>>
>> #ifdef CONFIG_NO_HZ_COMMON
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b3e25be58e2b..6fcabe5d08f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
>> {
>> SCHED_WARN_ON(rq != this_rq());
>>
>> + if (on_null_domain(rq))
>> + return;
>> +
>> if (likely(!rq->nohz_tick_stopped))
>> return;
>>
> if we force clearing rq->nohz_tick_stopped when detaching domain, why
> bother adding the first check?
Yes you're right. I added this check for safety, but this is not
mandatory.
>
>>
>> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
>> rcu_read_unlock();
>> }
>>
>> +void nohz_clean_sd_state(int cpu) {
>> + struct rq *rq = cpu_rq(cpu);
>> +
>> + rq->nohz_tick_stopped = 0;
>> + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
>> + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
>> + atomic_dec(&nohz.nr_cpus);
>> + }
>> + set_cpu_sd_state_idle(cpu);
>> +}
>> +
>
> detach_destroy_domains
> cpu_attach_domain
> update_top_cache_domain
>
> as we clears per_cpu(sd_llc, cpu) for the isolated cpu in
> cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
> no?
Yes you're right, cpu_attach_domain() and nohz_clean_sd_state() calls
have to be inverted to avoid what you just described.
It also seems that the current kernel doesn't decrease nr_busy_cpus
when putting CPUs in an isolated partition. Indeed if a CPU is counted
in nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger
any call to set_cpu_sd_state_idle().
So it might an additional argument.
Thanks for reading the patch,
Regards,
Pierre
>
> thanks,
> rui
>> /*
>> * This routine will record that the CPU is going idle with tick
>> stopped.
>> * This info will be used in performing idle load balancing in the
>> future.
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d3a3b2646ec4..d31137b5f0ce 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const
>> struct cpumask *cpu_map)
>>
>> static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
>>
>> rcu_read_lock();
>> - for_each_cpu(i, cpu_map)
>> + for_each_cpu(i, cpu_map) {
>> cpu_attach_domain(NULL, &def_root_domain, i);
>> + nohz_clean_sd_state(i);
>> + }
>> rcu_read_unlock();
>> }
>>
>> --- snip ---
>>
>> Regards,
>> Pierre
>>
>>>
>>>>
>>>>> + }
>>>>> +
>>>>> /*
>>>>> * The tick is still stopped but load could have been
>>>>> added in the
>>>>> * meantime. We set the nohz.has_blocked flag to trig
>>>>> a
>>>>> check of the
>>>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>>>> if (rq->nohz_tick_stopped)
>>>>> goto out;
>>>>> - /* If we're a completely isolated CPU, we don't play:
>>>>> */
>>>>> - if (on_null_domain(rq))
>>>>> - return;
>>>>> -
>>>>> rq->nohz_tick_stopped = 1;
>>>>> cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>>>
>>>>> Otherwise I could reproduce the issue and the patch was solving
>>>>> it,
>>>>> so:
>>>>> Tested-by: Pierre Gondois <pierre.gondois@....com>
>>>
>>> Thanks for testing, really appreciated!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Also, your patch doesn't aim to solve that, but I think there
>>>>> is an
>>>>> issue
>>>>> when updating cpuset.cpus when an isolated partition was
>>>>> already
>>>>> created:
>>>>>
>>>>> // Create an isolated partition containing CPU0
>>>>> # mkdir cgroup
>>>>> # mount -t cgroup2 none cgroup/
>>>>> # mkdir cgroup/Testing
>>>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>>>
>>>>> // CPU0's sched domain is detached:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0 domain1
>>>>>
>>>>> // Change the isolated partition to be CPU1
>>>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>>>
>>>>> // CPU[0-1] sched domains are not updated:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0 domain1
>>>>>
>>> Interesting. Let me check and get back to you later on this. :)
>>>
>>> thanks,
>>> rui
>
Powered by blists - more mailing lists