[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c8baa176-e4cd-41f6-35a9-c69f89b32e79@arm.com>
Date: Tue, 12 Sep 2023 16:32:59 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: "Zhang, Rui" <rui.zhang@...el.com>,
"Lu, Aaron" <aaron.lu@...el.com>
Cc: "peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"tj@...nel.org" <tj@...nel.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>,
"Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
during NOHZ idle balance
Hello Rui and Aaron,
On 9/11/23 18:23, Zhang, Rui wrote:
> On Mon, 2023-09-11 at 19:42 +0800, Aaron Lu wrote:
>> Hi Pierre,
>>
>> On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
>>> Hello Rui,
>>>
>>> On 8/14/23 10:30, Zhang, Rui wrote:
>>>> On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
>>>>> Hi Rui,
>>>>>
>>>>> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
>>>>>> Problem statement
>>>>>> -----------------
>>>>>> When using cgroup isolated partition to isolate cpus
>>>>>> including
>>>>>> cpu0, it
>>>>>> is observed that cpu0 is woken up frequenctly but doing
>>>>>> nothing.
>>>>>> This is
>>>>>> not good for power efficiency.
>>>>>>
>>>>>> <idle>-0 [000] 616.491602: hrtimer_cancel:
>>>>>> hrtimer=0xffff8e8fdf623c10
>>>>>> <idle>-0 [000] 616.491608: hrtimer_start:
>>>>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>>>>> expires=615996000000 softexpires=615996000000
>>>>>> <idle>-0 [000] 616.491616: rcu_utilization: Start
>>>>>> context switch
>>>>>> <idle>-0 [000] 616.491618: rcu_utilization: End
>>>>>> context
>>>>>> switch
>>>>>> <idle>-0 [000] 616.491637: tick_stop:
>>>>>> success=1
>>>>>> dependency=NONE
>>>>>> <idle>-0 [000] 616.491637: hrtimer_cancel:
>>>>>> hrtimer=0xffff8e8fdf623c10
>>>>>> <idle>-0 [000] 616.491638: hrtimer_start:
>>>>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>>>>> expires=616420000000 softexpires=616420000000
>>>>>>
>>>>>> The above pattern repeats every one or multiple ticks,
>>>>>> results in
>>>>>> total
>>>>>> 2000+ wakeups on cpu0 in 60 seconds, when running workload on
>>>>>> the
>>>>>> cpus that are not in the isolated partition.
>>>>>>
>>>>>> Rootcause
>>>>>> ---------
>>>>>> In NOHZ mode, an active cpu either sends an IPI or touches
>>>>>> the idle
>>>>>> cpu's polling flag to wake it up, so that the idle cpu can
>>>>>> pull
>>>>>> tasks
>>>>>> from the busy cpu. The logic for selecting the target cpu is
>>>>>> to use
>>>>>> the
>>>>>> first idle cpu that presents in both nohz.idle_cpus_mask and
>>>>>> housekeeping_cpumask.
>>>>>>
>>>>>> In the above scenario, when cpu0 is in the cgroup isolated
>>>>>> partition,
>>>>>> its sched domain is deteched, but it is still available in
>>>>>> both of
>>>>>> the
>>>>>> above cpumasks. As a result, cpu0
>>>>>
>>>>> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it
>>>>> will not
>>>>> set itself in nohz.idle_cpus_mask and thus should not be chosen
>>>>> as
>>>>> ilb_cpu. I wonder what's stopping this from working?
>>>>
>>>> One thing I forgot to mention is that the problem is gone if we
>>>> offline
>>>> and re-online those cpus. In that case, the isolated cpus
>>>> are removed
>>>> from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are
>>>> never
>>>> added back.
>>>>
>>>> At runtime, the cpus can be removed from the nohz.idle_cpus_mask
>>>> in
>>>> below case only
>>>> trigger_load_balance()
>>>> if (unlikely(on_null_domain(rq) ||
>>>> !cpu_active(cpu_of(rq))))
>>>> return;
>>>> nohz_balancer_kick(rq);
>>>> nohz_balance_exit_idle()
>>>>
>>>> My understanding is that if a cpu is in nohz.idle_cpus_mask when
>>>> it is
>>>> isolated, there is no chance to remove it from that mask later,
>>>> so the
>>>> check in nohz_balance_enter_idle() does not help.
>>>
>>>
>>> As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped
>>> is
>>> never updated. Instead of avoiding isolated CPUs at each
>>> find_new_ilb() call
>>> as this patch does, maybe it would be better to permanently remove
>>> these CPUs
>>> from nohz.idle_cpus_mask. This would lower the number of checks
>>> done.
>>
>> I agree.
>
> Sounds reasonable to me.
>
>>
>>> This could be done with something like:
>>> @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
>>> */
>>> rq->has_blocked_load = 1;
>>> + /* If we're a completely isolated CPU, we don't play: */
>>> + if (on_null_domain(rq)) {
>>> + if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask))
>>> {
>>> + cpumask_clear_cpu(rq->cpu,
>>> nohz.idle_cpus_mask);
>>> + atomic_dec(&nohz.nr_cpus);
>>> + }
>>> +
>>
>>
>>> + /*
>>> + * Set nohz_tick_stopped on isolated CPUs to avoid
>>> keeping the
>>> + * value that was stored when
>>> rebuild_sched_domains() was called.
>>> + */
>>> + rq->nohz_tick_stopped = 1;
>>
>> It's not immediately clear to me what the above comment and code
>> mean,
>> maybe that's because I know very little about sched domain related
>> code.
>> Other than this, your change looks good to me.
>>
>
> If we set rq->nohz_tick_stopped for the isolated cpu, next time when we
> invoke nohz_balance_exit_idle(), we clear rq->nohz_tick_stopped and
> decrease nohz.nr_cpus (for the second time). This seems like a problem.
>
> In the current code, when rq->nohz_tick_stopped is set, it means the
> cpu is in the nohz.idle_cpus_mask, and the code above breaks this
> assumption.
Yes right indeed,
This happens when putting a CPU offline (as you mentioned earlier,
putting a CPU offline clears the CPU in the idle_cpus_mask).
The load balancing related variables are unused if a CPU has a NULL
rq as it cannot pull any task. Ideally we should clear them once,
when attaching a NULL sd to the CPU.
The following snipped should do that and solve the issue you mentioned:
--- snip ---
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,8 +9,10 @@
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
extern void nohz_balance_enter_idle(int cpu);
extern int get_nohz_timer_target(void);
+extern void nohz_clean_sd_state(int cpu);
#else
static inline void nohz_balance_enter_idle(int cpu) { }
+static inline void nohz_clean_sd_state(int cpu) { }
#endif
#ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..6fcabe5d08f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
{
SCHED_WARN_ON(rq != this_rq());
+ if (on_null_domain(rq))
+ return;
+
if (likely(!rq->nohz_tick_stopped))
return;
@@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
rcu_read_unlock();
}
+void nohz_clean_sd_state(int cpu) {
+ struct rq *rq = cpu_rq(cpu);
+
+ rq->nohz_tick_stopped = 0;
+ if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
+ cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+ atomic_dec(&nohz.nr_cpus);
+ }
+ set_cpu_sd_state_idle(cpu);
+}
+
/*
* This routine will record that the CPU is going idle with tick stopped.
* This info will be used in performing idle load balancing in the future.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..d31137b5f0ce 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
rcu_read_lock();
- for_each_cpu(i, cpu_map)
+ for_each_cpu(i, cpu_map) {
cpu_attach_domain(NULL, &def_root_domain, i);
+ nohz_clean_sd_state(i);
+ }
rcu_read_unlock();
}
--- snip ---
Regards,
Pierre
>
>>
>>> + }
>>> +
>>> /*
>>> * The tick is still stopped but load could have been
>>> added in the
>>> * meantime. We set the nohz.has_blocked flag to trig a
>>> check of the
>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>> if (rq->nohz_tick_stopped)
>>> goto out;
>>> - /* If we're a completely isolated CPU, we don't play: */
>>> - if (on_null_domain(rq))
>>> - return;
>>> -
>>> rq->nohz_tick_stopped = 1;
>>> cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>
>>> Otherwise I could reproduce the issue and the patch was solving it,
>>> so:
>>> Tested-by: Pierre Gondois <pierre.gondois@....com>
>
> Thanks for testing, really appreciated!
>>>
>>>
>>>
>>>
>>>
>>> Also, your patch doesn't aim to solve that, but I think there is an
>>> issue
>>> when updating cpuset.cpus when an isolated partition was already
>>> created:
>>>
>>> // Create an isolated partition containing CPU0
>>> # mkdir cgroup
>>> # mount -t cgroup2 none cgroup/
>>> # mkdir cgroup/Testing
>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>
>>> // CPU0's sched domain is detached:
>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>> domain0 domain1
>>>
>>> // Change the isolated partition to be CPU1
>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>
>>> // CPU[0-1] sched domains are not updated:
>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>> domain0 domain1
>>>
> Interesting. Let me check and get back to you later on this. :)
>
> thanks,
> rui
Powered by blists - more mailing lists