Message-ID: <0d1c440becca151db3f3b05b3fb2c63fe69c2904.camel@intel.com>
Date: Mon, 11 Sep 2023 16:23:58 +0000
From: "Zhang, Rui" <rui.zhang@...el.com>
To: "Lu, Aaron" <aaron.lu@...el.com>,
"pierre.gondois@....com" <pierre.gondois@....com>
CC: "dietmar.eggemann@....com" <dietmar.eggemann@....com>,
"peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"tj@...nel.org" <tj@...nel.org>,
"Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
during NOHZ idle balance
On Mon, 2023-09-11 at 19:42 +0800, Aaron Lu wrote:
> Hi Pierre,
>
> On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
> > Hello Rui,
> >
> > On 8/14/23 10:30, Zhang, Rui wrote:
> > > On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> > > > Hi Rui,
> > > >
> > > > On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> > > > > Problem statement
> > > > > -----------------
> > > > > When using cgroup isolated partition to isolate cpus including
> > > > > cpu0, it is observed that cpu0 is woken up frequently but doing
> > > > > nothing. This is not good for power efficiency.
> > > > >
> > > > >   <idle>-0  [000]  616.491602: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> > > > >   <idle>-0  [000]  616.491608: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > > > >   <idle>-0  [000]  616.491616: rcu_utilization: Start context switch
> > > > >   <idle>-0  [000]  616.491618: rcu_utilization: End context switch
> > > > >   <idle>-0  [000]  616.491637: tick_stop: success=1 dependency=NONE
> > > > >   <idle>-0  [000]  616.491637: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> > > > >   <idle>-0  [000]  616.491638: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> > > > >
> > > > > The above pattern repeats every one or multiple ticks, resulting
> > > > > in a total of 2000+ wakeups on cpu0 in 60 seconds, when running a
> > > > > workload on the cpus that are not in the isolated partition.
> > > > >
> > > > > Root cause
> > > > > ----------
> > > > > In NOHZ mode, an active cpu either sends an IPI or touches the
> > > > > idle cpu's polling flag to wake it up, so that the idle cpu can
> > > > > pull tasks from the busy cpu. The logic for selecting the target
> > > > > cpu is to use the first idle cpu that is present in both
> > > > > nohz.idle_cpus_mask and housekeeping_cpumask.
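> > > > >
> > > > > As an illustration, the selection works roughly like the loop
> > > > > below (a simplified sketch in the spirit of find_new_ilb(), not
> > > > > the exact upstream code, which may differ by kernel version):
> > > > >
> > > > > 	/* pick the first idle housekeeping cpu to run the idle balance */
> > > > > 	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask,
> > > > > 			 housekeeping_cpumask(HK_TYPE_MISC)) {
> > > > > 		if (ilb_cpu == smp_processor_id())
> > > > > 			continue;
> > > > >
> > > > > 		if (idle_cpu(ilb_cpu))
> > > > > 			return ilb_cpu;
> > > > > 	}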
> > > > >
> > > > > In the above scenario, when cpu0 is in the cgroup isolated
> > > > > partition, its sched domain is detached, but it is still
> > > > > available in both of the above cpumasks. As a result, cpu0
> > > >
> > > > I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will
> > > > not set itself in nohz.idle_cpus_mask and thus should not be
> > > > chosen as ilb_cpu. I wonder what's stopping this from working?
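> > > >
> > > > For reference, the check I have in mind is roughly the following
> > > > part of nohz_balance_enter_idle() (paraphrased, the exact code may
> > > > differ by kernel version):
> > > >
> > > > 	/* If we're a completely isolated CPU, we don't play: */
> > > > 	if (on_null_domain(rq))
> > > > 		return;
> > > >
> > > > 	rq->nohz_tick_stopped = 1;
> > > > 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);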
> > >
> > > One thing I forgot to mention is that the problem is gone if we
> > > offline and re-online those cpus. In that case, the isolated cpus
> > > are removed from the nohz.idle_cpus_mask in sched_cpu_deactivate()
> > > and are never added back.
> > >
> > > At runtime, the cpus can be removed from the nohz.idle_cpus_mask in
> > > the below case only:
> > >
> > >   trigger_load_balance()
> > >       if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
> > >           return;
> > >       nohz_balancer_kick(rq);
> > >           nohz_balance_exit_idle()
> > >
> > > My understanding is that if a cpu is in nohz.idle_cpus_mask when it
> > > is isolated, there is no chance to remove it from that mask later,
> > > so the check in nohz_balance_enter_idle() does not help.
> >
> >
> > As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
> > never updated. Instead of avoiding isolated CPUs at each find_new_ilb()
> > call as this patch does, maybe it would be better to permanently remove
> > these CPUs from nohz.idle_cpus_mask. This would lower the number of
> > checks done.
>
> I agree.
Sounds reasonable to me.
>
> > This could be done with something like:
> > @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
> >  	 */
> >  	rq->has_blocked_load = 1;
> >
> > +	/* If we're a completely isolated CPU, we don't play: */
> > +	if (on_null_domain(rq)) {
> > +		if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
> > +			cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
> > +			atomic_dec(&nohz.nr_cpus);
> > +		}
> > +
>
>
> > +		/*
> > +		 * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
> > +		 * value that was stored when rebuild_sched_domains() was called.
> > +		 */
> > +		rq->nohz_tick_stopped = 1;
>
> It's not immediately clear to me what the above comment and code mean,
> maybe that's because I know very little about sched domain related code.
> Other than this, your change looks good to me.
>
If we set rq->nohz_tick_stopped for the isolated cpu, then the next time
nohz_balance_exit_idle() is invoked, we clear rq->nohz_tick_stopped and
decrease nohz.nr_cpus a second time. This seems like a problem. In the
current code, rq->nohz_tick_stopped being set means the cpu is in
nohz.idle_cpus_mask, and the code above breaks this assumption.
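
For reference, the exit path does roughly the following (paraphrased from
nohz_balance_exit_idle() in kernel/sched/fair.c, the exact code may differ
by kernel version):

	if (likely(!rq->nohz_tick_stopped))
		return;

	rq->nohz_tick_stopped = 0;
	/* assumes the cpu was put into nohz.idle_cpus_mask earlier */
	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
	atomic_dec(&nohz.nr_cpus);

	set_cpu_sd_state_busy(rq->cpu);

So a cpu that has nohz_tick_stopped set but is no longer in the mask would
clear a bit that is already clear and decrement nohz.nr_cpus one extra time.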
>
> > +	}
> > +
> >  	/*
> >  	 * The tick is still stopped but load could have been added in the
> >  	 * meantime. We set the nohz.has_blocked flag to trig a check of the
> > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> >  	if (rq->nohz_tick_stopped)
> >  		goto out;
> >
> > -	/* If we're a completely isolated CPU, we don't play: */
> > -	if (on_null_domain(rq))
> > -		return;
> > -
> >  	rq->nohz_tick_stopped = 1;
> >  	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> >
> > Otherwise I could reproduce the issue and the patch was solving it, so:
> > Tested-by: Pierre Gondois <pierre.gondois@....com>
Thanks for testing, really appreciated!
> >
> > Also, your patch doesn't aim to solve that, but I think there is an
> > issue when updating cpuset.cpus when an isolated partition was already
> > created:
> >
> > // Create an isolated partition containing CPU0
> > # mkdir cgroup
> > # mount -t cgroup2 none cgroup/
> > # mkdir cgroup/Testing
> > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > # echo 0 > cgroup/Testing/cpuset.cpus
> > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> >
> > // CPU0's sched domain is detached:
> > # ls /sys/kernel/debug/sched/domains/cpu0/
> > # ls /sys/kernel/debug/sched/domains/cpu1/
> > domain0 domain1
> >
> > // Change the isolated partition to be CPU1
> > # echo 1 > cgroup/Testing/cpuset.cpus
> >
> > // CPU[0-1] sched domains are not updated:
> > # ls /sys/kernel/debug/sched/domains/cpu0/
> > # ls /sys/kernel/debug/sched/domains/cpu1/
> > domain0 domain1
> >
Interesting. Let me check and get back to you later on this. :)
thanks,
rui