Message-ID: <ca8805c9368f0b9f7459b7ecadd963e95fb32d98.camel@intel.com>
Date: Fri, 11 Aug 2023 08:49:25 +0000
From: "Zhang, Rui" <rui.zhang@...el.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
CC: "peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
"tj@...nel.org" <tj@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Hi, Yu,

On Wed, 2023-08-09 at 15:00 +0800, Chen Yu wrote:
> On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When using a cgroup isolated partition to isolate cpus including cpu0,
> > it is observed that cpu0 is woken up frequently but does nothing. This
> > is not good for power efficiency.
> >
> > <idle>-0 [000] 616.491602: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> > <idle>-0 [000] 616.491608: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > <idle>-0 [000] 616.491616: rcu_utilization: Start context switch
> > <idle>-0 [000] 616.491618: rcu_utilization: End context switch
> > <idle>-0 [000] 616.491637: tick_stop: success=1 dependency=NONE
> > <idle>-0 [000] 616.491637: hrtimer_cancel: hrtimer=0xffff8e8fdf623c10
> > <idle>-0 [000] 616.491638: hrtimer_start: hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> >
> > The above pattern repeats every one or more ticks, resulting in a
> > total of 2000+ wakeups on cpu0 in 60 seconds when running a workload
> > on the cpus that are not in the isolated partition.
> >
> > Root cause
> > ----------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> > from the busy cpu. The logic for selecting the target cpu is to use
> > the first idle cpu that is present in both nohz.idle_cpus_mask and
> > housekeeping_cpumask.
> >
> > In the above scenario, when cpu0 is in the cgroup isolated partition,
> > its sched domain is detached, but it is still available in both of
> > the above cpumasks. As a result, cpu0
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it is
> >    not in any sched_domain, thus it does nothing and reenters idle.
> >
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached during
> > NOHZ idle balance.
> >
> > Signed-off-by: Zhang Rui <rui.zhang@...el.com>
> > ---
> > kernel/sched/fair.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >  		if (ilb == smp_processor_id())
> >  			continue;
> >  
> > +		if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +			continue;
> > +
> >  		if (idle_cpu(ilb))
> >  			return ilb;
> >  	}
>
> Is it possible to pass a valid cpumask to kick_ilb() via
> nohz_balancer_kick() and let find_new_ilb() scan in that mask? That way
> we could shrink the scan range and also reduce the null domain checks
> done in the loop. CPUs in different cpusets are in different root
> domains, so the busy CPU (in cpuset0) will not ask the nohz idle CPU0
> (in isolated cpuset1) to launch idle load balance.
>
> struct root_domain *rd = rq->rd;
> ...
> kick_ilb(flags, rd->span)
>
Yeah. This also sounds like a reasonable approach. I can make a patch
to confirm it works as expected.
thanks,
rui
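
Editorial sketch (not part of the original message): the selection logic
discussed in this thread can be modelled in user space to compare the three
behaviours side by side. Everything below is an assumption made for
illustration: struct model_cpu, enum model_policy, model_find_new_ilb() and
model_fill_isolated_cpu0() are hypothetical names, not kernel APIs, and the
per-CPU booleans only stand in for nohz.idle_cpus_mask, the housekeeping
cpumask, idle_cpu() and on_null_domain(). The root_domain field follows
Chen Yu's description that CPUs in different cpusets sit in different root
domains; it is not verified against the scheduler here. The model returns -1
when no candidate is found, which is a simplification of its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; none of these names exist in the kernel. */
struct model_cpu {
	bool nohz_idle;    /* stands in for membership in nohz.idle_cpus_mask */
	bool housekeeping; /* stands in for membership in the housekeeping mask */
	bool idle;         /* stands in for idle_cpu() */
	bool has_domain;   /* false models on_null_domain() returning true */
	int root_domain;   /* hypothetical id of the CPU's root domain */
};

enum model_policy {
	MODEL_ORIGINAL,         /* behaviour before the patch */
	MODEL_SKIP_NULL_DOMAIN, /* the posted patch: skip CPUs with no sched domain */
	MODEL_SCAN_ROOT_SPAN,   /* the suggestion: scan only the caller's root domain */
};

/* Return the first CPU eligible to run idle load balance, or -1 if none. */
static int model_find_new_ilb(const struct model_cpu *cpus, int nr_cpus,
			      int self, enum model_policy policy)
{
	assert(cpus != NULL && nr_cpus > 0 && self >= 0 && self < nr_cpus);

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		/* Only CPUs present in both masks are candidates. */
		if (!cpus[cpu].nohz_idle || !cpus[cpu].housekeeping)
			continue;
		if (cpu == self)
			continue;
		if (policy == MODEL_SKIP_NULL_DOMAIN && !cpus[cpu].has_domain)
			continue;
		if (policy == MODEL_SCAN_ROOT_SPAN &&
		    cpus[cpu].root_domain != cpus[self].root_domain)
			continue;
		if (cpus[cpu].idle)
			return cpu;
	}
	return -1;
}

/* Scenario from the thread: cpu0 isolated and idle, cpu1 and cpu3 busy, cpu2 idle. */
static void model_fill_isolated_cpu0(struct model_cpu cpus[4])
{
	cpus[0] = (struct model_cpu){ .nohz_idle = true, .housekeeping = true,
				      .idle = true, .has_domain = false,
				      .root_domain = 1 };
	cpus[1] = (struct model_cpu){ .nohz_idle = false, .housekeeping = true,
				      .idle = false, .has_domain = true,
				      .root_domain = 0 };
	cpus[2] = (struct model_cpu){ .nohz_idle = true, .housekeeping = true,
				      .idle = true, .has_domain = true,
				      .root_domain = 0 };
	cpus[3] = cpus[1];
}
```

With the scenario filled in and cpu1 as the caller, MODEL_ORIGINAL selects
cpu0 (the needless wakeup reported above), while both MODEL_SKIP_NULL_DOMAIN
and MODEL_SCAN_ROOT_SPAN select cpu2. In this model the span variant never
visits a CPU outside the caller's root domain, which is the scan-range
reduction the suggestion is after.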