linux-kernel - Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Message-ID: <ca8805c9368f0b9f7459b7ecadd963e95fb32d98.camel@intel.com>
Date:   Fri, 11 Aug 2023 08:49:25 +0000
From:   "Zhang, Rui" <rui.zhang@...el.com>
To:     "Chen, Yu C" <yu.c.chen@...el.com>
CC:     "peterz@...radead.org" <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
        "tj@...nel.org" <tj@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
 during NOHZ idle balance

Hi, Yu,

On Wed, 2023-08-09 at 15:00 +0800, Chen Yu wrote:
> On 2023-08-04 at 17:08:58 +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When a cgroup isolated partition is used to isolate cpus including
> > cpu0, cpu0 is observed to be woken up frequently while doing nothing.
> > This is not good for power efficiency.
> > 
> > <idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > <idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
> > <idle>-0     [000]   616.491618: rcu_utilization:      End context switch
> > <idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
> > <idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> > 
> > The above pattern repeats every one or more ticks, resulting in
> > 2000+ wakeups on cpu0 in 60 seconds while a workload runs on the
> > cpus that are not in the isolated partition.
> > 
> > Root cause
> > ----------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> > from the busy cpu. The target cpu is the first idle cpu that is
> > present in both nohz.idle_cpus_mask and housekeeping_cpumask.
> > 
> > In the above scenario, when cpu0 is in the cgroup isolated partition,
> > its sched domain is detached, but it is still present in both of the
> > above cpumasks. As a result, cpu0
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it is
> >    not in any sched_domain, so it does nothing and re-enters idle.
> > 
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached
> > during NOHZ idle balance.
> > 
> > Signed-off-by: Zhang Rui <rui.zhang@...el.com>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >                 if (ilb == smp_processor_id())
> >                         continue;
> >  
> > +               if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +                       continue;
> > +
> >                 if (idle_cpu(ilb))
> >                         return ilb;
> >         }
> 
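The effect of the added check can be illustrated with a small userspace model. This is only a sketch, not kernel code: `toy_cpu`, `toy_find_new_ilb()` and the plain bitmask are hypothetical stand-ins for the runqueue state, `find_new_ilb()` and the cpumask iteration, and the `null_domain` flag stands in for `on_null_domain(cpu_rq(ilb))`.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 8

/* Toy per-CPU state; these fields are stand-ins, not kernel structures. */
struct toy_cpu {
	bool idle;        /* models idle_cpu(cpu) */
	bool null_domain; /* models on_null_domain(cpu_rq(cpu)) */
};

/*
 * Return the first idle CPU in 'mask' other than 'self', or -1 if none.
 * With 'skip_null_domain' set, CPUs without an attached sched domain are
 * passed over, which mirrors the check added by the patch.
 */
static int toy_find_new_ilb(const struct toy_cpu *cpus, unsigned int mask,
			    int self, bool skip_null_domain)
{
	assert(self >= 0 && self < TOY_NR_CPUS);

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!(mask & (1u << cpu)))
			continue;
		if (cpu == self)
			continue;
		if (skip_null_domain && cpus[cpu].null_domain)
			continue;
		if (cpus[cpu].idle)
			return cpu;
	}
	return -1;
}
```

In this model, an idle cpu0 with `null_domain` set is picked on every call when the check is off, and is passed over in favour of the next idle cpu when the check is on.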
> Is it possible to pass a valid cpumask to kick_ilb() via
> nohz_balancer_kick() and let find_new_ilb() scan in that mask? That
> would shrink the scan range and also avoid the null domain check in
> each loop iteration. CPUs in different cpusets are in different root
> domains, so the busy CPU (in cpuset0) will not ask the nohz idle CPU0
> (in isolated cpuset1) to launch idle load balance.
> 
> struct root_domain *rd = rq->rd;
> ...
> kick_ilb(flags, rd->span)
>          

Yeah. This also sounds like a reasonable approach. I can make a patch
to confirm it works as expected.
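For comparison, the span-based alternative can be modelled the same way. Again this is a hedged userspace sketch under stated assumptions: `toy_find_new_ilb_in_span()` and the three bitmask parameters are hypothetical, membership in `nohz_idle_mask` is treated as "idle", and `rd_span` stands in for the `rd->span` of the kicking cpu's root domain. How the mask would actually be plumbed through `nohz_balancer_kick()` and `kick_ilb()` is not shown here.

```c
#include <assert.h>

#define SPAN_NR_CPUS 8

/*
 * Return the first CPU other than 'self' that is NOHZ idle, a housekeeping
 * CPU, and inside the caller's root-domain span; -1 if there is none.
 * Restricting the scan to 'rd_span' excludes CPUs that belong to another
 * root domain, so no per-CPU null-domain check is needed in the loop.
 */
static int toy_find_new_ilb_in_span(unsigned int nohz_idle_mask,
				    unsigned int housekeeping_mask,
				    unsigned int rd_span, int self)
{
	unsigned int candidates = nohz_idle_mask & housekeeping_mask & rd_span;

	assert(self >= 0 && self < SPAN_NR_CPUS);

	for (int cpu = 0; cpu < SPAN_NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		if (candidates & (1u << cpu))
			return cpu;
	}
	return -1;
}
```

With cpu0 left out of the span, a busy cpu1 never selects it, regardless of whether cpu0 is still set in the other two masks.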

thanks,
rui