linux-kernel - Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Message-ID: <cb305abedea24980c93ce2b436e64039d3796812.camel@intel.com>
Date:   Mon, 14 Aug 2023 08:30:39 +0000
From:   "Zhang, Rui" <rui.zhang@...el.com>
To:     "Lu, Aaron" <aaron.lu@...el.com>
CC:     "peterz@...radead.org" <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
        "tj@...nel.org" <tj@...nel.org>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
 during NOHZ idle balance

On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> Hi Rui,
> 
> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> > Problem statement
> > -----------------
> > When using a cgroup isolated partition to isolate cpus including cpu0,
> > it is observed that cpu0 is woken up frequently but does nothing. This
> > is not good for power efficiency.
> > 
> > <idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > <idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
> > <idle>-0     [000]   616.491618: rcu_utilization:      End context switch
> > <idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
> > <idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > <idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> > 
> > The above pattern repeats every one or more ticks, resulting in a
> > total of 2000+ wakeups on cpu0 in 60 seconds when running a workload
> > on the cpus that are not in the isolated partition.
> > 
> > Root cause
> > ----------
> > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> > from the busy cpu. The logic for selecting the target cpu is to use
> > the first idle cpu that is present in both nohz.idle_cpus_mask and
> > housekeeping_cpumask.
> > 
> > In the above scenario, when cpu0 is in the cgroup isolated partition,
> > its sched domain is detached, but it is still available in both of
> > the above cpumasks. As a result, cpu0
> 
> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
> set itself in nohz.idle_cpus_mask and thus should not be chosen as
> ilb_cpu. I wonder what's stopping this from working?

One thing I forgot to mention is that the problem is gone if we offline
and re-online those cpus. In that case, the isolated cpus are removed
from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
added back.

At runtime, a cpu can be removed from nohz.idle_cpus_mask only in the
case below:
	trigger_load_balance()
	        if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
	                return;
	        nohz_balancer_kick(rq);
			nohz_balance_exit_idle()

My understanding is that if a cpu is already in nohz.idle_cpus_mask when
it is isolated, there is no chance to remove it from that mask later,
because trigger_load_balance() returns early once on_null_domain(rq) is
true. So the check in nohz_balance_enter_idle() does not help.
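
To illustrate the sequence, here is a minimal user-space sketch. It is a
toy model and not kernel code: the toy_* names, the bitmask
representation of the cpumasks, and the -1 "no cpu found" return value
are all assumptions made for this example. Being set in the toy idle
mask also stands in for the idle_cpu() check.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Toy state, one bit per cpu. All names here are hypothetical. */
struct toy_sched {
	uint32_t idle_cpus_mask;	/* stands in for nohz.idle_cpus_mask */
	uint32_t housekeeping;		/* stands in for the housekeeping cpumask */
	uint32_t null_domain;		/* cpus with no sched domain attached */
};

/* Models the check in nohz_balance_enter_idle(): an isolated cpu
 * does not add itself to the idle mask. */
static void toy_enter_idle(struct toy_sched *s, int cpu)
{
	if (s->null_domain & (1u << cpu))
		return;
	s->idle_cpus_mask |= 1u << cpu;
}

/* Models isolating a cpu at runtime: the sched domain is detached,
 * but nothing clears the cpu's bit in the idle mask. */
static void toy_isolate(struct toy_sched *s, int cpu)
{
	s->null_domain |= 1u << cpu;
}

/* Models the selection loop: the first idle housekeeping cpu other
 * than the caller. With skip_null_domain set, cpus without a sched
 * domain are skipped, as the patch proposes. Returns -1 if none. */
static int toy_find_new_ilb(const struct toy_sched *s, int self,
			    bool skip_null_domain)
{
	uint32_t candidates = s->idle_cpus_mask & s->housekeeping;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!(candidates & (1u << cpu)))
			continue;
		if (cpu == self)
			continue;
		if (skip_null_domain && (s->null_domain & (1u << cpu)))
			continue;
		return cpu;
	}
	return -1;
}
```

If cpu0 and cpu2 enter idle, then cpu0 is isolated, and cpu1 looks for
an ilb target, the model keeps picking cpu0 without the skip and picks
cpu2 with it. A later toy_enter_idle() on cpu0 changes nothing, because
the stale bit was set before the isolation.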

thanks,
rui


> 
> > 1. is always selected when kicking idle load balance
> > 2. is woken up from the idle loop
> > 3. calls __schedule() but cannot find any task to pull because it is
> >    not in any sched_domain, thus it does nothing and reenters idle.
> > 
> > Solution
> > --------
> > Fix the problem by skipping cpus with no sched domain attached during
> > NOHZ idle balance.
> > 
> > Signed-off-by: Zhang Rui <rui.zhang@...el.com>
> > ---
> >  kernel/sched/fair.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b3e25be58e2b..ea3185a46962 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
> >                 if (ilb == smp_processor_id())
> >                         continue;
> >  
> > +               if (unlikely(on_null_domain(cpu_rq(ilb))))
> > +                       continue;
> > +
> >                 if (idle_cpu(ilb))
> >                         return ilb;
> >         }
> > -- 
> > 2.34.1
> >