linux-kernel - Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b886a9af-6c90-c12a-f700-1a9141e25e7a@arm.com>
Date:   Fri, 8 Sep 2023 11:43:50 +0200
From:   Pierre Gondois <pierre.gondois@....com>
To:     "Zhang, Rui" <rui.zhang@...el.com>,
        "Lu, Aaron" <aaron.lu@...el.com>
Cc:     "peterz@...radead.org" <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Pandruvada, Srinivas" <srinivas.pandruvada@...el.com>,
        "tj@...nel.org" <tj@...nel.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached
 during NOHZ idle balance

Hello Rui,

On 8/14/23 10:30, Zhang, Rui wrote:
> On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
>> Hi Rui,
>>
>> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
>>> Problem statement
>>> -----------------
>>> When using cgroup isolated partition to isolate cpus including
>>> cpu0, it
>>> is observed that cpu0 is woken up frequenctly but doing nothing.
>>> This is
>>> not good for power efficiency.
>>>
>>> <idle>-0     [000]   616.491602: hrtimer_cancel:
>>> hrtimer=0xffff8e8fdf623c10
>>> <idle>-0     [000]   616.491608: hrtimer_start:
>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>> expires=615996000000 softexpires=615996000000
>>> <idle>-0     [000]   616.491616: rcu_utilization:      Start
>>> context switch
>>> <idle>-0     [000]   616.491618: rcu_utilization:      End context
>>> switch
>>> <idle>-0     [000]   616.491637: tick_stop:            success=1
>>> dependency=NONE
>>> <idle>-0     [000]   616.491637: hrtimer_cancel:
>>> hrtimer=0xffff8e8fdf623c10
>>> <idle>-0     [000]   616.491638: hrtimer_start:
>>> hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
>>> expires=616420000000 softexpires=616420000000
>>>
>>> The above pattern repeats every one or multiple ticks, results in
>>> total
>>> 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
>>> cpus that are not in the isolated partition.
>>>
>>> Rootcause
>>> ---------
>>> In NOHZ mode, an active cpu either sends an IPI or touches the idle
>>> cpu's polling flag to wake it up, so that the idle cpu can pull
>>> tasks
>>> from the busy cpu. The logic for selecting the target cpu is to use
>>> the
>>> first idle cpu that presents in both nohz.idle_cpus_mask and
>>> housekeeping_cpumask.
>>>
>>> In the above scenario, when cpu0 is in the cgroup isolated
>>> partition,
>>> its sched domain is deteched, but it is still available in both of
>>> the
>>> above cpumasks. As a result, cpu0
>>
>> I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
>> set itself in nohz.idle_cpus_mask and thus should not be chosen as
>> ilb_cpu. I wonder what's stopping this from working?
> 
> One thing I forgot to mention is that the problem is gone if we offline
> and re-online those cpus. In that case, the isolated cpus are removed
> from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
> added back.
> 
> At runtime, the cpus can be removed from the nohz.idle_cpus_mask in
> below case only
> 	trigger_load_balance()
> 	        if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
> 	                return;
> 	        nohz_balancer_kick(rq);
> 			nohz_balance_exit_idle()
> 
> My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
> isolated, there is no chance to remove it from that mask later, so the
> check in nohz_balance_enter_idle() does not help.


As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
never updated. Instead of avoiding isolated CPUs at each find_new_ilb() call
as this patch does, maybe it would be better to permanently remove these CPUs
from nohz.idle_cpus_mask. This would lower the number of checks done.
This could be done with something like:
@@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
           */
          rq->has_blocked_load = 1;
   
+       /* If we're a completely isolated CPU, we don't play: */
+       if (on_null_domain(rq)) {
+               if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
+                       cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
+                       atomic_dec(&nohz.nr_cpus);
+               }
+
+               /*
+                * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
+                * value that was stored when rebuild_sched_domains() was called.
+                */
+               rq->nohz_tick_stopped = 1;
+       }
+
          /*
           * The tick is still stopped but load could have been added in the
           * meantime. We set the nohz.has_blocked flag to trig a check of the
@@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
          if (rq->nohz_tick_stopped)
                  goto out;
   
-       /* If we're a completely isolated CPU, we don't play: */
-       if (on_null_domain(rq))
-               return;
-
          rq->nohz_tick_stopped = 1;
   
          cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

Otherwise I could reproduce the issue and the patch was solving it, so:
Tested-by: Pierre Gondois <pierre.gondois@....com>





Also, your patch doesn't aim to solve that, but I think there is an issue
when updating cpuset.cpus when an isolated partition was already created:

// Create an isolated partition containing CPU0
# mkdir cgroup
# mount -t cgroup2 none cgroup/
# mkdir cgroup/Testing
# echo "+cpuset" > cgroup/cgroup.subtree_control
# echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
# echo 0 > cgroup/Testing/cpuset.cpus
# echo isolated > cgroup/Testing/cpuset.cpus.partition

// CPU0's sched domain is detached:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0  domain1

// Change the isolated partition to be CPU1
# echo 1 > cgroup/Testing/cpuset.cpus

// CPU[0-1] sched domains are not updated:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0  domain1

Regards,
Pierre