Message-Id: <20230804090858.7605-1-rui.zhang@intel.com>
Date:   Fri,  4 Aug 2023 17:08:58 +0800
From:   Zhang Rui <rui.zhang@...el.com>
To:     mingo@...hat.com, peterz@...radead.org, vincent.guittot@...aro.org
Cc:     linux-kernel@...r.kernel.org, tj@...nel.org,
        srinivas.pandruvada@...el.com
Subject: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

Problem statement
-----------------
When a cgroup isolated partition is used to isolate cpus including cpu0,
cpu0 is observed to be woken up frequently while doing no work. This
hurts power efficiency.

<idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
<idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
<idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
<idle>-0     [000]   616.491618: rcu_utilization:      End context switch
<idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
<idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
<idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000

The above pattern repeats every one or more ticks, resulting in a total
of 2000+ wakeups on cpu0 in 60 seconds while a workload is running on
the cpus that are not in the isolated partition.

Root cause
----------
In NOHZ mode, an active cpu either sends an IPI or touches the idle
cpu's polling flag to wake it up, so that the idle cpu can pull tasks
from the busy cpu. The target cpu is the first idle cpu that is present
in both nohz.idle_cpus_mask and housekeeping_cpumask.

In the above scenario, when cpu0 is in the cgroup isolated partition,
its sched domain is detached, but it is still set in both of the above
cpumasks. As a result, cpu0
1. is always selected when kicking idle load balance
2. is woken up from the idle loop
3. calls __schedule() but cannot find any task to pull because it is not
   in any sched_domain, thus it does nothing and reenters idle.

Solution
--------
Fix the problem by skipping cpus with no sched domain attached during
NOHZ idle balance.
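
The selection logic and the effect of the new check can be illustrated
with a small userspace model. This is only a sketch, not kernel code:
the struct, the function name, the 32-bit masks, the -1 return value and
the skip_null_domain switch are all hypothetical, and the idle mask
stands in for both nohz.idle_cpus_mask and the idle_cpu() test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: bit N of each mask stands for cpu N. */
struct ilb_model {
	uint32_t idle_cpus;	/* models nohz.idle_cpus_mask */
	uint32_t housekeeping;	/* models the housekeeping cpumask */
	uint32_t null_domain;	/* cpus whose sched domain is detached */
};

/*
 * Return the first cpu eligible to run the idle load balance, or -1 if
 * there is none. skip_null_domain == true models the behaviour with the
 * extra check from this patch; false models the behaviour without it.
 */
static int model_find_new_ilb(const struct ilb_model *m, int self,
			      int ncpus, bool skip_null_domain)
{
	uint32_t candidates = m->idle_cpus & m->housekeeping;

	assert(ncpus > 0 && ncpus <= 32);

	for (int cpu = 0; cpu < ncpus; cpu++) {
		uint32_t bit = UINT32_C(1) << cpu;

		if (!(candidates & bit))
			continue;
		if (cpu == self)
			continue;
		/* The added check: a cpu without a sched domain cannot pull tasks. */
		if (skip_null_domain && (m->null_domain & bit))
			continue;
		return cpu;
	}
	return -1;
}
```

With cpu0-2 idle, cpu0-3 in the housekeeping mask, cpu0 on the null
domain and cpu3 as the caller, the model picks cpu0 without the check
and cpu1 with it.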

Signed-off-by: Zhang Rui <rui.zhang@...el.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..ea3185a46962 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11340,6 +11340,9 @@ static inline int find_new_ilb(void)
 		if (ilb == smp_processor_id())
 			continue;
 
+		if (unlikely(on_null_domain(cpu_rq(ilb))))
+			continue;
+
 		if (idle_cpu(ilb))
 			return ilb;
 	}
-- 
2.34.1
