Date:   Fri, 29 Apr 2022 02:24:42 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Mel Gorman <mgorman@...e.de>,
        Yicong Yang <yangyicong@...ilicon.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        Tim Chen <tim.c.chen@...el.com>
Cc:     Chen Yu <yu.chen.surf@...il.com>, Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Barry Song <21cnbao@...il.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Len Brown <len.brown@...el.com>,
        Ben Segall <bsegall@...gle.com>,
        Aubrey Li <aubrey.li@...el.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Zhang Rui <rui.zhang@...el.com>, linux-kernel@...r.kernel.org,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Chen Yu <yu.c.chen@...el.com>
Subject: [PATCH v3] sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU
when the system is overloaded.

The following histogram shows the time spent in select_idle_cpu()
when running 224 instances of netperf on a system with 112 CPUs per
LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have
a negative impact on its latency.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. The indicators are copied here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool: netperf, hackbench,
schbench and tbench were launched with 25%, 50%, 75%, 100%, 125%,
150%, 175% and 200% of the CPU number, respectively. Each test lasts
for 100 seconds and repeats 3 times. The system reboots into a fresh
environment for each test.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the tests above, when the system becomes busy, the
SIS Search Efficiency (se_eff%) drops significantly. Although some
benchmarks eventually find an idle CPU (success_rate% = 100%), it is
doubtful whether searching the whole LLC domain is worth the cost.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
how many CPUs must a wakeup path walk down before it can find an idle
CPU? Many potential metrics could be used to predict this number.
One candidate is the sum of util_avg in the LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics (such
as rq->nr_running). Besides, the sum of util_avg helps predict the
load of the LLC domain more precisely than SIS_PROP, which uses one
CPU's idle time to estimate the idle time of the whole LLC domain.
As Peter suggested [2], the lower the util_avg is, the more
select_idle_cpu() should scan for an idle CPU, and vice versa.

Introduce the quadratic function:

y = a - bx^2

where x is the sum_util ratio [0, 100] of the LLC domain, and y is the
percentage of CPUs to be scanned in the LLC domain. The number of CPUs
to search drops as sum_util increases; when sum_util hits 85% or above,
the scan stops. 85% was chosen because it is the threshold at which an
LLC sched group is regarded as overloaded (imbalance_pct = 117). A
quadratic function was chosen because:

[1] Compared to a linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to an exponential function, it is easier to calculate.
[3] There seems to be no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned, so use a heuristic scan for
    now.

The steps to calculate scan_nr are as follows:
[1] scan_percent = 100 - (x/8.5)^2
    when the utilization reaches 85%, scan_percent becomes 0.
[2] scan_nr = nr_llc * scan_percent / 100
[3] scan_nr = max(scan_nr, 0)

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85 ...
scan_nr   112  112  108  103  92  80  64   47   24    0 ...
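
As a sanity check, the table above can be reproduced with the same
integer arithmetic the patch uses in update_idle_cpu_scan(). The
following stand-alone C sketch is only an illustration; its util_pct
input corresponds to the llc_util_pct that the kernel code derives
from sum_util:

#include <stdio.h>

/* Illustrative only: y = 100 - x^2/72, scan_nr = nr_llc * y / 100 */
static int calc_nr_scan(int util_pct, int nr_llc)
{
	int nr_scan = (100 - (util_pct * util_pct / 72)) * nr_llc / 100;

	return nr_scan > 0 ? nr_scan : 0;
}

int main(void)
{
	int pct[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85 };
	int i;

	for (i = 0; i < 10; i++)
		printf("sum_util%% %3d -> scan_nr %3d\n",
		       pct[i], calc_nr_scan(pct[i], 112));

	return 0;
}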

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from the periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by the periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in
reflecting the latest utilization. But it is a trade-off: checking
util_avg in the newidle load balance would be more frequent, but it
brings overhead - multiple CPUs writing/reading the per-LLC shared
variable would introduce cache false sharing. And Tim also mentioned
that it is acceptable to be non-optimal in terms of scheduling for
short-term variations; if there is a long-term trend in the load
behavior, the scheduler can adjust for that.
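(To put the 8.75% in context: PELT's half-life is 32 ms, so 112 ms
spans 112/32 = 3.5 half-lives, and the old signal decays by
0.5^3.5 = 0.5 * 0.5 * 0.5 * sqrt(0.5), which is the approximation
0.5 * 0.5 * 0.5 * 0.7 above.)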

SIS_UTIL is disabled by default. When it is enabled, select_idle_cpu()
will use the nr_scan calculated by SIS_UTIL instead of the one from
SIS_PROP. Later, SIS_UTIL and SIS_PROP could be made mutually exclusive.
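
For testing, the feature can be toggled at runtime through the
scheduler's debugfs knob (assuming debugfs is mounted at
/sys/kernel/debug, the usual location):

	echo SIS_UTIL > /sys/kernel/debug/sched/features

and turned off again with NO_SIS_UTIL.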

[Test result]

The following is the benchmark result comparison between
baseline:vanilla and compare:patched kernel. Positive compare%
indicates better performance.

netperf.throughput
each thread: netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.40)	 +1.14 (  0.37)
TCP_RR          	56 threads	 1.00 (  0.49)	 +0.62 (  0.31)
TCP_RR          	84 threads	 1.00 (  0.50)	 +0.26 (  0.55)
TCP_RR          	112 threads	 1.00 (  0.27)	 +0.29 (  0.28)
TCP_RR          	140 threads	 1.00 (  0.22)	 +0.14 (  0.23)
TCP_RR          	168 threads	 1.00 (  0.21)	 +0.40 (  0.19)
TCP_RR          	196 threads	 1.00 (  0.18)	+183.40 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+188.44 (  9.29)
UDP_RR          	28 threads	 1.00 (  0.47)	 +1.45 (  0.47)
UDP_RR          	56 threads	 1.00 (  0.28)	 -0.22 (  0.30)
UDP_RR          	84 threads	 1.00 (  0.38)	 +1.72 ( 27.10)
UDP_RR          	112 threads	 1.00 (  0.16)	 +0.01 (  0.18)
UDP_RR          	140 threads	 1.00 ( 14.10)	 +0.32 ( 11.15)
UDP_RR          	168 threads	 1.00 ( 12.75)	 +0.91 ( 11.62)
UDP_RR          	196 threads	 1.00 ( 14.41)	+191.97 ( 19.34)
UDP_RR          	224 threads	 1.00 ( 15.34)	+194.88 ( 17.06)

Taking 224 threads as an example, the changes in the SIS search
metrics are illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

CPU scans drop by 87.9% after the patch, which indicates lower overhead.
Besides, with this patch applied, there is -13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This could help explain the performance improvement: the patch allows
the waking task to remain on its previous CPU, rather than grabbing
another CPU's runqueue lock.

Other benchmarks:

hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  0.09)	 -0.54 (  0.82)
process-pipe    	2 groups 	 1.00 (  0.47)	 +0.89 (  0.61)
process-pipe    	4 groups 	 1.00 (  0.83)	 +0.90 (  0.15)
process-pipe    	8 groups 	 1.00 (  0.09)	 +0.31 (  0.07)
process-sockets 	1 group 	 1.00 (  0.13)	 -0.58 (  0.49)
process-sockets 	2 groups 	 1.00 (  0.41)	 -0.58 (  0.52)
process-sockets 	4 groups 	 1.00 (  0.61)	 -0.37 (  0.50)
process-sockets 	8 groups 	 1.00 (  0.22)	 +1.15 (  0.10)
threads-pipe    	1 group 	 1.00 (  0.35)	 -0.28 (  0.78)
threads-pipe    	2 groups 	 1.00 (  0.65)	 +0.03 (  0.96)
threads-pipe    	4 groups 	 1.00 (  0.43)	 +0.81 (  0.38)
threads-pipe    	8 groups 	 1.00 (  0.11)	 -1.56 (  0.07)
threads-sockets 	1 group 	 1.00 (  0.30)	 -0.39 (  0.41)
threads-sockets 	2 groups 	 1.00 (  0.21)	 -0.23 (  0.27)
threads-sockets 	4 groups 	 1.00 (  0.23)	 +0.36 (  0.19)
threads-sockets 	8 groups 	 1.00 (  0.13)	 +1.57 (  0.06)

tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.15)	 +1.05 (  0.08)
loopback        	56 threads	 1.00 (  0.09)	 +0.36 (  0.04)
loopback        	84 threads	 1.00 (  0.12)	 +0.26 (  0.06)
loopback        	112 threads	 1.00 (  0.12)	 +0.04 (  0.09)
loopback        	140 threads	 1.00 (  0.04)	 +2.98 (  0.18)
loopback        	168 threads	 1.00 (  0.10)	 +2.88 (  0.30)
loopback        	196 threads	 1.00 (  0.06)	 +2.63 (  0.03)
loopback        	224 threads	 1.00 (  0.08)	 +2.60 (  0.06)

schbench.latency_90%_us
========
case            	load    	baseline	compare%
normal          	1 mthread	 1.00   	 -1.7%
normal          	2 mthreads	 1.00   	 +1.6%
normal          	4 mthreads	 1.00   	 +1.4%
normal          	8 mthreads	 1.00   	+21.0%

Limitations:
[1]
This patch is based on util_avg, which is very sensitive to CPU
frequency invariance. If the max frequency has been limited by the
user, util_avg decays quite fast while the CPU is idle. Patch [3]
should be applied if turbo is disabled manually on Intel platforms.

[2]
There may be unbalanced tasks among CPUs due to CPU affinity. For example,
suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to
CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, and thus CPU7 remains undetected.

A possible workaround to mitigate this problem is to increase nr_scan
if idle CPUs are detected during the periodic load balance. The problem
could also be mitigated by the idle load balance, through which CPU7
might pull some tasks.

[3]
Prateek mentioned that we should scan aggressively in an LLC domain
with 16 CPUs, because the cost of searching for an idle CPU among 16
is negligible. The current patch aims to be a generic solution and
only considers util_avg. A follow-up change could enhance the scan
policy to adjust the scan_percent according to the number of CPUs in
the LLC.

v2->v3:
 - Use 85% as the threshold again, because the CPU frequency invariance issue
   has been fixed and the patch is queued for 5.19.

 - Stop the scan if 85% is reached, rather than scanning for at least 4 CPUs.
   According to the feedback from Yicong, it might be better to stop scanning
   entirely when the LLC is overloaded.

 - Replace the linear scan with a quadratic function scan, to let SIS scan
   aggressively when the LLC is not busy. Prateek mentioned there was a
   slight regression with ycsb-mongodb in v2, which might be due to fewer
   CPUs being scanned when the utilization is around 20%.

 - Add back the logic to stop the CPU scan even if has_idle_core is true.
   It might be a waste of time to search for an idle core if the LLC is
   overloaded. Besides, according to the tbench results from Prateek,
   stopping the idle core scan brings extra performance improvement.

 - Provide the SIS search statistics in the commit log, based on Mel
   Gorman's patch, as suggested by Abel.

 - Introduce the SIS_UTIL sched feature rather than changing the logic of
   SIS_PROP directly, which makes the change easier to review.

v1->v2:
 - As suggested by Peter, introduce an idle CPU scan strategy based on the
   util_avg metric. When util_avg is very low it scans more, while when
   util_avg hits the threshold the scan stops entirely. The threshold has
   been decreased from 85% to 50%, because this is the threshold when the
   CPU is nearly 100% busy but with turbo disabled. Scan for at least 4
   CPUs even when the LLC is overloaded, to keep it consistent with the
   current logic of select_idle_cpu().

v1:
 - Stop the idle CPU scan in select_idle_cpu() if the sum of util_avg in
   the LLC domain has reached 85%.

[Resend to include the missing mailing list, sorry for any inconvenience.]

Link: https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net #1
Link: https://lore.kernel.org/lkml/20220207135253.GF23216@worktop.programming.kicks-ass.net #2
Link: https://lore.kernel.org/lkml/20220407234258.569681-1-yu.c.chen@intel.com #3
Suggested-by: Tim Chen <tim.c.chen@...el.com>
Suggested-by: Peter Zijlstra <peterz@...radead.org>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 56 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 58 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..816df6cc444e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23c7d0f617ee..50c9d5b2b338 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6327,6 +6327,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6366,6 +6367,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
+	if (sched_feat(SIS_UTIL)) {
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share) {
+			/* because !--nr is the condition to stop scan */
+			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			/* overloaded LLC is unlikely to have idle cpu/core */
+			if (nr == 1)
+				return -1;
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -9267,6 +9279,46 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline void update_idle_cpu_scan(struct lb_env *env,
+					unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int nr_scan, nr_llc, llc_util_pct;
+
+	if (!sched_feat(SIS_UTIL))
+		return;
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of this hint
+	 * occurs during periodic load balancing, rather than frequent
+	 * newidle balance.
+	 */
+	nr_llc = per_cpu(sd_llc_size, env->dst_cpu);
+	if (env->idle == CPU_NEWLY_IDLE ||
+	    env->sd->span_weight != nr_llc)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * The number of CPUs to search drops as sum_util increases, when
+	 * sum_util hits 85% or above, the scan stops.
+	 * The reason to choose 85% as the threshold is because this is the
+	 * imbalance_pct when a LLC sched group is overloaded.
+	 * let y = 100 - (x/8.5)^2 = 100 - x^2/72
+	 * y is the percentage of CPUs to be scanned in the LLC
+	 * domain, x is the ratio of sum_util compared to the
+	 * CPU capacity, which ranges in [0, 100], thus
+	 * nr_scan = nr_llc * y / 100
+	 */
+	llc_util_pct = (sum_util * 100) / (nr_llc * SCHED_CAPACITY_SCALE);
+	nr_scan = (100 - (llc_util_pct * llc_util_pct / 72)) * nr_llc / 100;
+	nr_scan = max(nr_scan, 0);
+	WRITE_ONCE(sd_share->nr_idle_scan, nr_scan);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -9279,6 +9331,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
+	unsigned long sum_util = 0;
 	int sg_status = 0;
 
 	do {
@@ -9311,6 +9364,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
 
+		sum_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
@@ -9336,6 +9390,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_idle_cpu_scan(env, sum_util);
 }
 
 #define NUMA_IMBALANCE_MIN 2
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1cf435bbcd9c..69be099019f4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,6 +61,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_UTIL, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1
