linux-kernel - Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched

Message-ID: <ZjpFruUiBiNi6VSO@chenyu5-mobl2>
Date: Tue, 7 May 2024 23:15:58 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: <mingo@...hat.com>, <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <bristot@...hat.com>, <vschneid@...hat.com>,
	<linux-kernel@...r.kernel.org>, <kprateek.nayak@....com>,
	<wuyun.abel@...edance.com>, <tglx@...utronix.de>, <efault@....de>,
	<tim.c.chen@...el.com>, <yu.c.chen.y@...il.com>
Subject: Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to
 set request/slice suggestion

On 2024-04-05 at 12:28:04 +0200, Peter Zijlstra wrote:
> Allow applications to directly set a suggested request/slice length using
> sched_attr::sched_runtime.
> 
> The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
> 
> Applications should strive to use their periodic runtime at a high
> confidence interval (95%+) as the target slice. Using a smaller slice
> will introduce undue preemptions, while using a larger value will
> increase latency.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>

Is it possible to leverage this task slice to do better task wakeup placement?
The idea is that the smaller the wakee's slice, the fewer idle CPUs it
should scan. This can reduce wakeup latency and inhibit costly task migration,
especially on large systems.

We did some experiments and got some performance improvements:


From 9cb806476586d7048fcbd0f66d0101f0dbb8fd2b Mon Sep 17 00:00:00 2001
From: Chen Yu <yu.c.chen@...el.com>
Date: Tue, 7 May 2024 22:36:29 +0800
Subject: [RFC PATCH] sched/eevdf: Use customized slice to reduce wakeup latency
 and inhibit task migration

Problem 1:
The overhead of task migration is high on many-core systems. It brings a
performance penalty due to broken cache locality and higher cache-to-cache
latency.

Problem 2:
During wakeup, searching for an idle CPU is costly on many-core systems.
Besides, accessing other CPUs' rq statistics causes cache contention:

available_idle_cpu(cpu) -> idle_cpu(cpu) -> {rq->curr, rq->nr_running}

Although SIS_UTIL throttles the scan depth based on system utilization,
there is a requirement to further limit the scan depth for specific workloads,
especially for short-duration wakees.

Now we have the interface to customize the request/slice. The smaller the
slice is, the earlier the task can be picked up, and the lower the wakeup
latency the task expects. Leverage the wakee's slice to further throttle the
idle CPU scan depth: the shorter the slice, the fewer CPUs to scan.
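
To make the scaling concrete, here is a rough user-space model of the
arithmetic (not kernel code). The function name scale_idle_scan() and its
parameters are made up for illustration; base_slice_ns stands in for
sysctl_sched_base_slice and nr for the SIS_UTIL scan budget:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model only: scale the idle-CPU scan budget 'nr' by the
 * ratio slice/base_slice. A wakee without a custom slice, or with a
 * slice at or above the base slice, keeps the full budget.
 * All names here are hypothetical, chosen for this sketch.
 */
int scale_idle_scan(int nr, int has_custom_slice,
		    uint64_t slice_ns, uint64_t base_slice_ns)
{
	assert(base_slice_ns > 0);	/* avoid dividing by zero */

	if (!has_custom_slice || slice_ns >= base_slice_ns)
		return nr;

	/* Integer division truncates, so small budgets can scale to 0. */
	return (int)(((uint64_t)nr * slice_ns) / base_slice_ns);
}
```

With a 3 msec base slice and a 0.1 msec wakee slice, a budget of 60 CPUs
shrinks to 60 * 0.1 / 3 = 2, and a budget of 20 truncates to 0.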

Tested on a 240-CPU, 2-socket system. With SNC (sub-NUMA cluster) enabled,
each LLC domain has 60 CPUs. There is a noticeable improvement in netperf.
(With SNC disabled, more improvement should be seen because the C2C latency
is higher.)

The global slice (sysctl_sched_base_slice) is 3 msec by default on my Ubuntu
22.04, and the customized slice is set to 0.1 msec for both netperf and netserver:

for i in $(seq 1 $job); do
	netperf_slice -e 100000 -4 -H 127.0.0.1 -t TCP_RR -c -C -l 100 &
done
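
For reference, a minimal sketch of how a task could request such a slice
through the raw sched_setattr() syscall. The helper names (slice_attr,
clamp_slice_ns, fill_slice_attr, request_slice) are invented for this
example and are not taken from netperf_slice. The struct is a local copy of
the 48-byte first-version sched_attr layout, and the clamp only mirrors the
0.1 msec .. 100 msec range stated in the quoted patch, which says the kernel
clamps on its own side:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Local copy of the first-version sched_attr layout. Named slice_attr
 * (a made-up name) to avoid clashing with libc headers that may already
 * declare struct sched_attr.
 */
struct slice_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;	/* suggested slice, in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

#define SLICE_MIN_NS 100000ULL		/* 0.1 msec, per the quoted patch */
#define SLICE_MAX_NS 100000000ULL	/* 100 msec, per the quoted patch */

/* Mirror the documented clamp so the request is sane before the call. */
uint64_t clamp_slice_ns(uint64_t ns)
{
	if (ns < SLICE_MIN_NS)
		return SLICE_MIN_NS;
	if (ns > SLICE_MAX_NS)
		return SLICE_MAX_NS;
	return ns;
}

/* Build a request for the normal (SCHED_OTHER == 0) policy. */
void fill_slice_attr(struct slice_attr *attr, uint64_t slice_ns)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_policy = 0;	/* SCHED_OTHER */
	attr->sched_runtime = clamp_slice_ns(slice_ns);
	assert(attr->size == 48);	/* size of the first-version layout */
}

/*
 * Ask the kernel to apply the slice to 'pid' (0 means the caller).
 * Note: this also sets sched_nice to 0, since the field is zeroed.
 * Returns 0 on success, -1 with errno set on failure.
 */
int request_slice(pid_t pid, uint64_t slice_ns)
{
	struct slice_attr attr;

	fill_slice_attr(&attr, slice_ns);
	return (int)syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

Calling request_slice(0, 100000) would correspond to the 0.1 msec setting
used in this test, assuming the -e argument above is the slice in
nanoseconds.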

case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	60-threads	 1.00 (  1.60)	 +0.35 (  1.73)
TCP_RR          	120-threads	 1.00 (  1.34)	 -0.96 (  1.24)
TCP_RR          	180-threads	 1.00 (  1.59)	+92.20 (  4.24)
TCP_RR          	240-threads	 1.00 (  9.71)	+43.11 (  2.97)

Suggested-by: Tim Chen <tim.c.chen@...el.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
 kernel/sched/fair.c     | 23 ++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edc23f6588a3..f269ae7d6e24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7368,6 +7368,24 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+/*
+ * Scale the scan number of idle CPUs according to customized
+ * wakee's slice. The smaller the slice is, the earlier the task
+ * wants to be picked up, thus the lower wakeup latency the task expects.
+ * The baseline is the global sysctl_sched_base_slice. Task slice
+ * smaller than the global one would shrink the scan number.
+ */
+static int adjust_idle_scan(struct task_struct *p, int nr)
+{
+	if (!sched_feat(SIS_FAST))
+		return nr;
+
+	if (!p->se.custom_slice || p->se.slice >= sysctl_sched_base_slice)
+		return nr;
+
+	return div_u64(nr * p->se.slice, sysctl_sched_base_slice);
+}
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
-			/* because !--nr is the condition to stop scan */
-			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
 			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
+			if (nr <= 0)
 				return -1;
 		}
 	}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..176324236018 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_FAST, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1