lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 25 Jul 2023 15:30:48 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Aaron Lu <aaron.lu@...el.com>, x86@...nel.org
Subject: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

Allow select_task_rq to consider a cpu as idle for 1ms after that cpu
has exited the idle loop.

This speeds up the following hackbench workload on a 192 cores AMD EPYC
9654 96-Core Processor (over 2 sockets):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 34s. (30% speedup)

My working hypothesis for why this helps is: queuing more than a single
task on the runqueue of a cpu which just exited idle rather than
spreading work over other idle cpus helps power efficiency on systems
with large number of cores.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench.

It turned out that changing this raw spinlock for a loop of 10000x
cpu_relax within do_idle() had similar benefits.

This patch achieve a similar effect without the busy-waiting by
introducing a runqueue state sampling the sched_clock() when exiting
idle, which allows select_task_rq to consider "as idle" a cpu which has
recently exited idle.

This patch should be considered "food for thoughts", and I would be glad
to hear feedback on whether it causes regressions on _other_ workloads,
and whether it helps with the hackbench workload on large Intel system
as well.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@....com>
Cc: Aaron Lu <aaron.lu@...el.com>
Cc: x86@...nel.org
---
 kernel/sched/core.c  | 4 ++++
 kernel/sched/sched.h | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d40e3a0a5ced 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6769,6 +6769,7 @@ void __sched schedule_idle(void)
 	 * TASK_RUNNING state.
 	 */
 	WARN_ON_ONCE(current->__state);
+	WRITE_ONCE(this_rq()->idle_end_time, sched_clock());
 	do {
 		__schedule(SM_NONE);
 	} while (need_resched());
@@ -7300,6 +7301,9 @@ int idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	if (sched_clock() < READ_ONCE(rq->idle_end_time) + IDLE_CPU_DELAY_NS)
+		return 1;
+
 	if (rq->curr != rq->idle)
 		return 0;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81ac605b9cd5..8932e198a33a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -97,6 +97,8 @@
 # define SCHED_WARN_ON(x)      ({ (void)(x), 0; })
 #endif
 
+#define IDLE_CPU_DELAY_NS	1000000		/* 1ms */
+
 struct rq;
 struct cpuidle_state;
 
@@ -1010,6 +1012,7 @@ struct rq {
 
 	struct task_struct __rcu	*curr;
 	struct task_struct	*idle;
+	u64			idle_end_time;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
 	struct mm_struct	*prev_mm;
-- 
2.39.2

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ