[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240809092240.6921-1-kprateek.nayak@amd.com>
Date: Fri, 9 Aug 2024 09:22:40 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: [PATCH v2] sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
From: Peter Zijlstra <peterz@...radead.org>
Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
IPI without actually sending an interrupt. Even in cases where the IPI
handler does not queue a task on the idle CPU, do_idle() will call
__schedule() since need_resched() returns true in these cases.
Introduce and use SM_IDLE to identify call to __schedule() from
schedule_idle() and shorten the idle re-entry time by skipping
pick_next_task() when nr_running is 0 and the previous task is the idle
task.
With the SM_IDLE fast-path, the time taken to complete a fixed set of
IPIs using ipistorm improves noticeably. Following are the numbers
from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and
3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled)
running ipistorm between CPU8 and CPU16:
cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
==================================================================
Test : ipistorm (modified)
Units : Normalized runtime
Interpretation: Lower is better
Statistic : AMean
======================= Intel Ice Lake Xeon ======================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.80 [20.51%]
==================== 3rd Generation AMD EPYC =====================
kernel: time [pct imp]
tip:sched/core 1.00 [baseline]
tip:sched/core + SM_IDLE 0.90 [10.17%]
==================================================================
[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]
Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
Not-yet-signed-off-by: Peter Zijlstra <peterz@...radead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
v1..v2:
- Fixed SM_RTLOCK_WAIT being considered as preemption for task state
change on PREEMPT_RT kernels. Since (sched_mode & SM_MASK_PREEMPT) was
used in a couple of places, I decided to reuse the preempt variable.
(Vincent, Peter)
- Seperated this patch from the newidle_balance() fixes series since
there are PREEMPT_RT bits that requires deeper review whereas this is
an independent enhancement on its own.
- Updated the numbers based on latest tip:sched/core. In my testing the
v6.11-rc1 based tip gives better IPI throughput out of the box which
is why the improvements are respectable and not as massive as what was
reported on v6.10 based tip in v1.
This series is based on tip:sched/core at commit cea5a3472ac4
("sched/fair: Cleanup fair_server")
v1: https://lore.kernel.org/all/20240710090210.41856-1-kprateek.nayak@amd.com/
---
kernel/sched/core.c | 45 ++++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 29fde993d3f8..6d55a30bb017 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6380,19 +6380,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* Constants for the sched_mode argument of __schedule().
*
* The mode argument allows RT enabled kernels to differentiate a
- * preemption from blocking on an 'sleeping' spin/rwlock. Note that
- * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to
- * optimize the AND operation out and just check for zero.
+ * preemption from blocking on an 'sleeping' spin/rwlock.
*/
-#define SM_NONE 0x0
-#define SM_PREEMPT 0x1
-#define SM_RTLOCK_WAIT 0x2
-
-#ifndef CONFIG_PREEMPT_RT
-# define SM_MASK_PREEMPT (~0U)
-#else
-# define SM_MASK_PREEMPT SM_PREEMPT
-#endif
+#define SM_IDLE (-1)
+#define SM_NONE 0
+#define SM_PREEMPT 1
+#define SM_RTLOCK_WAIT 2
/*
* __schedule() is the main scheduler function.
@@ -6433,9 +6426,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* WARNING: must be called with preemption disabled!
*/
-static void __sched notrace __schedule(unsigned int sched_mode)
+static void __sched notrace __schedule(int sched_mode)
{
struct task_struct *prev, *next;
+ /*
+ * On PREEMPT_RT kernel, SM_RTLOCK_WAIT is noted
+ * as a preemption by schedule_debug() and RCU.
+ */
+ bool preempt = sched_mode > SM_NONE;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
@@ -6446,13 +6444,13 @@ static void __sched notrace __schedule(unsigned int sched_mode)
rq = cpu_rq(cpu);
prev = rq->curr;
- schedule_debug(prev, !!sched_mode);
+ schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
hrtick_clear(rq);
local_irq_disable();
- rcu_note_context_switch(!!sched_mode);
+ rcu_note_context_switch(preempt);
/*
* Make sure that signal_pending_state()->signal_pending() below
@@ -6481,12 +6479,20 @@ static void __sched notrace __schedule(unsigned int sched_mode)
switch_count = &prev->nivcsw;
+ /* Task state changes only considers SM_PREEMPT as preemption */
+ preempt = sched_mode == SM_PREEMPT;
+
/*
* We must load prev->state once (task_struct::state is volatile), such
* that we form a control dependency vs deactivate_task() below.
*/
prev_state = READ_ONCE(prev->__state);
- if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
+ if (sched_mode == SM_IDLE) {
+ if (!rq->nr_running) {
+ next = prev;
+ goto picked;
+ }
+ } else if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
@@ -6520,6 +6526,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}
next = pick_next_task(rq, prev, &rf);
+picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
@@ -6561,7 +6568,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
psi_account_irqtime(rq, prev, next);
psi_sched_switch(prev, next, !task_on_rq_queued(prev));
- trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
+ trace_sched_switch(preempt, prev, next, prev_state);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
@@ -6637,7 +6644,7 @@ static void sched_update_worker(struct task_struct *tsk)
}
}
-static __always_inline void __schedule_loop(unsigned int sched_mode)
+static __always_inline void __schedule_loop(int sched_mode)
{
do {
preempt_disable();
@@ -6682,7 +6689,7 @@ void __sched schedule_idle(void)
*/
WARN_ON_ONCE(current->__state);
do {
- __schedule(SM_NONE);
+ __schedule(SM_IDLE);
} while (need_resched());
}
base-commit: cea5a3472ac43f18590e1bd6b842f808347a810c
--
2.34.1
Powered by blists - more mailing lists