linux-kernel - [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240119084548.2788-1-kprateek.nayak@amd.com>
Date: Fri, 19 Jan 2024 14:15:48 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: <linux-kernel@...r.kernel.org>
CC: <mingo@...hat.com>, <peterz@...radead.org>, <juri.lelli@...hat.com>,
	<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
	<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
	<bristot@...hat.com>, <vschneid@...hat.com>, <gautham.shenoy@....com>,
	<kprateek.nayak@....com>
Subject: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is woken up to process an IPI

When measuring IPI throughput using a modified version of Anton
Blanchard's ipistorm benchmark [1], configured to measure time taken to
perform a fixed number of smp_call_function_single() (with wait set to
1), an increase in benchmark time was observed between v5.7 and v5.8.
Bisection pointed to commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") as the reason behind this increase in
runtime.

Since a clean revert is not possible with the latest kernel, to restore
the older behavior, the call_function_single_prep_ipi() check in
send_call_function_single_ipi() was skipped leading to
send_call_function_single_ipi() always calling
arch_send_call_function_single_ipi(). The
flush_smp_call_function_queue() introduced in do_idle() by the same
commit was removed as well since the above change would unconditionally
send an IPI to the idle CPU in TIF_POLLING mode.

With the above changes (which will be referred to as "revert"
henceforth) the ipistorm benchmark time improves. Following are
normalized results from test runs (average runtime from 10 runs of
ipistorm) on a dual socket 4th Generation EPYC system (2 x 128C/256T):

cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

  ==================================================================
  Test          : ipistorm (modified)
  Units         : Normalized runtime
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  kernel:		time [pct imp]
  v6.7-rc5		1.00 [0.00]
  v6.7-rc5 + revert	0.66 [34.41]

Although the revert improves ipistorm performance, it also regresses
tbench and netperf when applied on top of latest tip:sched/core
supporting the validity of the optimization

tip:sched/core was at tag "sched-core-2024-01-08" during the test. C2
was disabled but boost remained on coupled with performance governor for
all the tests.

  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)       revert[pct imp](CV)
      1     1.00 [  0.00]( 0.24)     0.91 [ -8.96]( 0.30)
      2     1.00 [  0.00]( 0.25)     0.92 [ -8.20]( 0.97)
      4     1.00 [  0.00]( 0.23)     0.91 [ -9.20]( 1.75)
      8     1.00 [  0.00]( 0.69)     0.91 [ -9.48]( 1.56)
     16     1.00 [  0.00]( 0.66)     0.92 [ -8.49]( 2.43)
     32     1.00 [  0.00]( 0.96)     0.89 [-11.13]( 0.96)
     64     1.00 [  0.00]( 1.06)     0.90 [ -9.72]( 2.49)
    128     1.00 [  0.00]( 0.70)     0.92 [ -8.36]( 1.26)
    256     1.00 [  0.00]( 0.72)     0.97 [ -3.30]( 1.10)
    512     1.00 [  0.00]( 0.42)     0.98 [ -1.73]( 0.37)
   1024     1.00 [  0.00]( 0.28)     0.99 [ -1.39]( 0.43)

  ==================================================================
  Test          : netperf
  Units         : Normalized Througput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)       revert[pct imp](CV)
   1-clients     1.00 [  0.00]( 0.50)     0.89 [-10.51]( 0.20)
   2-clients     1.00 [  0.00]( 1.16)     0.89 [-11.10]( 0.59)
   4-clients     1.00 [  0.00]( 1.03)     0.89 [-10.68]( 0.38)
   8-clients     1.00 [  0.00]( 0.99)     0.89 [-10.54]( 0.50)
  16-clients     1.00 [  0.00]( 0.87)     0.89 [-10.92]( 0.95)
  32-clients     1.00 [  0.00]( 1.24)     0.89 [-10.85]( 0.63)
  64-clients     1.00 [  0.00]( 1.58)     0.90 [-10.11]( 1.18)
  128-clients    1.00 [  0.00]( 0.87)     0.89 [-10.94]( 1.11)
  256-clients    1.00 [  0.00]( 4.77)     1.00 [ -0.16]( 3.45)
  512-clients    1.00 [  0.00](56.16)     1.02 [  2.10](56.05)

Looking further into the difference in the code path taken, it was
noticed that to pull a TIF_POLLING thread out of idle to process an IPI,
the sender sets the TIF_NEED_RESCHED bit in the idle task's thread info.
As a result, the scheduler expects a task to be enqueued when exiting
the idle path. This is not the case with non-polling idle states where
the idle CPU exits the non-polling idle state to process the interrupt,
and since need_resched() returns false, soon goes back to idle again.

When TIF_NEED_RESCHED flag is set, do_idle() will call schedule_idle(),
a large part of which runs with local IRQ disabled. In case of ipistorm,
when measuring IPI throughput, this large IRQ disabled section delays
processing of IPIs. Further auditing revealed that in absence of any
runnable tasks, pick_next_task_fair(), which is called from the
pick_next_task() fast path, will always call newidle_balance() in this
scenario, further increasing the time spent in the IRQ disabled section.

In cases where an idle CPU in TIF_POLLING mode wakes up to process only
an interrupt, with no new tasks enqueued, skip newidle_balance(). This
also makes the behavior consistent with scenarios where an idle CPU in
non-polling mode is woken up only to process an interrupt where
need_resched() will return false leading to an immediate idle re-entry.

Following is the ipistorm results on the same test system with this
patch:

  ==================================================================
  Test          : ipistorm (modified)
  Units         : Normalized runtime
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  kernel:		time [pct imp]
  v6.7-rc5		1.00 [0.00]
  v6.7-rc5 + revert	0.66 [34.41]
  v6.7-rc5 + patch	0.55 [44.81]

netperf and tbench results with the patch match the results on tip.
Additionally, hackbench, stream, and schbench too were tested, with
results from the patched kernel matching that of the tip.

Link: https://github.com/antonblanchard/ipistorm [1]
Suggested-by: Gautham R. Shenoy <gautham.shenoy@....com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
This patch is based on tip:sched/core at tag "sched-core-2024-01-08".
---
 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a03..1fedc7e29c98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8499,6 +8499,16 @@ done: __maybe_unused;
 	if (!rf)
 		return NULL;
 
+	/*
+	 * An idle CPU in TIF_POLLING mode might end up here after processing
+	 * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
+	 * sending an actual IPI. In such cases, where an idle CPU was woken
+	 * up only to process an interrupt, without necessarily queuing a task
+	 * on it, skip newidle_balance() to facilitate faster idle re-entry.
+	 */
+	if (prev == rq->idle)
+		return NULL;
+
 	new_tasks = newidle_balance(rq, rf);
 
 	/*
-- 
2.25.1