Message-ID: <20220910105326.1797-4-kprateek.nayak@amd.com>
Date:   Sat, 10 Sep 2022 16:23:24 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     <linux-kernel@...r.kernel.org>
CC:     <aubrey.li@...ux.intel.com>, <efault@....de>,
        <gautham.shenoy@....com>, <libo.chen@...cle.com>,
        <mgorman@...hsingularity.net>, <mingo@...nel.org>,
        <peterz@...radead.org>, <song.bao.hua@...ilicon.com>,
        <srikar@...ux.vnet.ibm.com>, <tglx@...utronix.de>,
        <valentin.schneider@....com>, <vincent.guittot@...aro.org>,
        <wuyun.abel@...edance.com>, <wyes.karny@....com>,
        <yu.c.chen@...el.com>, <yangyicong@...wei.com>
Subject: [PATCH 3/5] sched/fair: Add support for hints in the subsequent wakeup path

Hints are adhered to as long as there are idle cores in the target MC
domain. Beyond that, the default behavior is followed.

- Hinting flow in the wakeup path

Following is the flow with wakeup hints:

o Check if the task has a wakeup hint set and whether the current
  CPU and the CPU where the task previously ran are on two different
  LLCs. If either is false, bail out and follow the default logic.
o Check whether the previous CPU or the current CPU is the desired
  CPU according to the set hint.
o Test for idle cores in the MC domain of the hinted CPU.
o If an idle core is found, set the desired CPU as the wakeup target.
  The scheduler will then look for an idle CPU within the MC domain of
  the target.
o If test_idle_cores() returns false, follow the default wakeup path.
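
The flow above can be sketched as a small userspace model. This is only an
illustration, not the kernel code: the hint bit values, the toy topology
(two LLCs of four CPUs each), and every toy_* helper are assumptions made
for the example; the real PR_SCHED_HINT_* values are defined elsewhere in
the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define PR_SCHED_HINT_WAKE_AFFINE (1U << 0)
#define PR_SCHED_HINT_WAKE_HOLD   (1U << 1)

#define NR_CPUS      8
#define CPUS_PER_LLC 4	/* toy topology: CPUs 0-3 in LLC 0, CPUs 4-7 in LLC 1 */

/* Toy per-LLC flag standing in for what test_idle_cores() reports. */
bool llc_has_idle_core[NR_CPUS / CPUS_PER_LLC];

static int toy_llc_of(int cpu)
{
	return cpu / CPUS_PER_LLC;
}

static bool toy_cpus_share_cache(int a, int b)
{
	return toy_llc_of(a) == toy_llc_of(b);
}

static bool toy_test_idle_cores(int cpu)
{
	return llc_has_idle_core[toy_llc_of(cpu)];
}

/*
 * Return the CPU the hint asks for, or -1 when the default wakeup
 * path should be followed instead.
 */
int pick_hinted_target(unsigned int hint, int cpu, int prev_cpu)
{
	unsigned int wakeup_hint = hint &
		(PR_SCHED_HINT_WAKE_AFFINE | PR_SCHED_HINT_WAKE_HOLD);
	int target_cpu;

	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);

	/* No hint, or waker and previous CPU already share an LLC. */
	if (!wakeup_hint || toy_cpus_share_cache(cpu, prev_cpu))
		return -1;

	/* WAKE_AFFINE prefers the waker's CPU, WAKE_HOLD the previous CPU. */
	target_cpu = (wakeup_hint & PR_SCHED_HINT_WAKE_HOLD) ? prev_cpu : cpu;

	/* Only honor the hint while the hinted LLC advertises an idle core. */
	if (!toy_test_idle_cores(target_cpu))
		return -1;

	return target_cpu;
}
```

With llc_has_idle_core[0] set and llc_has_idle_core[1] clear, a WAKE_AFFINE
wakeup from CPU 0 of a task that last ran on CPU 4 returns 0, while a
WAKE_HOLD wakeup of the same task returns -1 and falls back to the default
path.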

PR_SCHED_HINT_WAKE_AFFINE will favor an affine wakeup if the MC where
the waker is running advertises an idle core. PR_SCHED_HINT_WAKE_HOLD
will bias the wakeup to the MC domain where the task previously ran,
again only if that MC advertises an idle core.

- Results

Following are results from running hackbench with only wakeup hints on a
dual socket Zen3 system in NPS1 mode:

o Hackbench

  Test:                   tip                     no-hint             wake_affine         wake_hold
   1-groups:         4.31 (0.00 pct)         4.46 (-3.48 pct)       4.20 (2.55 pct)    4.11 (4.64 pct)
   2-groups:         4.93 (0.00 pct)         4.85 (1.62 pct)        4.74 (3.85 pct)    5.15 (-4.46 pct)
   4-groups:         5.38 (0.00 pct)         5.35 (0.55 pct)        5.04 (6.31 pct)    4.54 (15.61 pct)
   8-groups:         5.59 (0.00 pct)         5.49 (1.78 pct)        5.39 (3.57 pct)    5.71 (-2.14 pct)
  16-groups:         7.18 (0.00 pct)         7.38 (-2.78 pct)       7.24 (-0.83 pct)   7.76 (-8.07 pct)

As the table shows, PR_SCHED_HINT_WAKE_AFFINE improves hackbench in the
1- through 8-group configurations and stays within 1 pct of tip at 16
groups. PR_SCHED_HINT_WAKE_HOLD does not show a consistent trend and can
lead to unpredictable behavior in hackbench.
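
The percentages in parentheses are not defined in the text. They are
consistent with the relative change against tip, (tip - value) / tip * 100,
so a positive number means a lower (better) result than tip. A minimal
sketch under that assumption, with a hypothetical helper name:

```c
#include <assert.h>

/*
 * Relative change of 'value' against 'baseline', in percent.
 * Positive means 'value' is lower than the baseline, which is an
 * improvement for time and latency metrics.
 */
double pct_vs_baseline(double baseline, double value)
{
	assert(baseline != 0.0);
	return (baseline - value) / baseline * 100.0;
}
```

For the 1-groups row, pct_vs_baseline(4.31, 4.20) is about 2.55 and
pct_vs_baseline(4.31, 4.46) is about -3.48. The tables appear to truncate
rather than round to two decimals, so the last digit can differ by one.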

- Shortcomings

In schbench, the delay in indicating that no idle core is available in
the target MC domain leads to task pileup and severe degradation in p99
latency with PR_SCHED_HINT_WAKE_AFFINE at 16 workers and above.

o schbench

  workers:  tip                        no-hint                    wake_affine                wake_hold
        1:     37.00 (0.00 pct)           38.00 (-2.70 pct)          18.00 (51.35 pct)          32.00 (13.51 pct)
        2:     39.00 (0.00 pct)           36.00 (7.69 pct)           18.00 (53.84 pct)          36.00 (7.69 pct)
        4:     41.00 (0.00 pct)           41.00 (0.00 pct)           21.00 (48.78 pct)          33.00 (19.51 pct)
        8:     53.00 (0.00 pct)           54.00 (-1.88 pct)          31.00 (41.50 pct)          51.00 (3.77 pct)
       16:     73.00 (0.00 pct)           74.00 (-1.36 pct)        2636.00 (-3510.95 pct)       75.00 (-2.73 pct)
       32:    116.00 (0.00 pct)          124.00 (-6.89 pct)       15696.00 (-13431.03 pct)     124.00 (-6.89 pct)
       64:    217.00 (0.00 pct)          215.00 (0.92 pct)        15280.00 (-6941.47 pct)      224.00 (-3.22 pct)
      128:    477.00 (0.00 pct)          440.00 (7.75 pct)        14800.00 (-3002.72 pct)      493.00 (-3.35 pct)
      256:   1062.00 (0.00 pct)         1026.00 (3.38 pct)        15696.00 (-1377.96 pct)     1026.00 (3.38 pct)
      512:  47552.00 (0.00 pct)        47168.00 (0.80 pct)        60736.00 (-27.72 pct)      49856.00 (-4.84 pct)

Wake hold seems to still do well by reducing the larger latency samples
that we observe during task migration.

- Potential Solution

One potential solution is to atomically read the nr_busy_cpus member of
the sched_domain_shared struct, but the performance impact of doing so
in the wakeup path is yet to be evaluated.
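
As a rough illustration of that idea, the userspace sketch below models an
atomic read of a busy-CPU counter with C11 atomics. It is not kernel code.
The struct toy_sd_shared, its llc_size and smt_width fields, and the
"fewer busy CPUs than cores" test are all assumptions made for this
example; the patch does not specify how nr_busy_cpus would be used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct sched_domain_shared. */
struct toy_sd_shared {
	atomic_int nr_busy_cpus;	/* CPUs in the LLC currently busy */
	int llc_size;			/* hypothetical: CPUs in the LLC */
	int smt_width;			/* hypothetical: threads per core */
};

/*
 * Conservative check: if fewer CPUs are busy than there are cores,
 * at least one core must have no busy CPU at all (pigeonhole), so
 * the LLC is guaranteed to contain an idle core. The converse does
 * not hold: idle cores may still exist when this returns false.
 */
bool toy_llc_must_have_idle_core(struct toy_sd_shared *sds)
{
	int busy, nr_cores;

	assert(sds->smt_width > 0);

	busy = atomic_load_explicit(&sds->nr_busy_cpus, memory_order_relaxed);
	nr_cores = sds->llc_size / sds->smt_width;

	return busy < nr_cores;
}
```

With llc_size = 16 and smt_width = 2, three busy CPUs yields true and eight
busy CPUs yields false. Whether such a counter is updated promptly enough
to avoid the pileup seen in schbench is exactly the open question noted
above.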

Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
 kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efceb670e755..90e523cd8de8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -51,6 +51,8 @@
 
 #include <linux/sched/cond_resched.h>
 
+#include <uapi/linux/prctl.h>
+
 #include "sched.h"
 #include "stats.h"
 #include "autogroup.h"
@@ -7031,6 +7033,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int want_affine = 0;
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
+	bool use_hint = false;
+	unsigned int task_hint = READ_ONCE(p->hint);
+	unsigned int wakeup_hint = task_hint &
+		(PR_SCHED_HINT_WAKE_AFFINE | PR_SCHED_HINT_WAKE_HOLD);
 
 	/*
 	 * required for stable ->cpus_allowed
@@ -7046,6 +7052,37 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 			new_cpu = prev_cpu;
 		}
 
+		/*
+		 * Handle the case where a hint is set and the current CPU
+		 * and the previous CPU where task ran don't share caches.
+		 */
+		if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {
+			/*
+			 * Start by assuming the hint is PR_SCHED_HINT_WAKE_AFFINE
+			 * setting the target_cpu to the current CPU.
+			 */
+			int target_cpu = cpu;
+
+			/*
+			 * If the hint is PR_SCHED_HINT_WAKE_HOLD
+			 * change target_cpu to the prev_cpu.
+			 */
+
+			if (wakeup_hint & PR_SCHED_HINT_WAKE_HOLD)
+				target_cpu = prev_cpu;
+
+			/*
+			 * If a wakeup hint is set, try to bias the
+			 * task placement towards the preferred node
+			 * as long as there is an idle core in the
+			 * targeted LLC.
+			 */
+			if (test_idle_cores(target_cpu, false)) {
+				use_hint = true;
+				new_cpu = target_cpu;
+			}
+		}
+
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
@@ -7057,7 +7094,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		 */
 		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
-			if (cpu != prev_cpu)
+			/*
+			 * In case it is optimal to follow the hints,
+			 * do not re-evaluate the target CPU.
+			 */
+			if (cpu != prev_cpu && !use_hint)
 				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
 
 			sd = NULL; /* Prefer wake_affine over balance flags */
-- 
2.25.1
