linux-kernel - [PATCH 4/5] sched/fair: Consider hints in the initial task wakeup path

Message-ID: <20220910105326.1797-5-kprateek.nayak@amd.com>
Date:   Sat, 10 Sep 2022 16:23:25 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     <linux-kernel@...r.kernel.org>
CC:     <aubrey.li@...ux.intel.com>, <efault@....de>,
        <gautham.shenoy@....com>, <libo.chen@...cle.com>,
        <mgorman@...hsingularity.net>, <mingo@...nel.org>,
        <peterz@...radead.org>, <song.bao.hua@...ilicon.com>,
        <srikar@...ux.vnet.ibm.com>, <tglx@...utronix.de>,
        <valentin.schneider@....com>, <vincent.guittot@...aro.org>,
        <wuyun.abel@...edance.com>, <wyes.karny@....com>,
        <yu.c.chen@...el.com>, <yangyicong@...wei.com>
Subject: [PATCH 4/5] sched/fair: Consider hints in the initial task wakeup path

Fork time hints influence the initial task placement and bias it
towards or away from the CPU where the task is forked.

The flow is as follows:
- When a fork time hint is set, the NUMA biases are ignored. Only the
  sched_group statistics computed by update_sg_wakeup_stats() (number
  of idle CPUs and total utilization of the group) for the local group
  and the idlest group are considered when making the initial task
  placement decision, provided both groups have idle CPUs.
- If a bias towards the local group is hinted, choose the local group
  as long as the equivalent of an idle core is present, i.e. more than
  half of its CPUs are idle.
  Note: The current implementation assumes the system running the patch
  is SMT-2. Further optimizations can be made for systems with SMT-4,
  SMT-8, or with no SMT.
- If a spread hint is set and the local group and the idlest group tie
  on the number of idle CPUs, use the group utilization as the
  tie-breaking metric.

PR_SCHED_HINT_FORK_AFFINE enables consolidation until half of the local
group is filled. PR_SCHED_HINT_FORK_SPREAD chooses the target group
based on utilization when the number of idle CPUs is tied.
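
As a rough illustration, the decision taken in the group_has_spare case
can be modelled in userspace as below. This is a simplified sketch, not
the kernel code: struct group_stats, pick_group() and the HINT_* bit
values are hypothetical stand-ins, and the NUMA handling is left out.

```c
/* Hypothetical bit values; the real PR_SCHED_HINT_* values may differ. */
#define HINT_FORK_AFFINE	(1U << 2)
#define HINT_FORK_SPREAD	(1U << 3)

/* Hypothetical, trimmed-down stand-in for struct sg_lb_stats. */
struct group_stats {
	unsigned int idle_cpus;		/* number of idle CPUs in the group */
	unsigned int group_weight;	/* total number of CPUs in the group */
	unsigned long group_util;	/* total utilization of the group */
};

enum pick { PICK_LOCAL, PICK_IDLEST };

static enum pick pick_group(unsigned int hint,
			    const struct group_stats *local,
			    const struct group_stats *idlest)
{
	/*
	 * Affine: stay local while more than half of the local CPUs are
	 * idle (SMT-2 assumption: cores ~= group_weight / 2).
	 */
	if ((hint & HINT_FORK_AFFINE) &&
	    local->idle_cpus > local->group_weight / 2)
		return PICK_LOCAL;

	/* Otherwise prefer the group with more idle CPUs. */
	if (local->idle_cpus > idlest->idle_cpus)
		return PICK_LOCAL;

	if (local->idle_cpus == idlest->idle_cpus) {
		/* Spread: break the tie in favour of the less utilized group. */
		if ((hint & HINT_FORK_SPREAD) &&
		    local->group_util > idlest->group_util)
			return PICK_IDLEST;

		return PICK_LOCAL;
	}

	return PICK_IDLEST;
}
```

In this model, an 8-CPU local group with 5 idle CPUs keeps an affine
task local even when the idlest group is fully idle, whereas with only
4 idle CPUs the task goes to the idlest group.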

Each fork time hint can be set on its own or in combination with wakeup
hints.
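
Since fork time hints and wakeup hints share a single hint word, the
fork time bits have to be masked out before use. A minimal sketch of
that, assuming a hypothetical bit layout (the HINT_* names and values
and both helpers are illustrative only and not part of the patch):

```c
#include <stdbool.h>

/* Hypothetical bit layout; the real PR_SCHED_HINT_* values may differ. */
#define HINT_WAKE_AFFINE	(1U << 0)
#define HINT_WAKE_HOLD		(1U << 1)
#define HINT_FORK_AFFINE	(1U << 2)
#define HINT_FORK_SPREAD	(1U << 3)
#define HINT_FORK_MASK		(HINT_FORK_AFFINE | HINT_FORK_SPREAD)

/* Extract only the fork time bits from the combined hint word. */
static unsigned int fork_hint_of(unsigned int task_hint)
{
	return task_hint & HINT_FORK_MASK;
}

/* The NUMA bias is applied only when no fork time hint is set. */
static bool numa_bias_applies(bool sd_is_numa, unsigned int task_hint)
{
	return sd_is_numa && !fork_hint_of(task_hint);
}
```

With this layout, a task carrying only a wakeup hint still gets the
NUMA bias at fork, while adding either fork time bit switches it off.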

- Results

The following are results from using individual fork time hints, and
combinations of fork time hints and wakeup hints, on various benchmarks
on a dual socket Zen3 system:

o Only fork time hint:

- Hackbench

Test:                   tip                     no-hint              fork_affine             fork_spread
 1-groups:         4.31 (0.00 pct)         4.46 (-3.48 pct)        4.27 (0.92 pct)         4.28 (0.69 pct)
 2-groups:         4.93 (0.00 pct)         4.85 (1.62 pct)         4.91 (0.40 pct)         5.15 (-4.46 pct)
 4-groups:         5.38 (0.00 pct)         5.35 (0.55 pct)         5.36 (0.37 pct)         5.31 (1.30 pct)
 8-groups:         5.59 (0.00 pct)         5.49 (1.78 pct)         5.51 (1.43 pct)         5.51 (1.43 pct)
16-groups:         7.18 (0.00 pct)         7.38 (-2.78 pct)        7.31 (-1.81 pct)        7.25 (-0.97 pct)

- schbench

 workers:     tip                     no-hint                fork_affine
  1:      37.00 (0.00 pct)        38.00 (-2.70 pct)       17.00 (54.05 pct)
  2:      39.00 (0.00 pct)        36.00 (7.69 pct)        21.00 (46.15 pct)
  4:      41.00 (0.00 pct)        41.00 (0.00 pct)        28.00 (31.70 pct)
  8:      53.00 (0.00 pct)        54.00 (-1.88 pct)       39.00 (26.41 pct)
 16:      73.00 (0.00 pct)        74.00 (-1.36 pct)       68.00 (6.84 pct)
 32:     116.00 (0.00 pct)       124.00 (-6.89 pct)      113.00 (2.58 pct)
 64:     217.00 (0.00 pct)       215.00 (0.92 pct)       205.00 (5.52 pct)
128:     477.00 (0.00 pct)       440.00 (7.75 pct)       445.00 (6.70 pct)
256:     1062.00 (0.00 pct)      1026.00 (3.38 pct)      1007.00 (5.17 pct)
512:     47552.00 (0.00 pct)     47168.00 (0.80 pct)     47296.00 (0.53 pct)

- tbench

Clients:      tip                    no-hint               fork_affine              fork_spread
    1    573.26 (0.00 pct)       572.29 (-0.16 pct)      572.70 (-0.09 pct)      569.64 (-0.63 pct)
    2    1131.19 (0.00 pct)      1119.57 (-1.02 pct)     1131.97 (0.06 pct)      1101.03 (-2.66 pct)
    4    2100.07 (0.00 pct)      2070.66 (-1.40 pct)     2094.80 (-0.25 pct)     2011.64 (-4.21 pct)
    8    3809.88 (0.00 pct)      3784.16 (-0.67 pct)     3458.94 (-9.21 pct)     3867.70 (1.51 pct)
   16    6560.72 (0.00 pct)      6449.64 (-1.69 pct)     6342.78 (-3.32 pct)     6700.50 (2.13 pct)
   32    12203.23 (0.00 pct)     12180.02 (-0.19 pct)    10411.44 (-14.68 pct)   13104.29 (7.38 pct)
   64    22389.81 (0.00 pct)     23084.51 (3.10 pct)     16614.14 (-25.79 pct)   24353.76 (8.77 pct)
  128    32449.37 (0.00 pct)     33561.28 (3.42 pct)     19971.67 (-38.45 pct)   36201.16 (11.56 pct)
  256    58962.40 (0.00 pct)     59118.43 (0.26 pct)     26836.13 (-54.48 pct)   61721.06 (4.67 pct)
  512    59608.71 (0.00 pct)     60246.78 (1.07 pct)     36889.55 (-38.11 pct)   59696.57 (0.14 pct)
 1024    58037.02 (0.00 pct)     58532.41 (0.85 pct)     39936.06 (-31.18 pct)   57445.62 (-1.01 pct)

All these benchmarks show noticeable improvements from a slightly
different initial placement alone. A placement in line with the
benchmark's behavior improves its results.

o Combination of hints

- Hackbench

Test:                   tip                     no-hint       fork_affine + wake_affine   fork_spread + wake_hold
 1-groups:         4.31 (0.00 pct)         4.46 (-3.48 pct)       4.20 (2.55 pct)          4.81 (-11.60 pct)
 2-groups:         4.93 (0.00 pct)         4.85 (1.62 pct)        4.74 (3.85 pct)          5.09 (-3.24 pct)
 4-groups:         5.38 (0.00 pct)         5.35 (0.55 pct)        5.01 (6.87 pct)          5.62 (-4.46 pct)
 8-groups:         5.59 (0.00 pct)         5.49 (1.78 pct)        5.38 (3.75 pct)          5.69 (-1.78 pct)
16-groups:         7.18 (0.00 pct)         7.38 (-2.78 pct)       7.25 (-0.97 pct)         7.97 (-11.00 pct)

Hackbench improves further when the correct wakeup hint is paired with
the correct fork time hint. The regression is equally severe when the
wrong hints are set.

Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
 kernel/sched/fair.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 90e523cd8de8..4c61bd0e93b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9262,6 +9262,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	struct sg_lb_stats local_sgs, tmp_sgs;
 	struct sg_lb_stats *sgs;
 	unsigned long imbalance;
+	unsigned int task_hint, fork_hint;
 	struct sg_lb_stats idlest_sgs = {
 			.avg_load = UINT_MAX,
 			.group_type = group_overloaded,
@@ -9365,8 +9366,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		break;
 
 	case group_has_spare:
+		task_hint = READ_ONCE(p->hint);
+		fork_hint = task_hint &
+			(PR_SCHED_HINT_FORK_SPREAD | PR_SCHED_HINT_FORK_AFFINE);
 #ifdef CONFIG_NUMA
-		if (sd->flags & SD_NUMA) {
+		/*
+		 * If a hint is set, override any NUMA preference behavior.
+		 */
+		if ((sd->flags & SD_NUMA) && !fork_hint) {
 			int imb_numa_nr = sd->imb_numa_nr;
 #ifdef CONFIG_NUMA_BALANCING
 			int idlest_cpu;
@@ -9406,14 +9413,37 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		}
 #endif /* CONFIG_NUMA */
 
+		/*
+		 * FIXME: Currently the system is assumed to be SMT-2
+		 * and that the number of cores in a group can be
+		 * estimated by halving the group_weight. Determine a
+		 * more generic logic for other SMT possibilities or
+		 * derive it at runtime from the topology.
+		 */
+		if ((task_hint & PR_SCHED_HINT_FORK_AFFINE) &&
+		    local_sgs.idle_cpus > local->group_weight / 2)
+			return NULL;
 		/*
 		 * Select group with highest number of idle CPUs. We could also
 		 * compare the utilization which is more stable but it can end
 		 * up that the group has less spare capacity but finally more
 		 * idle CPUs which means more opportunity to run task.
 		 */
-		if (local_sgs.idle_cpus >= idlest_sgs.idle_cpus)
+		if (local_sgs.idle_cpus > idlest_sgs.idle_cpus)
+			return NULL;
+
+		if (local_sgs.idle_cpus == idlest_sgs.idle_cpus) {
+			/*
+			 * In case of a tie between number of idle CPUs and if
+			 * the task hints a benefit from spreading, go with the
+			 * group with the lesser utilization.
+			 */
+			if ((task_hint & PR_SCHED_HINT_FORK_SPREAD) &&
+			    local_sgs.group_util > idlest_sgs.group_util)
+				return idlest;
+
 			return NULL;
+		}
 		break;
 	}
 
-- 
2.25.1