Message-ID: <85b710a9-5b26-b0df-8c21-c2768a21e182@amd.com>
Date:   Fri, 27 Oct 2023 08:57:00 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Aaron Lu <aaron.lu@...el.com>, Chen Yu <yu.c.chen@...el.com>,
        Tim Chen <tim.c.chen@...el.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>, x86@...nel.org
Subject: Re: [RFC PATCH v2 0/2] sched/fair migration reduction features

Hello Mathieu,

On 10/19/2023 9:35 PM, Mathieu Desnoyers wrote:
> Hi,
> 
> This series introduces two new scheduler features: UTIL_FITS_CAPACITY
> and SELECT_BIAS_PREV. When used together, they achieve a 41% speedup of
> a hackbench workload which leaves some idle CPU time on a 192-core AMD
> EPYC.
> 
> The main metrics which are significantly improved are:
> 
> - cpu-migrations are reduced by 80%,
> - CPU utilization is increased by 17%.
> 
> Feedback is welcome. I am especially interested to learn whether this
> series has positive or detrimental effects on performance of other
> workloads.

I got a chance to test this series on a dual socket 3rd Generation EPYC
System (2 x 64C/128T). Following is a quick summary:

- stream and ycsb-mongodb don't see any changes.

- hackbench and DeathStarBench see a major improvement. Both are high
  utilization workloads with CPUs being overloaded most of the time.
  DeathStarBench is known to benefit from lower migration count. It was
  discussed by Gautham at OSPM '23.

- tbench, netperf, and schbench regress. The former two regress when
  the system is near fully loaded, and the latter in most cases. All
  these benchmarks are client-server / messenger-worker oriented and
  are known to perform better when the client-server / messenger-worker
  pairs are on the same CCX (LLC domain).

Detailed results are as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip:	tip:sched/core at commit 984ffb6a4366 ("sched/fair: Remove
	SIS_PROP")

- wake_prev_bias: tip + this series + Peter's suggestion to optimize
		  sched_util_fits_capacity_active()

I've taken the liberty of resolving the conflict with the recently
added cluster wakeup optimization by prioritizing the "SELECT_BIAS_PREV"
feature. select_idle_sibling() now looks as follows:

	select_idle_sibling(...)
	{

		...

		/*
		 * With the SELECT_BIAS_PREV feature, if the previous CPU is
		 * cache affine, prefer the previous CPU when all CPUs are busy
		 * to inhibit migration.
		 */
		if (sched_feat(SELECT_BIAS_PREV) &&
		    prev != target && cpus_share_cache(prev, target))
			return prev;

		/*
		 * For cluster machines which have lower sharing cache like L2 or
		 * LLC Tag, we tend to find an idle CPU in the target's cluster
		 * first. But prev_cpu or recent_used_cpu may also be a good candidate,
		 * use them if possible when no idle CPU found in select_idle_cpu().
		 */
		if ((unsigned int)prev_aff < nr_cpumask_bits)
			return prev_aff;
		if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
			return recent_used_cpu;

		return target;
	}

Please let me know if you have a different ordering in mind.
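For clarity, the ordering above can be modeled as a small userspace C
sketch (the llc_id[] topology, NR_CPUS, and the sentinel value are
illustrative assumptions standing in for the kernel's cpumask handling,
not actual kernel code):

```c
#include <stdbool.h>

#define NR_CPUS 8
#define UNSET   NR_CPUS  /* stands in for ">= nr_cpumask_bits" */

/* Stub topology: CPUs share cache iff they are in the same LLC domain. */
static int llc_id[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

static bool cpus_share_cache(int a, int b)
{
	return llc_id[a] == llc_id[b];
}

/*
 * Mirror of the fallback ordering shown above: with SELECT_BIAS_PREV,
 * a cache-affine prev CPU wins over the cluster candidates (prev_aff,
 * recent_used_cpu); otherwise those are tried before falling back to
 * target.
 */
static int pick_fallback(bool select_bias_prev, int prev, int target,
			 int prev_aff, int recent_used_cpu)
{
	if (select_bias_prev && prev != target &&
	    cpus_share_cache(prev, target))
		return prev;

	if (prev_aff < UNSET)
		return prev_aff;
	if (recent_used_cpu < UNSET)
		return recent_used_cpu;

	return target;
}
```

So with the ordering as resolved here, a busy but cache-affine prev CPU
short-circuits the cluster candidates entirely; flipping the two blocks
would instead prefer an idle CPU in the target's cluster.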

o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.88)     0.97 [  2.88]( 1.78)
 2-groups     1.00 [ -0.00]( 2.03)     0.91 [  8.79]( 1.19)
 4-groups     1.00 [ -0.00]( 1.42)     0.87 [ 13.07]( 1.77)
 8-groups     1.00 [ -0.00]( 1.37)     0.86 [ 13.70]( 0.98)
16-groups     1.00 [ -0.00]( 2.54)     0.90 [  9.74]( 1.65)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
    1     1.00 [  0.00]( 0.63)     0.99 [ -0.53]( 0.97)
    2     1.00 [  0.00]( 0.89)     1.00 [  0.21]( 0.99)
    4     1.00 [  0.00]( 1.34)     1.01 [  0.70]( 0.88)
    8     1.00 [  0.00]( 0.49)     1.00 [  0.40]( 0.55)
   16     1.00 [  0.00]( 1.51)     0.99 [ -0.51]( 1.23)
   32     1.00 [  0.00]( 0.74)     0.97 [ -2.57]( 0.59)
   64     1.00 [  0.00]( 0.92)     0.95 [ -4.69]( 0.70)
  128     1.00 [  0.00]( 0.97)     0.91 [ -8.58]( 0.29)
  256     1.00 [  0.00]( 1.14)     0.90 [ -9.86]( 2.40)
  512     1.00 [  0.00]( 0.35)     0.97 [ -2.91]( 1.78)
 1024     1.00 [  0.00]( 0.07)     0.96 [ -4.15]( 1.43)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 Copy     1.00 [  0.00]( 8.25)     1.04 [  3.53](10.84)
Scale     1.00 [  0.00]( 5.65)     0.99 [ -0.85]( 5.94)
  Add     1.00 [  0.00]( 5.73)     1.00 [  0.50]( 7.68)
Triad     1.00 [  0.00]( 3.41)     1.00 [  0.12]( 6.25)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 Copy     1.00 [  0.00]( 1.75)     1.01 [  1.18]( 1.61)
Scale     1.00 [  0.00]( 0.92)     1.00 [ -0.14]( 1.37)
  Add     1.00 [  0.00]( 0.32)     0.99 [ -0.54]( 1.34)
Triad     1.00 [  0.00]( 5.97)     1.00 [  0.37]( 6.34)


==================================================================
Test          : netperf
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.67)     1.00 [  0.08]( 0.15)
 2-clients     1.00 [  0.00]( 0.15)     1.00 [  0.10]( 0.57)
 4-clients     1.00 [  0.00]( 0.58)     1.00 [  0.10]( 0.74)
 8-clients     1.00 [  0.00]( 0.46)     1.00 [  0.31]( 0.64)
16-clients     1.00 [  0.00]( 0.84)     0.99 [ -0.56]( 1.78)
32-clients     1.00 [  0.00]( 1.07)     1.00 [  0.04]( 1.40)
64-clients     1.00 [  0.00]( 1.53)     1.01 [  0.68]( 2.27)
128-clients    1.00 [  0.00]( 1.17)     0.99 [ -0.70]( 1.17)
256-clients    1.00 [  0.00]( 5.42)     0.91 [ -9.31](10.74)
512-clients    1.00 [  0.00](48.07)     1.00 [ -0.07](47.71)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: tip[pct imp](CV)    wake_prev_bias[pct imp](CV)
  1     1.00 [ -0.00](12.00)     1.06 [ -5.56]( 2.99)
  2     1.00 [ -0.00]( 6.96)     1.08 [ -7.69]( 2.38)
  4     1.00 [ -0.00](13.57)     1.07 [ -7.32](12.95)
  8     1.00 [ -0.00]( 6.45)     0.98 [  2.08](10.86)
 16     1.00 [ -0.00]( 3.45)     1.02 [ -1.72]( 1.69)
 32     1.00 [ -0.00]( 3.00)     1.05 [ -5.00](10.92)
 64     1.00 [ -0.00]( 2.18)     1.04 [ -4.17]( 1.15)
128     1.00 [ -0.00]( 7.15)     1.07 [ -6.65]( 8.45)
256     1.00 [ -0.00](30.20)     1.72 [-72.03](30.62)
512     1.00 [ -0.00]( 4.90)     0.97 [  3.25]( 1.92) 


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric      tip     wake_prev_bias(%diff)
throughput  1.00    0.99 (%diff: -0.94%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning   scaling   tip     wake_prev_bias(%diff)
1CCD        1       1.00    1.10 (%diff: 10.04%)
2CCD        2       1.00    1.06 (%diff: 5.90%)
4CCD        4       1.00    1.04 (%diff: 3.74%)
8CCD        8       1.00    1.03 (%diff: 2.98%)

--
It is a mixed bag of results, as expected. I would love to hear your
thoughts on the results. Meanwhile, I'll try to get some more data
from other benchmarks.

> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek
