linux-kernel - Re: [tip: sched/core] sched/fair: Multi-LLC select_idle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZK7groqlP3S7r8vt@chenyu5-mobl2>
Date:   Thu, 13 Jul 2023 01:19:42 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     K Prateek Nayak <kprateek.nayak@....com>,
        <linux-kernel@...r.kernel.org>,
        <linux-tip-commits@...r.kernel.org>, Tejun Heo <tj@...nel.org>,
        Tim Chen <tim.c.chen@...el.com>, <x86@...nel.org>,
        Gautham Shenoy <gautham.shenoy@....com>
Subject: Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

On 2023-07-08 at 21:17:10 +0800, Chen Yu wrote:
> On 2023-07-05 at 13:57:02 +0200, Peter Zijlstra wrote:
> > On Fri, Jun 16, 2023 at 12:04:48PM +0530, K Prateek Nayak wrote:
> > 
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -596,7 +596,7 @@ static inline int x86_sched_itmt_flags(v
> >  #ifdef CONFIG_SCHED_MC
> >  static int x86_core_flags(void)
> >  {
> > -	return cpu_core_flags() | x86_sched_itmt_flags();
> > +	return cpu_core_flags() | x86_sched_itmt_flags() | SD_IDLE_SIBLING;
> >  }
> I guess this flag might need to be added into the valid mask:
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..4a563e9f7b10 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1540,6 +1540,7 @@ static struct cpumask		***sched_domains_numa_masks;
>  #define TOPOLOGY_SD_FLAGS		\
>  	(SD_SHARE_CPUCAPACITY	|	\
>  	 SD_SHARE_PKG_RESOURCES |	\
> +	 SD_IDLE_SIBLING	|	\
>  	 SD_NUMA		|	\
>  	 SD_ASYM_PACKING)
> >  #endif
> >  #ifdef CONFIG_SCHED_SMT
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -161,3 +161,10 @@ SD_FLAG(SD_OVERLAP, SDF_SHARED_PARENT |
> >   * NEEDS_GROUPS: No point in preserving domain if it has a single group.
> >   */
> >  SD_FLAG(SD_NUMA, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
> > +
> > +/*
> > + * Search for idle CPUs in sibling groups
> > + *
> > + * NEEDS_GROUPS: Load balancing flag.
> > + */
> > +SD_FLAG(SD_IDLE_SIBLING, SDF_NEEDS_GROUPS)
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7046,6 +7046,38 @@ static int select_idle_cpu(struct task_s
> >  }
> >  
> >  /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +	struct sched_domain *parent = sd->parent;
> > +	struct sched_group *sg;
> > +
> > +	/* Make sure to not cross nodes. */
> > +	if (!parent || parent->flags & SD_NUMA)
> > +		return -1;
> > +
> > +	sg = parent->groups;
> > +	do {
> > +		int cpu = cpumask_first(sched_group_span(sg));
> > +		struct sched_domain *sd_child = per_cpu(sd_llc, cpu);
> >
> I wonder if we can use rcu_dereference() in case the cpu hotplug
> changes the content sd_llc points to. (I'm still thinking of the
> symptom you described here:)
> https://lore.kernel.org/lkml/20230605190746.GX83892@hirez.programming.kicks-ass.net/
> 
> I'll launch some tests with this version on Sapphire Rapids(and with/without LLC-split hack patch).

Tested on Sapphire Rapids, which has 2 x 56C/112T and 224 CPUs in total. C-states
deeper than C1E are disabled. Turbo is disabled. CPU frequency governor is performance.

The baseline is v6.4-rc1 tip:sched/core, on top of
commit 637c9509f3db ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")

patch0: this SD_IDLE_SIBLING patch with above change to TOPOLOGY_SD_FLAGS
patch1: hack patch to split 1 LLC domain into 4 smaller LLC domains(with some fixes on top of
        https://lore.kernel.org/lkml/ZJKjvx%2FNxooM5z1Y@chenyu5-mobl2.ccr.corp.intel.com/)
        The test data in above link is invalid due to bugs in the hack patch, fixed in this version)


Baseline vs Baseline+patch0:
There is no much difference between the two, and it is expected because Sapphire Rapids
does not have multiple LLC domains within 1 Numa node(also consider the run to run variation):

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  2.66)	+13.84 ( 12.80)
process-pipe    	2-groups	 1.00 (  3.67)	 -8.37 (  2.33)
process-pipe    	4-groups	 1.00 (  6.45)	 +4.17 (  6.36)
process-pipe    	8-groups	 1.00 (  1.69)	 +2.28 (  1.72)
process-sockets 	1-groups	 1.00 (  1.73)	 +0.61 (  0.69)
process-sockets 	2-groups	 1.00 (  2.68)	 -2.20 (  0.55)
process-sockets 	4-groups	 1.00 (  0.03)	 -0.34 (  0.17)
process-sockets 	8-groups	 1.00 (  0.09)	 -0.28 (  0.09)
threads-pipe    	1-groups	 1.00 (  2.42)	 +6.95 (  3.86)
threads-pipe    	2-groups	 1.00 (  2.26)	 +2.68 (  6.56)
threads-pipe    	4-groups	 1.00 (  5.08)	 +3.57 (  4.61)
threads-pipe    	8-groups	 1.00 (  7.89)	 -2.52 (  3.45)
threads-sockets 	1-groups	 1.00 (  1.15)	 +0.87 (  3.13)
threads-sockets 	2-groups	 1.00 (  0.63)	 -0.02 (  1.27)
threads-sockets 	4-groups	 1.00 (  0.27)	 +0.29 (  0.17)
threads-sockets 	8-groups	 1.00 (  0.07)	 -0.42 (  0.40)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	56-threads	 1.00 (  2.56)	 -0.25 (  3.27)
TCP_RR          	112-threads	 1.00 (  2.26)	 +0.04 (  2.18)
TCP_RR          	168-threads	 1.00 (  0.81)	 +0.01 (  0.74)
TCP_RR          	224-threads	 1.00 (  0.65)	 +0.04 (  0.66)
TCP_RR          	280-threads	 1.00 ( 64.56)	+69.47 ( 56.78)
TCP_RR          	336-threads	 1.00 ( 20.39)	 +0.08 ( 19.58)
TCP_RR          	392-threads	 1.00 ( 31.63)	 +0.17 ( 31.08)
TCP_RR          	448-threads	 1.00 ( 39.72)	 -0.14 ( 39.14)
UDP_RR          	56-threads	 1.00 (  8.94)	 -0.71 ( 12.03)
UDP_RR          	112-threads	 1.00 ( 18.72)	 +0.78 ( 16.71)
UDP_RR          	168-threads	 1.00 ( 11.39)	 -0.18 (  8.34)
UDP_RR          	224-threads	 1.00 (  9.02)	 +0.81 ( 11.47)
UDP_RR          	280-threads	 1.00 ( 15.87)	 -0.12 ( 12.87)
UDP_RR          	336-threads	 1.00 ( 39.89)	 +2.25 ( 32.35)
UDP_RR          	392-threads	 1.00 ( 28.17)	 +3.47 ( 25.99)
UDP_RR          	448-threads	 1.00 ( 58.68)	 +0.35 ( 56.16)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	56-threads	 1.00 (  0.94)	 +0.24 (  0.69)
loopback        	112-threads	 1.00 (  0.19)	 +0.18 (  0.25)
loopback        	168-threads	 1.00 ( 52.17)	 -1.42 ( 50.95)
loopback        	224-threads	 1.00 (  0.86)	 -0.38 (  0.19)
loopback        	280-threads	 1.00 (  0.12)	 -0.28 (  0.17)
loopback        	336-threads	 1.00 (  0.10)	 -0.33 (  0.19)
loopback        	392-threads	 1.00 (  0.27)	 -0.49 (  0.26)
loopback        	448-threads	 1.00 (  0.06)	 -0.88 (  0.59)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.72)	 -1.47 (  0.41)
normal          	2-mthreads	 1.00 (  1.66)	 +1.18 (  2.63)
normal          	4-mthreads	 1.00 (  1.12)	 +1.20 (  4.52)
normal          	8-mthreads	 1.00 ( 11.03)	 -3.87 (  5.14)


Baseline+patch1    vs    Baseline+patch0+patch1:

With multiple LLC domains in 1 Numa node, SD_IDLE_SIBLING brings improvement
to hackbench/schbench, while brings downgrading to netperf/tbench. This is aligned
with what was observed previously, if the waker and wakee wakes up each other
frequently, they would like to be put together for cache locality. While for
other tasks do not have shared resource, always choosing an idle CPU is better.
Maybe in the future we can look back at SIS_SHORT and terminates scan in
select_idle_node() if the waker and wakee have close relationship with
each other.


hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  0.25)	+31.65 (  6.77)
process-pipe    	2-groups	 1.00 (  0.28)	+29.50 (  5.35)
process-pipe    	4-groups	 1.00 (  0.08)	+16.77 (  1.30)
process-pipe    	8-groups	 1.00 (  0.20)	 -5.18 (  0.04)
process-sockets 	1-groups	 1.00 (  0.23)	+13.68 (  1.28)
process-sockets 	2-groups	 1.00 (  0.16)	+11.18 (  1.87)
process-sockets 	4-groups	 1.00 (  0.23)	 -0.06 (  0.21)
process-sockets 	8-groups	 1.00 (  0.36)	 +2.34 (  0.15)
threads-pipe    	1-groups	 1.00 (  5.23)	+16.38 ( 12.10)
threads-pipe    	2-groups	 1.00 (  1.63)	+28.52 (  5.17)
threads-pipe    	4-groups	 1.00 (  0.77)	+23.28 (  2.42)
threads-pipe    	8-groups	 1.00 (  2.27)	 +2.35 (  5.75)
threads-sockets 	1-groups	 1.00 (  2.31)	 +0.42 (  1.68)
threads-sockets 	2-groups	 1.00 (  0.56)	 +3.98 (  0.65)
threads-sockets 	4-groups	 1.00 (  0.12)	 +0.29 (  0.32)
threads-sockets 	8-groups	 1.00 (  0.86)	 +1.92 (  0.27)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	56-threads	 1.00 ( 12.46)	 -1.62 ( 12.14)
TCP_RR          	112-threads	 1.00 (  1.34)	 -0.16 (  1.42)
TCP_RR          	168-threads	 1.00 (  6.26)	 -0.88 (  6.08)
TCP_RR          	224-threads	 1.00 (  2.19)	-90.18 (  6.12)
TCP_RR          	280-threads	 1.00 ( 12.27)	-63.81 ( 74.25)
TCP_RR          	336-threads	 1.00 ( 29.28)	 -6.21 ( 18.48)
TCP_RR          	392-threads	 1.00 ( 39.39)	 -3.87 ( 26.63)
TCP_RR          	448-threads	 1.00 ( 47.45)	 -2.34 ( 32.37)
UDP_RR          	56-threads	 1.00 (  3.28)	 -0.31 (  2.81)
UDP_RR          	112-threads	 1.00 (  7.03)	 +0.55 (  7.03)
UDP_RR          	168-threads	 1.00 ( 17.42)	 -0.51 ( 15.63)
UDP_RR          	224-threads	 1.00 ( 20.79)	-68.28 ( 14.32)
UDP_RR          	280-threads	 1.00 ( 26.23)	-68.58 ( 18.60)
UDP_RR          	336-threads	 1.00 ( 38.99)	 -0.55 ( 21.19)
UDP_RR          	392-threads	 1.00 ( 44.22)	 -1.91 ( 27.44)
UDP_RR          	448-threads	 1.00 ( 55.11)	 -2.74 ( 38.55)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	56-threads	 1.00 (  2.69)	 -2.30 (  2.69)
loopback        	112-threads	 1.00 (  1.92)	 +0.62 (  1.46)
loopback        	168-threads	 1.00 (  0.97)	-67.69 (  0.06)
loopback        	224-threads	 1.00 (  0.24)	 -6.79 (  8.81)
loopback        	280-threads	 1.00 (  0.10)	 +0.47 (  0.62)
loopback        	336-threads	 1.00 (  0.85)	 -0.05 (  0.05)
loopback        	392-threads	 1.00 (  0.62)	 +0.77 (  0.50)
loopback        	448-threads	 1.00 (  0.36)	 +0.77 (  0.77)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.82)	 +1.44 (  1.24)
normal          	2-mthreads	 1.00 (  2.13)	 +1.16 (  0.41)
normal          	4-mthreads	 1.00 (  3.82)	 -0.30 (  1.48)
normal          	8-mthreads	 1.00 (  4.80)	+22.43 ( 13.03)

But since the multiple LLC is just a simulation on Intel platform for now,
the patch is ok and:

Tested-by: Chen Yu <yu.c.chen@...el.com>

thanks,
Chenyu