linux-kernel - Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YbddCcGJUpcPc8nS@BLR-5CG11610CF.amd.com>
Date:   Mon, 13 Dec 2021 20:17:37 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        Barry Song <song.bao.hua@...ilicon.com>,
        Mike Galbraith <efault@....de>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when
 SD_NUMA spans multiple LLCs

Hello Mel,

On Mon, Dec 13, 2021 at 01:01:31PM +0000, Mel Gorman wrote:
> On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > > On a Zen3 machine running STREAM parallelised with OMP to have on instance
> > > per LLC the results and without binding, the results are
> > > 
> > >                             5.16.0-rc1             5.16.0-rc1
> > >                                vanilla       sched-numaimb-v4
> > > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
> > 
> > 
> > Could you please share the size of the stream array ? These numbers
> > are higher than what I am observing.
> > 
> 
> 512MB

Thanks, I will try with this one.

> 
> > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >  	}
> > >  }
> > >  
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -				int dst_running, int dst_weight)
> > > +				int dst_running, int dst_weight,
> > > +				int imb_numa_nr)
> > >  {
> > >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> > >  		return imbalance;
> > >
> > 
> > if (4 * dst_running >= dst_weight) we return imbalance here. The
> > dst_weight here corresponds to the span of the domain, while
> > dst_running is the nr_running in busiest.
> > 
> 
> Yes, once dst_running is high enough, no imbalance is allowed. In
> previous versions I changed this but that was a mistake and in this
> version, the threshold where imbalance is not allowed remains the same.
> 
> > On Zen3, at the top most NUMA domain, the dst_weight = 256 across in
> > all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are
> > two groups, where each group is a socket. So, unless there are at
> > least 64 tasks running in one of the sockets, we would not return
> > imbalance here and go to the next step.
> > 
> 
> Yes
> 
> > 
> > > -	/*
> > > -	 * Allow a small imbalance based on a simple pair of communicating
> > > -	 * tasks that remain local when the destination is lightly loaded.
> > > -	 */
> > > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +	if (imbalance <= imb_numa_nr)
> > 
> > imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains
> > don't have PREFER_SIBLING, we would be balancing the number of idle
> > CPUs. We will end up doing the imbalance, as long as the difference
> > between the idle CPUs is at least 8.
> > 
> > In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> > will not rebalance unless the difference between the idle CPUs is 16.
> > 
> > In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> > threshold is now bumped up to 32.
> > 
> > >  		return 0;
> > 
> > 
> > 
> > >  
> > >  	return imbalance;
> > > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  		/* Consider allowing a small imbalance between NUMA groups */
> > >  		if (env->sd->flags & SD_NUMA) {
> > >  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -				busiest->sum_nr_running, env->sd->span_weight);
> > > +				busiest->sum_nr_running, env->sd->span_weight,
> > > +				env->sd->imb_numa_nr);
> > >  		}
> > >  
> > >  		return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..bacec575ade2 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >  		}
> > >  	}
> > >  
> > > +	/*
> > > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +	 * imbalanced.
> > > +	 */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		unsigned int imb = 0;
> > > +		unsigned int imb_span = 1;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				struct sched_domain *top = sd;
> > 
> > 
> > We don't seem to be using top anywhere where sd may not be used since
> > we already have variables imb and imb_span to record the
> > top->imb_numa_nr and top->span_weight.
> > 
> 
> Top could have been removed but we might still need it.
> 
> > 
> > > +				unsigned int llc_sq;
> > > +
> > > +				/*
> > > +				 * nr_llcs = (top->span_weight / llc_weight);
> > > +				 * imb = (child_weight / nr_llcs) >> 2
> > 
> > child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2.
> > 
> 
> That is be clearer.
> 
> > > +				 *
> > > +				 * is equivalent to
> > > +				 *
> > > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > > +				 *
> > > +				 */
> > > +				llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > > +				imb_span = sd->span_weight;
> > 
> > On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
> >    with NPS=1
> >       top = DIE.
> >       top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
> > 
> >    with NPS=2
> >       top = NODE.
> >       top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
> > 
> >    with NPS=4      
> >       top = NODE.
> >       top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
> > 
> > On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
> >    with NPS=1
> >       top = DIE.
> >       top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
> > 
> >    with NPS=2
> >       top = NODE.
> >       top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
> > 
> >    with NPS=4      
> >       top = NODE.
> >       top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
> > 
> > 
> > > +
> > > +				sd->imb_numa_nr = imb;
> > > +			} else {
> > > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > > +			}
> > 
> > On Zen3,
> >    with NPS=1
> >         sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
> > 
> >    with NPS=2
> >         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> > 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
> > 
> >    with NPS=4
> >         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> > 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
> > 
> > 
> > For Zen2, since the imb_span and imb values are the same as the
> > corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> > same as well since the corresponding sd->span_weight is the same.
> > 
> > If we look at the highest NUMA domain, there are two groups in all the
> > NPS configurations. There are the same number of LLCs in each of these
> > groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> > on Zen2) . However, the imb_numa_nr at this domain varies with the NPS
> > value, since we compute the imb_numa_nr value relative to the number
> > of "top" domains that can be fit within this NUMA domain. This is
> > because the size of the "top" domain varies with the NPS value. This
> > shows up in the benchmark results.
> > 
> 
> This was intentional to have some scaling but based on your results, the
> scaling might be at the wrong level.

Ok. 


> 
> > 
> > 
> > The numbers with stream, tbench and YCSB +
> > Mongodb are as follows:
> > 
> > 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Stream with 16 threads.
> > built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> > Zen3, 64C128T per socket, 2 sockets,
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > NPS=1
> > Test:     tip/sched/core                 mel-v3                    mel-v4
> >  Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
> > Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
> >   Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct     260273.88 (109.04 pct)
> > Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct     252615.62 (105.56 pct)
> > 
> > 
> > NPS=2
> > Test:     tip/sched/core                 mel-v3                     mel-v4
> >  Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
> > Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
> >   Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
> > Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)
> > 
> > NPS=4
> > Test:     tip/sched/core                 mel-v3                     mel-v4
> >  Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
> > Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
> >   Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
> > Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)
> > 
> 
> At minimum, v3 is a failure because a single pair of communicating tasks
> were getting split across NUMA domains and the allowed numa imbalance
> gets cut off too early because of the change to allow_numa_imbalance.
> So while it's a valid comparison, it's definitely not the fix.

v3 is definitely not a fix. I wasn't hinting at that. It was just to
point out the opportunity that we have.


> 
> Given how you describe NPS, maybe the scaling should only start at the
> point where tasks are no longer balanced between sibling domains. Can
> you try this? I've only boot tested it at this point. It should work for
> STREAM at least but probably not great for tbench.

Thanks for the patch. I will queue this one for tonight.


> 
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index bacec575ade2..1fa3e977521d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  
>  			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
>  			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> -				struct sched_domain *top = sd;
> +				struct sched_domain *top, *top_p;
>  				unsigned int llc_sq;
>  
>  				/*
> -				 * nr_llcs = (top->span_weight / llc_weight);
> -				 * imb = (child_weight / nr_llcs) >> 2
> +				 * nr_llcs = (sd->span_weight / llc_weight);
> +				 * imb = (llc_weight / nr_llcs) >> 2
>  				 *
>  				 * is equivalent to
>  				 *
> -				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 * imb = (llc_weight^2 / sd->span_weight) >> 2
>  				 *
>  				 */
>  				llc_sq = child->span_weight * child->span_weight;
>  
> -				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> -				imb_span = sd->span_weight;
> -
> +				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
>  				sd->imb_numa_nr = imb;
> +
> +				/*
> +				 * Set span based on top domain that places
> +				 * tasks in sibling domains.
> +				 */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
>  			} else {
> -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
>  			}
>  		}
>  	}

--
Thanks and Regards
gautham.