linux-kernel - Re: [PATCH 1/2] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on

Date:   Tue, 13 Feb 2018 11:45:41 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Mike Galbraith <efault@....de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/fair: Consider SD_NUMA when selecting the most
 idle group to schedule on

On Mon, Feb 12, 2018 at 05:11:30PM +0000, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50442697b455..0192448e43a2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5917,6 +5917,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  	if (!idlest)
>  		return NULL;
>  
> +	/*
> +	 * When comparing groups across NUMA domains, it's possible for the
> +	 * local domain to be very lightly loaded relative to the remote
> +	 * domains but "imbalance" skews the comparison making remote CPUs
> +	 * look much more favourable. When considering cross-domain, add
> +	 * imbalance to the runnable load on the remote node and consider
> +	 * staying local.
> +	 */
> +	if ((sd->flags & SD_NUMA) &&
> +	    min_runnable_load + imbalance >= this_runnable_load)
> +		return NULL;
> +
>  	if (min_runnable_load > (this_runnable_load + imbalance))
>  		return NULL;

So this is basically a spread vs group decision, which we typically do
using SD_PREFER_SIBLING. Now that flag is a bit awkward in that it's set
on the child domain.

Now, we set it for SD_SHARE_PKG_RESOURCES (aka LLC), which means that for
our typical modern NUMA system we indicate we want to spread across the
lowest NUMA level. And regular load balancing will do so.
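
To illustrate why the child-domain placement is awkward, here is a minimal
userspace sketch, not kernel code. The struct, the function name and the
flag value are all hypothetical stand-ins; only the idea that the parent
level has to look at its child's flags comes from the text above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit; the real SD_PREFER_SIBLING value is defined by the kernel. */
#define MODEL_SD_PREFER_SIBLING 0x1u

/* Hypothetical, stripped-down stand-in for struct sched_domain. */
struct model_domain {
	unsigned int flags;
	struct model_domain *child;
};

/*
 * The domain being balanced does not carry the flag itself; it has to
 * consult its child. A bottom-level domain (no child) never "prefers
 * siblings" in this model.
 */
static bool model_prefer_sibling(const struct model_domain *sd)
{
	return sd->child && (sd->child->flags & MODEL_SD_PREFER_SIBLING);
}
```

In this model, an LLC-level domain with the flag set causes its NUMA-level
parent to report a preference for spreading, while the LLC domain itself
does not.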

Now you modify the idlest code for initial placement to go against the
stable behaviour, which is unfortunate.

However, if we have numa balancing enabled, that will counteract
the normal spreading across nodes, so in that regard it makes sense, but
the above code is not conditional on numa balancing.
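
As a rough illustration of what "conditional on numa balancing" could look
like, here is a userspace sketch of only the two checks in the quoted hunk.
It is not a proposed patch: the function name, the placeholder flag value
and the numa_balancing_enabled parameter are assumptions made for the
example, and the rest of find_idlest_group() is not modelled.

```c
#include <stdbool.h>

/* Placeholder bit; the real SD_NUMA value is defined by the kernel. */
#define MODEL_SD_NUMA 0x1u

/*
 * Returns true where the quoted code would "return NULL" (stay local).
 * numa_balancing_enabled is a hypothetical extra input; the quoted hunk
 * applies the first check whenever SD_NUMA is set.
 */
static bool model_stay_local(unsigned int sd_flags, bool numa_balancing_enabled,
			     unsigned long min_runnable_load,
			     unsigned long this_runnable_load,
			     unsigned long imbalance)
{
	/* New check from the hunk, additionally gated on NUMA balancing. */
	if ((sd_flags & MODEL_SD_NUMA) && numa_balancing_enabled &&
	    min_runnable_load + imbalance >= this_runnable_load)
		return true;

	/* Pre-existing check: remote is worse even after allowing for imbalance. */
	if (min_runnable_load > this_runnable_load + imbalance)
		return true;

	return false;
}
```

With min_runnable_load = 100, this_runnable_load = 150 and imbalance = 60,
the gated check keeps the task local when NUMA balancing is on; with it
off, neither check fires and the remote group stays a candidate.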

I'm torn and confused...