Date:	Mon, 4 Apr 2016 10:59:45 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Jiri Olsa <jolsa@...hat.com>
Cc:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	James Hartsock <hartsjc@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>,
	Kirill Tkhai <ktkhai@...allels.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC] sched: unused cpu in affine workload


* Jiri Olsa <jolsa@...hat.com> wrote:

> hi,
> we've noticed the following issue in one of our workloads.
> 
> I have a 24-CPU server with the following sched domains:
>   domain 0: (pairs)
>   domain 1: 0-5,12-17 (group1)  6-11,18-23 (group2)
>   domain 2: 0-23 level NUMA
> 
> I run a CPU-hogging workload on the following CPUs:
>   4,6,14,18,19,20,23
> 
> that is:
>   4,14          CPUs from group1
>   6,18,19,20,23 CPUs from group2
> 
> the workload process gets its affinity set up via 'taskset -c ${CPUs} workload ...'
> and forks a child for every CPU
> 
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
> CPUs from group2 idle
> 
> AFAICS from the code, the reason for this is that the load balancing
> follows the sched domain setup (topology) and does not take affinity
> setups like this into account. The code in find_busiest_group running
> on the idle CPU from group2 will find group1 as busiest, but its
> average load will be smaller than that of the local group, so there's
> no task pulling.
> 
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?
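
To put numbers on the find_busiest_group() point above, here is a toy
calculation - not the kernel code, just the per-group averages it roughly
compares, with the task counts taken from the quoted scenario:

	#include <stdio.h>

	int main(void)
	{
		const int group_cpus = 12;	/* CPUs in each of the two groups     */
		const int group1_tasks = 3;	/* running on CPUs 4 and 14           */
		const int group2_tasks = 4;	/* running on 4 of CPUs 6,18,19,20,23 */

		double g1_avg = (double)group1_tasks / group_cpus;	/* ~0.25 */
		double g2_avg = (double)group2_tasks / group_cpus;	/* ~0.33 */

		printf("group1 avg load:         %.2f\n", g1_avg);
		printf("group2 (local) avg load: %.2f\n", g2_avg);

		if (g1_avg <= g2_avg)
			printf("remote group does not look busier -> no pull,\n"
			       "even though an allowed CPU in group2 sits idle\n");
		return 0;
	}

The remote group's average is diluted by its 10 idle (but disallowed) CPUs, 
which is exactly why no imbalance is detected.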

Yeah, so the principle with user-pinning of tasks to CPUs was always:

 - pinning a task to a single CPU should obviously work fine; it's the primary
   usecase for isolation.

 - pinning a task to an arbitrary subset of CPUs is a 'hard' problem
   mathematically that the scheduler never truly wanted to solve in a frontal
   fashion.

... but that principle was set into place well before we did the NUMA scheduling 
work, which in itself is a highly non-trivial load optimization problem to begin 
with, so we might want to reconsider.

So there are two directions I can suggest:

 - if you can come up with workable small-scale solutions to scratch an itch
   that comes up in practice then that's obviously good, as long as it does not
   regress anything else.

 - if you want to come up with a 'complete' solution then please don't put it into
   hot paths such as wakeup or context switching, or any of the hardirq methods,
   but try to integrate it with the NUMA scheduling slow path.

The NUMA balancing slow path is softirq driven and runs at a reasonably low 
frequency, so it does not cause many performance problems.
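
(For reference, how low that frequency is can be inspected from userspace via 
the NUMA balancing sysctls; a minimal sketch, assuming a kernel of this 
vintage where these knobs live under /proc/sys/kernel/:)

	#include <stdio.h>

	/* Print one sysctl value, silently skipping knobs that don't exist. */
	static void show(const char *path)
	{
		char buf[64];
		FILE *f = fopen(path, "r");

		if (!f)
			return;
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", path, buf);
		fclose(f);
	}

	int main(void)
	{
		show("/proc/sys/kernel/numa_balancing");
		show("/proc/sys/kernel/numa_balancing_scan_period_min_ms");
		show("/proc/sys/kernel/numa_balancing_scan_period_max_ms");
		return 0;
	}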

The two problems (NUMA affinity and user affinity) are also loosely related on a 
conceptual level: the NUMA affinity optimization problem can be considered as a 
workload-determined, arbitrary 'NUMA mask' being optimized from first principles.

There's one ABI detail: this is true only as long as SMP affinity masks follow 
node boundaries - the current NUMA balancing code is very much node granular, so 
the two can only be merged if the ->cpus_allowed mask follows node boundaries as 
well.
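
For illustration, that node-boundary condition can be checked from userspace 
with a libnuma-based sketch like the one below (link with -lnuma; it treats 
the current task's affinity mask as 'following node boundaries' if every 
node's CPUs are either all inside the mask or all outside it):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <numa.h>

	int main(void)
	{
		cpu_set_t allowed;
		int node, cpu, follows = 1;

		if (numa_available() < 0)
			return 1;
		if (sched_getaffinity(0, sizeof(allowed), &allowed))
			return 1;

		for (node = 0; node <= numa_max_node(); node++) {
			struct bitmask *cpus = numa_allocate_cpumask();
			int in = 0, out = 0;

			if (numa_node_to_cpus(node, cpus) < 0) {
				numa_free_cpumask(cpus);
				continue;
			}
			for (cpu = 0; cpu < numa_num_configured_cpus(); cpu++) {
				if (!numa_bitmask_isbitset(cpus, cpu))
					continue;
				if (CPU_ISSET(cpu, &allowed))
					in++;
				else
					out++;
			}
			if (in && out)
				follows = 0;	/* node only partially covered */
			numa_free_cpumask(cpus);
		}
		printf("affinity mask %s node boundaries\n",
		       follows ? "follows" : "does not follow");
		return 0;
	}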

A third approach would be to extend the NUMA balancing code to be CPU granular 
(without changing any task placement behavior of the current NUMA balancing code, 
of course), with node granular being a special case. This would fit the cgroups 
(and virtualization) usecases, but that would be a major change.

Thanks,

	Ingo
