[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <jhj7dtaokxe.mognet@arm.com>
Date: Thu, 03 Sep 2020 17:59:57 +0100
From: Valentin Schneider <valentin.schneider@....com>
To: Julia Lawall <julia.lawall@...ia.fr>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org,
Gilles Muller <Gilles.Muller@...ia.fr>
Subject: Re: SD_LOAD_BALANCE
Hi Julia,
On 03/09/20 15:09, Julia Lawall wrote:
> Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
> released in v5.8), with the comment:
>
> The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> sd_init().
>
> I have the impression that this was not quite true. The NUMA domain was
> not initialized with sd_init, and didn't have the SD_LOAD_BALANCE flag
> set.
Did you check the contents of
/proc/sys/kernel/sched_domain/cpu*/domain*/flags
(requires CONFIG_SCHED_DEBUG)? If the LSB is set, it would mean
SD_LOAD_BALANCE is set.
The sched_domain construction flow isn't the easiest thing to follow, but
NUMA domains *have* to go through sd_init().
What happens is we first go through sched_init_numa(), and there we add
some more topology levels on top of the default ones (or the arch-defined
ones if using an arch-defined topology hierarchy) by using the NUMA
distance table.
We then build the actual domains in sched_init_domains(), and that goes
through a loop that looks like
for_each_cpu() {
for_each_sd_topology() {
build_sched_domain() -> sd_init()
}
}
where the SD topology loop is going to iterate over the newly-added
NUMA-specific topology levels. Since that used to unconditionally set
SD_LOAD_BALANCE, NUMA domains really ought to have it.
If that wasn't the case, we would have fired the (now removed) warning in
sched_domain_debug_one() that would do:
if (!(sd->flags & SD_LOAD_BALANCE)) {
printk("does not load-balance\n");
if (sd->parent)
printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
return -1;
}
> The effect is that in v5.8, the for_each_domain loop in
> select_task_rq_fair can always end up at the global NUMA domain, and thus
> consider any pair of waking cpu (cpu) and previous cpus of the wakee
> (prev_cpu) as arguments to wake_affine. Up to v5.7, this was only
> possible if cpu and prev_cpu were together in some lower level domain, ie
> sharing the LLC. The effect is that in v5.8 wake_affine can often end up
> choosing as a target a core that does not share the LLC with the core
> where the thread ran previously. Threads then move around a lot between
> the different sockets.
>
> Was this intentional?
>
AFAICT it isn't forbidden for the logic here to peek outside of the
previous LLC. The NUMA reclaim distance thing says we allow affine wakeups
and fork / exec balancing to move a task to a CPU at most RECLAIM_DISTANCE
away (in NUMA distance values). However, I don't remember any patch
changing this between v5.7 and v5.8.
Briefly glancing over the kernel/sched log between v5.7 and v5.8, I don't
see any obvious culprits. Did you try to bisect this? If it indeed ends on
the SD_LOAD_BALANCE thing, well, I'll be off eating my keyboard.
> The effect can be seen in the traces of the parsec vips benchmark at the
> following URL:
>
> https://pages.lip6.fr/Julia.Lawall/vips.pdf
>
> The first two graphs (complete run and small fragment) are Linux v5.7 and
> the next two are Linux v5.8. The machine has 160 hardware threads
> organized in 4 sockets and the colors are by socket. In the small
> fragment for v5.7 (second graph), one can see that a given pid pretty much
> stays on the same socket, while in the corresponding fragment for v5.8
> (fourth graph), the pids move around between the sockets. The x's
> describe the unblocks that result in a migration. A pink x means that the
> migration is in the same socket, while a blue x means that the migration
> is to another socket. It's not apparent from the graphs, but by adding
> some tracing, it seems that the new socket is always the one of the core
> that handles the wakeup.
>
Interesting graphs, thanks for sharing!
> I haven't yet studied the early part of the execution of vips in detail,
> but I suspect that the same issue causes all of the threads to be
> initially on the same socket in v5.7, while in v5.8 they are more quickly
> dispersed to other sockets.
>
> My impression from the parsec and the NAS benchmarks is that the v5.8
> performance is a bit better than v5.7, probably because of getting more
> threads to more different sockets earlier, but other benchmarks might
> rely more on locality and might react less well to threads moving around
> so much in this way.
>
> julia
Powered by blists - more mailing lists