Message-ID: <20201026142455.GA13495@vingu-book>
Date:   Mon, 26 Oct 2020 15:24:55 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Chris Mason <clm@...com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Rik van Riel <riel@...riel.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] fix scheduler regression from "sched/fair: Rework
 load_balance()"

On Monday 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> 
> > Hi Chris
> > 
> > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@...com> wrote:
> > > 
> > > Hi everyone,
> > > 
> > > We’re validating a new kernel in the fleet, and compared with v5.2,
> > 
> > Which version are you using ?
> > several improvements have been added since v5.5 and the rework of
> > load_balance
> 
> We’re validating v5.6, but all of the numbers referenced in this patch are
> against v5.9.  I usually try to back port my way to victory on this kind of
> thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt
> this benchmark.

OK, thanks for the confirmation.

I have been able to reproduce the problem on my setup.

Could you try the fix below?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
         * emptying busiest.
         */
        if (local->group_type == group_has_spare) {
-               if (busiest->group_type > group_fully_busy) {
+               if ((busiest->group_type > group_fully_busy) &&
+                   (busiest->group_weight > 1)) {
                        /*
                         * If busiest is overloaded, try to fill spare
                         * capacity. This might end up creating spare capacity


When we calculate an imbalance at the smallest level, i.e. between CPUs (group_weight == 1),
we should try to spread tasks across CPUs instead of trying to fill spare capacity.
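
To make the changed condition easier to follow, here is a minimal userspace sketch of the branch decision only. It is not the kernel code: the names pick_strategy(), enum group_type_sketch and enum strategy_sketch are hypothetical, the enum is assumed to mirror the ordering of the kernel's group_type values, and the actual imbalance arithmetic done by calculate_imbalance() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the kernel's enum group_type. Only the
 * relative ordering matters for the comparison below; the ordering
 * here is an assumption made for illustration.
 */
enum group_type_sketch {
	SK_HAS_SPARE = 0,
	SK_FULLY_BUSY,
	SK_MISFIT_TASK,
	SK_ASYM_PACKING,
	SK_IMBALANCED,
	SK_OVERLOADED,
};

/* Which branch the balancer would take (hypothetical labels). */
enum strategy_sketch {
	SK_OTHER_PATH,          /* local group has no spare capacity */
	SK_FILL_SPARE_CAPACITY, /* utilization-based migration */
	SK_SPREAD_TASKS,        /* task-count-based migration */
};

/*
 * Models only the condition touched by the patch. With with_fix set,
 * a busiest group made of a single CPU (weight 1) no longer takes the
 * "fill spare capacity" branch.
 */
static enum strategy_sketch
pick_strategy(enum group_type_sketch local_type,
	      enum group_type_sketch busiest_type,
	      unsigned int busiest_weight, bool with_fix)
{
	bool fill;

	assert(busiest_weight >= 1); /* a group has at least one CPU */

	if (local_type != SK_HAS_SPARE)
		return SK_OTHER_PATH;

	fill = busiest_type > SK_FULLY_BUSY;
	if (with_fix)
		fill = fill && busiest_weight > 1;

	return fill ? SK_FILL_SPARE_CAPACITY : SK_SPREAD_TASKS;
}
```

In this sketch, an overloaded single-CPU busiest group takes the fill-spare-capacity branch without the fix and the spread-tasks branch with it, while groups of more than one CPU behave the same either way.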


> 
> > 
> > > performance is ~2-3% lower for some of our workloads.  After some
> > > digging, Johannes found that our involuntary context switch rate was
> > > ~2x
> > > higher, and we were leaving a CPU idle a higher percentage of the
> > > time,
> > > even though the workload was trying to saturate the system.
> > > 
> > > We were able to reproduce the problem with schbench, and Johannes
> > > bisected down to:
> > > 
> > > commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
> > > Author: Vincent Guittot <vincent.guittot@...aro.org>
> > > Date:   Fri Oct 18 15:26:31 2019 +0200
> > > 
> > >      sched/fair: Rework load_balance()
> > > 
> > > Our working theory is the load balancing changes are leaving
> > > processes
> > > behind busy CPUs instead of moving them onto idle ones.  I made a few
> > > schbench modifications to make this easier to demonstrate:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
> > > 
> > > My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
> > > command line is:
> > 
> > What is the topology ? are they all part of the same LLC ?
> 
> We’ve seen the regression on both single socket and dual socket bare metal
> intel systems.  On the VM I reproduced with, I saw similar latencies with
> and without siblings configured into the topology.
> 
> -chris