Message-ID: <87k0omxe6w.mognet@arm.com>
Date:   Wed, 28 Apr 2021 23:00:07 +0100
From:   Valentin Schneider <valentin.schneider@....com>
To:     Oliver Sang <oliver.sang@...el.com>
Cc:     0day robot <lkp@...el.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        ying.huang@...el.com, feng.tang@...el.com, zhengjun.xing@...el.com,
        Lingutla Chandrasekhar <clingutla@...eaurora.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Qais Yousef <qais.yousef@....com>,
        Quentin Perret <qperret@...gle.com>,
        Pavan Kondeti <pkondeti@...eaurora.org>,
        Rik van Riel <riel@...riel.com>, aubrey.li@...ux.intel.com,
        yu.c.chen@...el.com, Mel Gorman <mgorman@...e.de>
Subject: Re: [sched/fair]  38ac256d1c:  stress-ng.vm-segv.ops_per_sec -13.8% regression

On 22/04/21 21:42, Valentin Schneider wrote:
> On 22/04/21 10:55, Valentin Schneider wrote:
>> I'll go find myself some other x86 box and dig into it;
>> I'd rather not leave this hanging for too long.
>
> So I found myself a dual-socket Xeon Gold 5120 @ 2.20GHz (64 CPUs) and
> *there* I get a somewhat consistent ~-6% regression. As I'm suspecting
> cacheline shenanigans, I also ran that with Peter's recent
> kthread_is_per_cpu() change, and that brings it down to ~-3%
>

Ha ha ho ho, so that was a red herring. My statistical paranoia somewhat
paid off, and the kthread_is_per_cpu() thing doesn't really change anything
when you stare at 20+ iterations of that vm-segv thing.

As far as I can tell, the culprit is the loss of LBF_SOME_PINNED. By some
happy accident, the baseline load balancer repeatedly iterates over per-CPU
kthreads, sets LBF_SOME_PINNED, and causes a group to be classified as
group_imbalanced in a later load-balance. This, in turn, forces a 1-task
pull, and repeating this pattern ~25 times per second ends up increasing CPU
utilization by ~5% over the span of the benchmark.
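
To make that chain of events concrete, here is a toy model in C. It is a
sketch under assumptions, not the code from kernel/sched/fair.c: every toy_*
and TOY_* name is hypothetical, the flag value is illustrative, and the
skip_pcpu_kthreads switch only stands in for "with the commit applied". It
shows how rejecting per-CPU kthreads before the affinity check means the
pinned flag never gets set, so the later group_imbalanced classification
(and its 1-task pull) goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names echo kernel/sched/fair.c, the logic is simplified. */
#define TOY_LBF_SOME_PINNED 0x08u   /* illustrative flag value */

enum toy_group_type { TOY_GROUP_HAS_SPARE, TOY_GROUP_IMBALANCED };

struct toy_task {
    bool per_cpu_kthread;   /* bound to one CPU by design */
    bool allowed_on_dst;    /* affinity mask includes the destination CPU */
};

struct toy_group {
    bool imbalance;         /* sticky hint left by an earlier balance pass */
};

/*
 * Decide whether one task may move. With skip_pcpu_kthreads set (the
 * assumed post-commit behaviour), per-CPU kthreads are rejected before the
 * affinity check, so they never raise TOY_LBF_SOME_PINNED.
 */
bool toy_can_migrate(const struct toy_task *t, bool skip_pcpu_kthreads,
                     unsigned int *flags)
{
    if (skip_pcpu_kthreads && t->per_cpu_kthread)
        return false;
    if (!t->allowed_on_dst) {
        *flags |= TOY_LBF_SOME_PINNED;
        return false;
    }
    return true;
}

/* One balance pass: scan the tasks, then record the hint on the group. */
void toy_balance_pass(const struct toy_task *tasks, size_t n,
                      bool skip_pcpu_kthreads, struct toy_group *grp)
{
    unsigned int flags = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        (void)toy_can_migrate(&tasks[i], skip_pcpu_kthreads, &flags);
    if (flags & TOY_LBF_SOME_PINNED)
        grp->imbalance = true;
}

/* A later pass reads the hint left behind by the earlier one. */
enum toy_group_type toy_classify(const struct toy_group *grp)
{
    return grp->imbalance ? TOY_GROUP_IMBALANCED : TOY_GROUP_HAS_SPARE;
}

/* An imbalanced group forces a 1-task pull; otherwise nothing moves here. */
int toy_tasks_to_pull(enum toy_group_type type)
{
    return type == TOY_GROUP_IMBALANCED ? 1 : 0;
}
```

Running toy_balance_pass() over a task list containing one per-CPU kthread
with skip_pcpu_kthreads=false leaves the group flagged, so toy_classify()
reports TOY_GROUP_IMBALANCED and one task gets pulled; with
skip_pcpu_kthreads=true the same list leaves the group untouched.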

schedstats are somewhat noisy but seem to indicate the baseline had many
more migrations at the NUMA level (test machine has SMT, MC, NUMA). Because
of that I suspected

  b396f52326de ("sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains")

but reverting that actually makes things worse. I'm still digging, though
I'm slowly heading towards:

  https://www.youtube.com/watch?v=3L6i5AwVAbs