Message-ID: <87bla8ue3e.mognet@arm.com>
Date:   Wed, 21 Apr 2021 11:27:49 +0100
From:   Valentin Schneider <valentin.schneider@....com>
To:     Oliver Sang <oliver.sang@...el.com>
Cc:     0day robot <lkp@...el.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        ying.huang@...el.com, feng.tang@...el.com, zhengjun.xing@...el.com,
        Lingutla Chandrasekhar <clingutla@...eaurora.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Qais Yousef <qais.yousef@....com>,
        Quentin Perret <qperret@...gle.com>,
        Pavan Kondeti <pkondeti@...eaurora.org>,
        Rik van Riel <riel@...riel.com>, aubrey.li@...ux.intel.com,
        yu.c.chen@...el.com
Subject: Re: [sched/fair]  38ac256d1c:  stress-ng.vm-segv.ops_per_sec -13.8% regression


Hi,

On 21/04/21 11:20, Oliver Sang wrote:
> hi, Valentin Schneider,
>
> On Wed, Apr 14, 2021 at 06:17:38PM +0100, Valentin Schneider wrote:
>> On 14/04/21 13:21, kernel test robot wrote:
>> > Greeting,
>> >
>> > FYI, we noticed a -13.8% regression of stress-ng.vm-segv.ops_per_sec due to commit:
>> >
>> >
>> > commit: 38ac256d1c3e6b5155071ed7ba87db50a40a4b58 ("[PATCH v5 1/3] sched/fair: Ignore percpu threads for imbalance pulls")
>> > url: https://github.com/0day-ci/linux/commits/Valentin-Schneider/sched-fair-load-balance-vs-capacity-margins/20210408-060830
>> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 0a2b65c03e9b47493e1442bf9c84badc60d9bffb
>> >
>> > in testcase: stress-ng
>> > on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
>> > with following parameters:
>> >
>> >       nr_threads: 10%
>> >       disk: 1HDD
>> >       testtime: 60s
>> >       fs: ext4
>> >       class: os
>> >       test: vm-segv
>> >       cpufreq_governor: performance
>> >       ucode: 0x5003006
>> >
>> >
>>
>> That's almost exactly the same result as [1], which is somewhat annoying
>> for me because I wasn't able to reproduce those results back then. Save
>> from scrounging the exact same machine to try this out, I'm not sure what's
>> the best way forward. I guess I can re-run the workload on whatever
>> machines I have and try to spot any potentially problematic pattern in the
>> trace...
>
> which machine model did you use where the regression could not be reproduced?
> we could check whether we have a similar model and then re-check on our machine.
>

I tested this on:
o Ampere eMAG (arm64, 32 cores)
o 2-socket Xeon E5-2690 (x86, 40 cores)

and found at worst a -0.3% regression and at best a 2% improvement. I know
that x86 box is somewhat ancient, but it's been my go-to "have I broken
x86?" test victim for a while :-)

> BTW, we supplied perf data in the original report, not sure if it is helpful?
> or do you have suggestions on which kind of data would be more helpful to you?
> we will continuously improve our report based on suggestions from the community.
> Thanks a lot!
>

Staring at it some more, I notice a huge uptick in:

- major page faults (+315.2% and +270%)
- cache misses (+125.2% and +131.0%)

I don't really get the page faults; the cache misses I could somewhat
understand: the patch adds p->flags and (p->set_child_tid)->flags accesses,
which are in different cachelines than p->se and p->cpus_mask used in
can_migrate_task().
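
To make the cacheline argument concrete, here is a rough userspace sketch of
the access pattern. Everything in it is a mock: struct mock_task, struct
mock_kthread, the padding, the 64-byte cacheline size and the helper names are
assumptions for illustration only and do not reflect the real task_struct
layout. It only shows that the per-CPU kthread check touches a field of the
task that sits apart from the ones the migration check already reads, plus one
more dereference into a separate object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHELINE          64          /* assumed cacheline size */
#define PF_KTHREAD         0x00200000  /* mock copy of the kernel flag */
#define KTHREAD_IS_PER_CPU 0           /* mock bit index in kthread flags */

/* Mock of the per-kthread data reached through p->set_child_tid. */
struct mock_kthread {
	unsigned long flags;
};

/* Mock task; the padding stands in for whatever lies between the fields. */
struct mock_task {
	unsigned int flags;        /* stand-in for p->flags */
	char pad0[CACHELINE];
	char se[CACHELINE];        /* stand-in for p->se */
	unsigned long cpus_mask;   /* stand-in for p->cpus_mask */
	char pad1[CACHELINE];
	void *set_child_tid;       /* points at a mock_kthread for kthreads */
};

/* Index of the cacheline a given byte offset falls into. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}

/* Mock of the added check: two loads from the task, one from the kthread. */
static bool mock_kthread_is_per_cpu(const struct mock_task *p)
{
	const struct mock_kthread *k;

	if (!(p->flags & PF_KTHREAD))   /* load 1: p->flags */
		return false;

	k = p->set_child_tid;           /* load 2: p->set_child_tid */
	assert(k != NULL);              /* mock invariant for kthreads */

	/* load 3: a different object, hence a different cacheline */
	return (k->flags & (1UL << KTHREAD_IS_PER_CPU)) != 0;
}
```

On a real kernel build, something like pahole on task_struct would give the
actual offsets; the sketch only illustrates why extra misses are plausible.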

I think I could dig some more into this with perf, but I'd need to be able
to reproduce this locally first...

>>
>> [1]: http://lore.kernel.org/r/20210223023004.GB25487@xsang-OptiPlex-9020