linux-kernel - Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks


Message-ID: <20180614083640.dekqhsopoefnfhb4@techsingularity.net>
Date:   Thu, 14 Jun 2018 09:36:40 +0100
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Jirka Hladky <jhladky@...hat.com>
Cc:     Jakub Racek <jracek@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Len Brown <lenb@...nel.org>, linux-acpi@...r.kernel.org,
        "kkolakow@...hat.com" <kkolakow@...hat.com>
Subject: Re: [4.17 regression] Performance drop on kernel-4.17 visible on
 Stream, Linpack and NAS parallel benchmarks

On Mon, Jun 11, 2018 at 06:07:58PM +0200, Jirka Hladky wrote:
> >
> > Fixing any part of it for STREAM will end up regressing something else.
> 
> 
> I fully understand that. We run a set of benchmarks and we always look at
> the results as the ensemble. Looking only at one benchmark would be
> completely wrong.
> 

Indeed.

> And in fact, we do see regression on NAS benchmark going from 4.16 to 4.17
> kernel as well. On 4 NUMA node server with Xeon Gold CPUs we see the
> regression around 26% for ft_C, 35% for mg_C_x and 25% for sp_C_x. The
> biggest regression is with 32 threads (the box has 96 CPUs in total). I
> have not yet tried if it's linked to
> 2c83362734dad8e48ccc0710b5cd2436a0323893. I will do that testing tomorrow.
> 

It would be worthwhile. However, it's also worth noting that 32 threads
out of 96 implies that the 4 nodes would not be evenly used, which may
account for some of the discrepancy. ft and mg for C class are typically
short-lived on modern hardware and sp is not particularly long-lived
either. Hence, they are most likely to see problems with a patch that
avoids spreading tasks across the machine early. Admittedly, I have not
seen similar slowdowns but NAS has a lot of configuration options.
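
As a rough illustration of how 32 threads can end up unevenly spread over
4 nodes, the sketch below compares two hypothetical placement policies. It
assumes the 96 CPUs are split equally (24 per node); the function names and
both policies are made up for illustration and are not how the scheduler
actually places tasks.

```python
def fill_first(threads, nodes, cpus_per_node):
    """Hypothetical policy: keep threads on a node until it is full."""
    placement = []
    remaining = threads
    for _ in range(nodes):
        here = min(remaining, cpus_per_node)
        placement.append(here)
        remaining -= here
    return placement


def round_robin(threads, nodes):
    """Hypothetical policy: spread threads across nodes as evenly as possible."""
    base, extra = divmod(threads, nodes)
    return [base + (1 if i < extra else 0) for i in range(nodes)]


# Assumption: 96 CPUs divided equally over 4 NUMA nodes.
NODES = 4
CPUS_PER_NODE = 96 // NODES

packed = fill_first(32, NODES, CPUS_PER_NODE)
spread = round_robin(32, NODES)
print("fill-first threads per node:", packed)
print("round-robin threads per node:", spread)
```

With fill-first placement two nodes sit idle while one is saturated, so a
short-lived run may finish before the load is rebalanced.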

In terms of the speed of migration, it may be worth checking how often the
mm_numa_migrate_ratelimit tracepoint is triggered, with bonus points for
using nr_pages to calculate how many pages get throttled from migrating. If
it triggers frequently then you could test increasing ratelimit_pages (which
is set at compile time despite not being a macro). It still may not help
tasks that are too short-lived for a misplacement to be identified and the
migration to complete.
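
A minimal sketch of that counting, assuming the tracepoint has been enabled
and the trace buffer saved as text. The line format in the regular
expression (pid=, dst_nid= and nr_pages= fields after the event name) is an
assumption about how the event is printed and should be checked against the
real output; summarize_ratelimit and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumed text format of the event; adjust to match the actual trace output.
RATELIMIT_RE = re.compile(
    r"mm_numa_migrate_ratelimit:"
    r".*?\bpid=(?P<pid>\d+)"
    r".*?\bdst_nid=(?P<nid>\d+)"
    r".*?\bnr_pages=(?P<pages>\d+)"
)


def summarize_ratelimit(lines):
    """Return (event count, total throttled pages, pages per destination node)."""
    events = 0
    pages = 0
    per_node = Counter()
    for line in lines:
        match = RATELIMIT_RE.search(line)
        if match is None:
            continue  # some other event or a header line
        count = int(match.group("pages"))
        events += 1
        pages += count
        per_node[int(match.group("nid"))] += count
    return events, pages, dict(per_node)


# Made-up sample lines, not real measurements.
SAMPLE = [
    "# tracer: nop",
    " ft.C.x-1234 [005] .... 100.000001: mm_numa_migrate_ratelimit: "
    "comm=ft.C.x pid=1234 dst_nid=1 nr_pages=512",
    " ft.C.x-1235 [017] .... 100.000420: mm_numa_migrate_ratelimit: "
    "comm=ft.C.x pid=1235 dst_nid=2 nr_pages=1",
    " ft.C.x-1235 [017] .... 100.000500: sched_switch: prev_pid=1235",
]

events, pages, per_node = summarize_ratelimit(SAMPLE)
print(f"{events} ratelimit events, {pages} pages throttled, per node: {per_node}")
```

To run it on real data, pass an open file of saved trace text to
summarize_ratelimit in place of SAMPLE. Dividing the event count by the
length of the run gives the trigger frequency.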

-- 
Mel Gorman
SUSE Labs