Message-ID: <20170630131124.GB12077@codeblueprint.co.uk>
Date:   Fri, 30 Jun 2017 14:11:24 +0100
From:   Matt Fleming <matt@...eblueprint.co.uk>
To:     Josef Bacik <josef@...icpanda.com>
Cc:     Joel Fernandes <joelaf@...gle.com>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Juri Lelli <Juri.Lelli@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Patrick Bellasi <patrick.bellasi@....com>,
        Brendan Jackman <brendan.jackman@....com>,
        Chris Redpath <Chris.Redpath@....com>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: wake_wide mechanism clarification

On Thu, 29 Jun, at 08:49:13PM, Josef Bacik wrote:
> 
> It may be worth trying this with schedbench and tracing it to see how it
> turns out in practice, as that's the workload that generated all this
> discussion before.  I
> imagine generally speaking this works out properly.  The small regression I
> reported before was at low RPS, so we wouldn't be waking up as many tasks as
> often, so we would be returning 0 from wake_wide() and we'd get screwed.  This
> is where I think possibly dropping the slave < factor part of the test would
> address that, but I'd have to trace it to say for sure.  Thanks,
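
For reference (quoting from a 4.4-ish kernel/sched/fair.c, so minor
drift from your tree is possible), the whole heuristic under
discussion looks like this:

  static int wake_wide(struct task_struct *p)
  {
          /* How often waker/wakee switch whom they wake (decayed over time). */
          unsigned int master = current->wakee_flips;
          unsigned int slave = p->wakee_flips;
          int factor = this_cpu_read(sd_llc_size);

          if (master < slave)
                  swap(master, slave);
          if (slave < factor || master < slave * factor)
                  return 0;
          return 1;
  }

Returning 0 roughly means "this looks like a 1:1 wakeup relationship,
try an affine placement near the waker".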

Just 2 weeks ago I was poking at wake_wide() because it's impacting
hackbench times now that we're better at balancing on fork() (see commit
6b94780e45c1 ("sched/core: Use load_avg for selecting idlest group")).

What's happening is that occasionally the hackbench times will be
pretty large because the hackbench tasks are being pulled back and
forth across NUMA domains due to the wake_wide() logic.
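
For context on why that causes cross-node pulls: wake_wide() gates the
affine wakeup path. In select_task_rq_fair() from this era it's used
roughly like this (again quoting a 4.4-ish tree):

          if (sd_flag & SD_BALANCE_WAKE)
                  want_affine = !wake_wide(p) &&
                                cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

When wake_wide() returns 0 we try to place the wakee near the waker's
CPU, so if waker and wakee last ran on different nodes the wakee gets
dragged across the node boundary, and then possibly back again on the
next wakeup in the other direction.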

Reproducing this issue requires a NUMA box with more CPUs than
hackbench tasks. I was using an 80-CPU, 2-node NUMA box with 1
hackbench group (20 readers, 20 writers).
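
If you want to approximate this setup without mmtests, perf's
reimplementation of hackbench should be close enough (1 group is 20
senders + 20 receivers talking over pipes):

  perf bench sched messaging --pipe --group 1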

I did the following very quick hack,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1f5efa51dc7..c1bc1b0434bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5055,7 +5055,7 @@ static int wake_wide(struct task_struct *p)

        if (master < slave)
                swap(master, slave);
-       if (slave < factor || master < slave * factor)
+       if (master < slave * factor)
                return 0;
        return 1;
 }
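
To make the effect of the change concrete (numbers illustrative, not
traced): say factor = sd_llc_size = 40 on this box, the waker has
wakee_flips = 500 and the wakee has wakee_flips = 10. The current code
returns 0 because slave < factor (10 < 40), so we attempt an affine
wakeup; with the hack we return 1, because master >= slave * factor
(500 >= 400), and the wakee is left where it was rather than being
pulled towards the waker.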

The hack produces the following results for 1 group (40 tasks) on one
of SUSE's enterprise kernels:

hackbench-process-pipes
                            4.4.71                4.4.71
                          patched+ patched+-wake-wide-fix
Min      1        0.7000 (  0.00%)      0.8480 (-21.14%)
Amean    1        1.0343 (  0.00%)      0.9073 ( 12.28%)
Stddev   1        0.2373 (  0.00%)      0.0447 ( 81.15%)
CoeffVar 1       22.9447 (  0.00%)      4.9300 ( 78.51%)
Max      1        1.2270 (  0.00%)      0.9560 ( 22.09%)

You'll see that the minimum value is worse with my change, but the
maximum is much better, and the variance collapses (CoeffVar drops
from ~23% to ~5%): the occasional pathological runs disappear.

So the current wake_wide() code does help sometimes, but it hurts
sometimes too.

I'm happy to gather performance data for any code suggestions.
