Date:   Wed, 11 Apr 2018 18:00:00 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Patrick Bellasi <patrick.bellasi@....com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "open list:THERMAL" <linux-pm@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Steve Muckle <smuckle@...gle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>
Subject: Re: [PATCH] sched/fair: schedutil: update only with all info
 available

On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
> Yes. and to be honest I don't have any clues of the root cause :-(
> Heiner mentioned that it's much better in latest linux-next but I
> haven't seen any changes related to the code of those patches

Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
next that are most likely to have affected this are rjw's
cpuidle-vs-nohz patches. The common denominator being nohz.

Now I think rjw's patches will ensure we enter nohz _less_, they avoid
stopping the tick when we expect to go idle for a short period only.

So if your patch makes nohz go wobbly, going nohz less will make that
better.
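
To make that reasoning concrete, here is a toy model (not the actual
cpuidle/nohz code). The function names, the 4 ms tick period and the simple
"predicted idle >= one tick" rule are all assumptions for illustration; it
only shows that such a rule enters nohz less often than always stopping the
tick:

```python
# Toy model only: the names and the threshold rule are assumptions,
# not the real cpuidle/nohz implementation.
TICK_NS = 4_000_000  # assumed 250 Hz tick, i.e. a 4 ms period


def should_stop_tick(predicted_idle_ns, tick_ns=TICK_NS):
    """Stop the tick only if the expected idle period outlasts one tick."""
    return predicted_idle_ns >= tick_ns


def count_nohz_entries(predicted_idles_ns, always_stop=False):
    """Count how many idle periods would enter nohz under each policy."""
    return sum(
        1 for p in predicted_idles_ns
        if always_stop or should_stop_tick(p)
    )
```

Feeding it a mix of short and long predicted idle periods, the
always-stop policy enters nohz for every one of them, while the
threshold policy enters only for the long ones.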

Of course, I've no actual clue as to what that patch (it's the last one
in the series, right?:

  31e77c93e432 ("sched/fair: Update blocked load when newly idle")

) does that is so offensive to that one machine. You never did manage to
reproduce, right?

Could it be that for some reason the nohz balancer now takes a very long
time to run?

Could something like the following happen (and this is really flaky
thinking here):

last CPU goes idle, we enter idle_balance(), that kicks ilb, ilb runs,
which somehow again triggers idle_balance and around we go?

I'm not immediately seeing how that could happen, but if we do something
daft like that we can tie up the CPU for a while, mostly with IRQs
disabled, and that would be visible as that latency he sees.
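
A toy model of that hypothetical loop, purely to show why it would show up
as latency. Nothing here is real scheduler code: the `ilb_retriggers` flag,
the per-pass cost and the `max_rounds` cap are all made-up assumptions:

```python
# Toy model of the hypothesised loop; none of this is real scheduler code.
def simulate(ilb_retriggers, max_rounds=1000, cost_us=50):
    """Return (rounds, time_us) spent before the CPU finally idles.

    ilb_retriggers: hypothetical flag, True if running the ilb somehow
                    causes another idle_balance() pass.
    cost_us:        assumed cost of one idle_balance() + ilb pass.
    max_rounds:     artificial cap so the model terminates.
    """
    rounds = 0
    pending = True  # last CPU goes idle, we enter idle_balance()
    while pending and rounds < max_rounds:
        rounds += 1               # idle_balance() kicks the ilb, the ilb runs
        pending = ilb_retriggers  # does that trigger idle_balance() again?
    return rounds, rounds * cost_us
```

With `ilb_retriggers=False` the model finishes after a single pass; with
`True` it only stops at the artificial cap, which is the "tie up the CPU
for a while" case.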

