Date:	Thu, 7 Apr 2016 11:17:28 -0400
From:	Chris Mason <clm@...com>
To:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Mike Galbraith <mgalbraith@...e.de>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC] select_idle_sibling experiments

On Tue, Apr 05, 2016 at 02:08:22PM -0400, Chris Mason wrote:
> Hi everyone,
> 
> We're porting the fb kernel up to 4.5, and one of our last few out-of-tree
> patches is a hack to try harder to find idle cpus when waking up tasks.
> This helps in pretty much every workload we run, mostly because they all
> get tuned with a similar setup:
> 
> 1) find the load where latencies stop being acceptable
> 2) Run the server at just a little less than that
> 
> Usually this means our CPUs are just a little bit idle, and a poor
> scheduler decision to place a task on a busy CPU instead of an idle CPU
> ends up impacting our p99 latencies.
> 
> Mike helped us with this last year, fixing up wake_wide() to improve
> things.  But we still ended up having to go back to the old hack.
> 
> I started with a small-ish program to benchmark wakeup latencies.  The
> basic idea is a bunch of worker threads that sit around and burn CPU.
> Every once in a while they send a message to a message thread.
> 
> The message thread records the time it woke up the worker, and the
> worker records the delta between that time and the time it actually got
> the CPU again.  At the end it spits out a latency histogram.  The only
> thing we record is the wakeup latency; there's no measurement of 'work
> done' or any of the normal things you'd expect in a benchmark.
> 
> It has knobs for cpu think time, and for how long the messenger thread
> waits before replying.  Here's how I'm running it with my patch:
> 
> ./schbench -c 30000 -s 30000 -m 6 -t 24 -r 30
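
For reference, here's a minimal sketch of the measurement loop described
above.  This is not the schbench source: the condvar wakeup path, the
usleep() "think time", the loop count, and all names are stand-ins, and the
histogram is left out.

/* Messenger stamps wake_time, worker measures how long it took to
 * actually get back on the CPU.  Build with: gcc -O2 -pthread */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define LOOPS 10

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct timespec wake_time;
static int pending;

static uint64_t nsec_delta(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000000ULL +
	       (b->tv_nsec - a->tv_nsec);
}

static void *worker(void *arg)
{
	struct timespec now;

	(void)arg;
	for (int i = 0; i < LOOPS; i++) {
		pthread_mutex_lock(&lock);
		while (!pending)
			pthread_cond_wait(&cond, &lock);
		pending--;
		pthread_mutex_unlock(&lock);

		/* delta between "messenger woke us up" and "we're running" */
		clock_gettime(CLOCK_MONOTONIC, &now);
		printf("wakeup latency: %llu nsec\n",
		       (unsigned long long)nsec_delta(&wake_time, &now));
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	for (int i = 0; i < LOOPS; i++) {
		usleep(1000);		/* stand-in for the cpu think time knob */
		pthread_mutex_lock(&lock);
		clock_gettime(CLOCK_MONOTONIC, &wake_time);
		pending++;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}
	pthread_join(t, NULL);
	return 0;
}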

FYI, I changed schbench around a bit, and fixed a bug with -m (it was
ignored and always forced to 2).

The new code is here:

https://git.kernel.org/cgit/linux/kernel/git/mason/schbench.git/

I added a pipe simulation mode too, since I wanted wakeup latencies for
a raw throughput test as well as my original workload.  -p takes the size
of the transfer you want to simulate.  There's no pipe involved; it's
just doing memsets on pages and waking the other thread with futexes
(a rough sketch follows the numbers below).  The latency is still only
the latency of the worker thread wakeup:

# taskset -c 0 ./schbench -p 4 -m 1 -t 1 -r 20
Latency percentiles (usec)
        50.0000th: 1
        75.0000th: 2
        90.0000th: 2
        95.0000th: 2
        *99.0000th: 2
        99.5000th: 2
        99.9000th: 6
        Over=0, min=0, max=43
avg worker transfer: 372311.38 ops/sec 1.42MB/s
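
To make the -p mode concrete, here's a rough sketch of the futex-plus-memset
ping-pong described above.  It is not the schbench implementation: the buffer
size, loop count, and names are made up, and the wakeup-latency timing and
histogram are omitted.

/* Two threads ping-pong with futexes, each memset()ing a buffer of the
 * "transfer" size before waking the other.  Build with: gcc -O2 -pthread */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define XFER_SIZE 4096		/* stand-in for the -p <bytes> argument */
#define LOOPS	  100000

static uint32_t ping, pong;
static char buf_a[XFER_SIZE], buf_b[XFER_SIZE];

static long futex(uint32_t *uaddr, int op, uint32_t val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void wake(uint32_t *f)
{
	__atomic_store_n(f, 1, __ATOMIC_RELEASE);
	futex(f, FUTEX_WAKE, 1);
}

static void wait_on(uint32_t *f)
{
	/* sleep until the other side sets the flag, then consume it */
	while (!__atomic_exchange_n(f, 0, __ATOMIC_ACQUIRE))
		futex(f, FUTEX_WAIT, 0);
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < LOOPS; i++) {
		wait_on(&ping);
		memset(buf_b, i, sizeof(buf_b));	/* fake the "transfer" */
		wake(&pong);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	for (int i = 0; i < LOOPS; i++) {
		memset(buf_a, i, sizeof(buf_a));	/* fake the "transfer" */
		wake(&ping);
		wait_on(&pong);
	}
	pthread_join(t, NULL);
	printf("%d round trips done\n", LOOPS);
	return 0;
}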

# taskset -c 0 perf bench sched pipe -l 5000000
# Running 'sched/pipe' benchmark:
# Executed 5000000 pipe operations between two processes

     Total time: 20.359 [sec]

       4.071851 usecs/op
         245588 ops/sec

I'm taking another stab at fixing the regression for picking an idle
core in my first patch, and I'll get some benchmarks with Mike's nohz
patch going as well.

-chris
