Message-ID: <bc85a40c-1ea0-9b57-6ba3-b920c436a02c@meta.com>
Date:   Mon, 17 Apr 2023 10:10:25 +0200
From:   Chris Mason <clm@...a.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        David Vernet <void@...ifault.com>,
        linux-kernel@...r.kernel.org, kernel-team@...com
Subject: schbench v1.0

Hi everyone,

Since we've been doing a lot of scheduler benchmarking lately, I wanted
to dust off schbench and see if I could make it more accurately model
the results we're seeing from production workloads.

I've reworked a few things and since it's somewhat different now I went
ahead and tagged v1.0:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

I also tossed in a README.md, which documents the arguments.

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/tree/README.md

The original schbench focused almost entirely on wakeup latencies, which
are still included in the output.  Instead of spinning for a fixed
amount of wall time, v1.0 now uses a loop of matrix multiplication to
simulate a web request.
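
To give a rough idea of the shape of that per-request work, here's a
minimal sketch (not the actual schbench code; the matrix size, loop
count, and global buffers are made-up simplifications):

/*
 * Sketch of a fixed amount of per-request CPU work: a naive matrix
 * multiply repeated a fixed number of times.  In a real benchmark each
 * worker thread would own its own buffers; globals keep this short.
 */
#define MATRIX_SIZE 64

static double a[MATRIX_SIZE][MATRIX_SIZE];
static double b[MATRIX_SIZE][MATRIX_SIZE];
static double c[MATRIX_SIZE][MATRIX_SIZE];

static void one_request_worth_of_work(unsigned long loops)
{
	for (unsigned long l = 0; l < loops; l++)
		for (int i = 0; i < MATRIX_SIZE; i++)
			for (int j = 0; j < MATRIX_SIZE; j++) {
				c[i][j] = 0;
				for (int k = 0; k < MATRIX_SIZE; k++)
					c[i][j] += a[i][k] * b[k][j];
			}
}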

David Vernet recently benchmarked EEVDF, CFS, and sched_ext against
production workloads:

https://lore.kernel.org/lkml/20230411020945.GA65214@maniforge/

And what we see in general is that involuntary context switches trigger
a basket of expensive interactions between CPU/memory/disk.  This is
pretty difficult to model from a benchmark targeting just the scheduler,
so instead of making a much bigger simulation of the workload, I made
preemption more expensive inside of schbench.  In terms of performance,
he found:

EEVDF < CFS < CFS shared wake queue < sched_ext BPF

My runs with schbench match his percentage differences pretty closely.

The least complicated way I could find to penalize preemption is to use
a per-cpu spinlock around the matrix math.  This can be disabled with
-L/--no-locking.  The results map really well to our production
workloads, which don't use spinlocks, but do get hit with major page
faults when they lose the CPU in the middle of a request.
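
The shape of it is roughly the sketch below.  The names, MAX_CPUS, and
the init helper are mine rather than schbench's, it reuses the
matrix-math helper from the sketch above, and with -L the lock/unlock
pair is simply skipped:

/*
 * Penalizing preemption with a per-cpu spinlock: each worker grabs the
 * lock for the CPU it is currently running on, does the matrix math,
 * and drops it.  If a worker loses the CPU while holding the lock,
 * everyone else who lands on that CPU spins behind it, which makes
 * involuntary context switches expensive.  Assumes < MAX_CPUS CPUs.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

#define MAX_CPUS 1024

static pthread_spinlock_t cpu_locks[MAX_CPUS];

static void init_cpu_locks(void)
{
	for (int i = 0; i < MAX_CPUS; i++)
		pthread_spin_init(&cpu_locks[i], PTHREAD_PROCESS_PRIVATE);
}

static void locked_request_work(unsigned long loops)
{
	int cpu = sched_getcpu();

	pthread_spin_lock(&cpu_locks[cpu]);
	one_request_worth_of_work(loops);	/* matrix math sketch above */
	pthread_spin_unlock(&cpu_locks[cpu]);
}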

David has more schbench examples for his presentation at OSPM, but
here's some annotated output:

schbench -F128 -n 10
Wakeup Latencies percentiles (usec) runtime 90 (s) (370488 total samples)
          50.0th: 9          (69381 samples)
          90.0th: 24         (134753 samples)
        * 99.0th: 1266       (32796 samples)
          99.9th: 4712       (3322 samples)
          min=1, max=12449

This is basically the important part of the original schbench.  It's the
time from when a worker thread is woken to when it starts running.
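
Conceptually the measurement is just a timestamp taken on both sides of
the wakeup.  A sketch (the struct, field, and function names here are
illustrative, and the actual wait/wake mechanism is elided):

#include <time.h>

struct worker {
	struct timespec woken_at;	/* stamped by the thread doing the wake */
	/* ... wait/wake state elided ... */
};

static void wake_worker(struct worker *w)
{
	/* stamp the time just before actually waking the worker */
	clock_gettime(CLOCK_MONOTONIC, &w->woken_at);
	/* futex/condvar wake of the worker thread goes here */
}

/* first thing the worker does once it is running again */
static unsigned long long wakeup_latency_usec(struct worker *w)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - w->woken_at.tv_sec) * 1000000ULL +
	       (now.tv_nsec - w->woken_at.tv_nsec) / 1000;
}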

Request Latencies percentiles (usec) runtime 90 (s) (370983 total samples)
          50.0th: 11440      (103738 samples)
          90.0th: 12496      (120020 samples)
        * 99.0th: 22304      (32498 samples)
          99.9th: 26336      (3308 samples)
          min=5818, max=57747

RPS percentiles (requests) runtime 90 (s) (9 total samples)
          20.0th: 4312       (3 samples)
        * 50.0th: 4376       (3 samples)
          90.0th: 4440       (3 samples)
          min=4290, max=4446

Request latency and RPS are both new.  The original schbench had
requests, but they were based on wall clock spinning instead of a fixed
amount of CPU work.  The new requests include two small usleep() calls
and the matrix math in their timing.
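
So the request latency clock starts before the first sleep and stops
after the second one, and picks up any time lost to preemption in
between.  A sketch of one request (sleep lengths and helper names are
illustrative, not schbench's defaults):

#include <time.h>
#include <unistd.h>

/* returns the request latency in usec; reuses the matrix math sketch above */
static unsigned long long do_one_request(unsigned long loops)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);

	usleep(100);				/* first small sleep */
	one_request_worth_of_work(loops);	/* the matrix math */
	usleep(100);				/* second small sleep */

	clock_gettime(CLOCK_MONOTONIC, &end);
	return (end.tv_sec - start.tv_sec) * 1000000ULL +
	       (end.tv_nsec - start.tv_nsec) / 1000;
}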

Generally for production the 99th percentile latencies are most
important.  For RPS, I watch the 20th and 50th percentiles more.  The
readme linked above talks through the command line options and how to
pick good numbers.
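
For reference, the percentile lines above just report the sample at a
given rank.  A generic sketch of that definition (schbench computes
these internally and may aggregate samples differently, but the meaning
of the reported numbers is the same):

#include <stdlib.h>

static int cmp_ull(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;

	return (x > y) - (x < y);
}

/* p is the percentile, e.g. 99.0; samples is sorted in place */
static unsigned long long percentile(unsigned long long *samples,
				     size_t nr, double p)
{
	qsort(samples, nr, sizeof(*samples), cmp_ull);
	return samples[(size_t)(p / 100.0 * (nr - 1))];
}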

I did some runs with different parameters comparing Linus git and EEVDF:

Comparing EEVDF (8c59a975d5ee) With Linus 6.3-rc6ish (a7a55e27ad72)

schbench -F128 -N <val> with and without -L
Single socket Intel cooperlake CPUs, turbo disabled

F128 N1                 EEVDF    Linus
Wakeup  (usec): 99.0th: 355      555
Request (usec): 99.0th: 2,620    1,906
RPS    (count): 50.0th: 37,696   41,664

F128 N1 no-locking      EEVDF    Linus
Wakeup  (usec): 99.0th: 295      545
Request (usec): 99.0th: 1,890    1,758
RPS    (count): 50.0th: 37,824   41,920

F128 N10                EEVDF    Linus
Wakeup  (usec): 99.0th: 755      1,266
Request (usec): 99.0th: 25,632   22,304
RPS    (count): 50.0th: 4,280    4,376

F128 N10 no-locking     EEVDF    Linus
Wakeup  (usec): 99.0th: 823      1,118
Request (usec): 99.0th: 17,184   14,192
RPS    (count): 50.0th: 4,440    4,456

F128 N20                EEVDF    Linus
Wakeup  (usec): 99.0th: 901      1,806
Request (usec): 99.0th: 51,136   46,016
RPS    (count): 50.0th: 2,132    2,196

F128 N20 no-locking     EEVDF    Linus
Wakeup  (usec): 99.0th: 905      1,902
Request (usec): 99.0th: 32,832   30,496
RPS    (count): 50.0th: 2,212    2,212

In general this shows us that EEVDF is a huge improvement on wakeup
latency, but we pay for it with preemptions during the request itself.
Diving into the F128 N10 no-locking numbers:

F128 N10 no-locking     EEVDF    Linus
Wakeup  (usec): 99.0th: 823      1,118
Request (usec): 99.0th: 17,184   14,192
RPS    (count): 50.0th: 4,440    4,456

EEVDF is very close in terms of RPS.  The p99 request latency shows the
preemptions pretty well, but the p50 request latency numbers have EEVDF
winning slightly (11,376 usec eevdf vs 11,408 usec on -linus).

-chris
