Message-ID: <277737d6034b3da072d3b0b808d2fa6e110038b0.camel@suse.com>
Date:   Tue, 29 Oct 2019 10:11:20 +0100
From:   Dario Faggioli <dfaggioli@...e.com>
To:     Aaron Lu <aaron.lu@...ux.alibaba.com>,
        Aubrey Li <aubrey.intel@...il.com>
Cc:     Tim Chen <tim.c.chen@...ux.intel.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        "Li, Aubrey" <aubrey.li@...ux.intel.com>,
        Subhra Mazumdar <subhra.mazumdar@...cle.com>,
        Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Frédéric Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Dario Faggioli <dfaggioli@...e.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Sun, 2019-09-15 at 22:14 +0800, Aaron Lu wrote:
> I'm using the following branch as base which is v5.1.5 based:
> https://github.com/digitalocean/linux-coresched coresched-v3-v5.1.5-
> test
> 
> And I have pushed Tim's branch to:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-tim
> 
> Mine:
> https://github.com/aaronlu/linux coresched-v3-v5.1.5-test-
> core_vruntime
> 
Hello,

As anticipated, I've been trying to follow the development of this
feature and, in the meantime, I have done some benchmarks.

I actually have a lot of data (and am planning for more), so I am
sending a few emails, each one with a subset of the numbers in it,
instead of just one, which would be beyond giant! :-)

I'll put, in this first one, some background and some common
information, e.g., about the benchmarking platform and configurations,
and on how to read and interpret the data that will follow.

It's quite hard to come up with a concise summary, and sometimes it's
even tricky to identify consolidated trends. There are also things that
look weird and, although I double checked my methodology, I can't
exclude that glitches or errors may have occurred. For each of the
benchmarks, I have at least some information about what the
configuration was when it was run, and also some monitoring and perf
data. So, if interested, just ask and we'll see what we can dig out.

And in any case, I have the procedure for running these benchmarks
fairly decently (although not completely) automated. So if we see
things that look really weird, I can rerun (perhaps with a different
configuration, more monitoring, etc.).

For each benchmark, I'll "dump" the results, with just some comments
about the things that I find more relevant/interesting. Then, if we
want, we can look at them and analyze them together.
For each experiment, I do have some limited amount of tracing and
debugging information still available, in case it could be useful. And,
as said, I can always rerun.

I can also provide, quite easily, different-looking tables, e.g.,
different sets of columns, different baselines, etc. Just ask for
whatever you think would be the most interesting to see and, most
likely, it will be possible to do it.

Oh, and I'll upload text files whose contents will be identical to the
emails in this space:

  http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/

In case tables are rendered better in a browser than in a MUA.

Thanks and Regards,
Dario
---

Code: 
 1) Linux 5.1.5 (commit id 835365932f0dc25468840753e071c05ad6abc76f)
 2) https://github.com/digitalocean/linux-coresched/tree/vpillai/coresched-v3-v5.1.5-test
 3) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-core_vruntime
 4) https://github.com/aaronlu/linux/tree/coresched-v3-v5.1.5-test-tim

Benchmarking suite:
 - MMTests: https://github.com/gormanm/mmtests
 - Tweaked to deal with running benchmarks in VMs. Still working on
   upstreaming that to Mel (a WIP is available here:
   https://github.com/dfaggioli/mmtests/tree/bench-virt )

Benchmarking host:
 - CPU: 1 socket, 4 cores, 2 threads
 - RAM: 32 GB
 - distro: openSUSE Tumbleweed
 - HW Bugs Mitigations: fully disabled
 - Filesys: XFS

VMs:
 - vCPUs: either 8 or 4
 - distro: openSUSE Tumbleweed
 - Kernel: 5.1.16
 - HW Bugs Mitigations: fully disabled
 - Filesys: XFS

Benchmarks:
- STREAM         : pure memory benchmark (various kinds of mem-ops done
                   in parallel). Parallelism is NR_CPUS/2 tasks
- Kernbench      : builds a kernel, with varying number of compile
                   jobs. HT is, in general, known to help, as it lets
                   us do "more parallel" builds
- Hackbench      : communication (via pipes, in this case) between
                   groups of processes. As we deal with _groups_ of
                   tasks, we're already in saturation with 1 group,
                   hence we expect HyperThreading disabled
                   configurations to suffer
- mutilate       : load generator for memcached, with high request
                   rate;
- netperf-unix   : two communicating tasks. Without any pinning
                   (neither at the host nor at the guest level), we
                   expect HT to play a role. In fact, depending on
                   where the two tasks are scheduled (i.e., whether on
                   two threads of the same core, or not), performance
                   may vary
- sysbenchcpu    : the process-based CPU stressing workload of sysbench
- sysbenchthread : the thread-based CPU stressing workload of sysbench
- sysbench       : the database workload

This is kind of a legend for the columns you will see in the tables.

- v-*   : vanilla, i.e., benchmarks were run on code _without_ any
          core-scheduling patch applied (see 1 in 'Code' section above)
- *BM-* : baremetal, i.e., benchmarks were run on the host, without 
          any VM running or anything
- *VM-* : Virtual Machine, i.e., benchmarks were run inside a VM, with
          the following characteristics:
   - *VM-   : benchmarks were run in a VM with 8 vCPUs. That was the
              only VM running in the system
   - *VM-v4 : benchmarks were run in a VM with 4 vCPUs. That was the
              only VM running in the system
   - *VMx2  : benchmarks were run in a VM with 8 vCPUs, and there was
              another VM running, also with 8 vCPUs, generating CPU,
              memory and IO stress load of about 600%
- *-csc-*          : benchmarks were run with Core scheduling v3 patch
                     series (see 2 in 'Code' section above)
- *-csc_stallfix-* : benchmarks were run with Core scheduling v3 and
                     the 'stallfix' feature enabled
- *-csc_vruntime-* : benchmarks were run with Core scheduling v3 + the
                     vruntime patches (see 3 in 'Code' section above)
- *-csc_tim-*      : benchmarks were run with Core scheduling v3 +
                     Tim's patches (see 4 in 'Code' section above)
- *-noHT           : benchmarks were run with HyperThreading Disabled
- *-HT             : benchmarks were run with Hyperthreading enabled

So, for instance, the column BM-noHT shows data from a run done on
baremetal, with HyperThreading disabled. The column v-VM-HT shows data
from a run done in an 8 vCPUs VM, with HyperThreading enabled, and no
core-scheduling patches applied. The column VM-csc_vruntime-HT shows
data from a run done in an 8 vCPUs VM with core-scheduling v3 patches +
the vruntime patches applied. The column VM-v4-HT shows data from a run
done in a 4 vCPUs VM, where core-scheduling patches were applied but
not used (the vCPUs of the VM weren't tagged). The column
VMx2-csc_vruntime-HT shows data from a run done in an 8 vCPUs VM, where
core-scheduling v3 + the vruntime patches were applied and the vCPUs of
the VM tagged, while there was another (untagged) VM in the system,
trying to introduce ~600% load (CPU, memory and IO, via stress-ng). Etc.

See the 'Appendix' at the bottom of this email for a comprehensive
list of all the combinations (or, at least, I think it is
comprehensive... I hope I haven't missed any :-) ).

In all tables, percent increases and decreases are always relative to
the first column. Whether lower or higher values are better is already
taken into account.
Basically, when we see -x.yz%, it always means performance is worse
than the baseline, and the absolute value of that (i.e., x.yz) tells
you by how much.
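
Just to make the sign convention concrete, here is a minimal, purely
illustrative Python sketch (this is *not* how MMTests produces the
reports; the sample numbers are the 'MB/sec copy' values from the
first table below):

  # Negative always means "worse than the baseline", whatever the metric.
  def percent_vs_baseline(value, baseline, higher_is_better=True):
      delta = (value - baseline) / baseline * 100.0
      return delta if higher_is_better else -delta

  # STREAM reports MB/sec, so higher is better:
  print(percent_vs_baseline(33654.32, 33827.50))  # ~ -0.51 (v-BM-noHT vs. v-BM-HT, copy)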

If, for instance, we want to compare HT and non-HT, on baremetal, we
check the BM-HT and BM-noHT columns.
If we want to compare v3 + vruntime patches against no HyperThreading,
when the system is overloaded, we look at the VMx2-noHT and VMx2-
csc_vruntime-HT columns and check by how much they deviate from the
baseline (i.e., which one regresses more). For comparing the various
core scheduling solutions, we can check by how much each one is either
better or worse than the baseline. And so on...

The most relevant comparisons, IMO, are:
- the various core scheduling solutions against their respective HT
baseline. This, in fact, tells us what people will experience if they
start using core scheduling on these workloads
- the various core scheduling solutions against their respective noHT
baseline. This, in fact, tells us whether or not core scheduling is
effective, for the given workload, or if it would just be better to
disable HyperThreading
- the overhead introduced by the core scheduling patches, when they are
not used (i.e., v-BM-HT against BM-HT, or v-VM-HT against VM-HT). This,
in fact, tells us what happens to *everyone*, including the ones that
do not want core scheduling and will keep it disabled, if we merge it

Note that the overhead, so far, has been evaluated only for the -csc
case, i.e., when patches from point 2 in 'Code' above are applied, but
tasks/vCPUs are not tagged, and hence core scheduling is not really
used.

Anyway, let's get to the point where I give you some data already! :-D
:-D

STREAM
======

http://xenbits.xen.org/people/dariof/benchmarks/results/linux/core-sched/mmtests/boxes/wayrath/coresched-email-1_stream.txt

                                  v                      v                     BM                     BM                     BM                     BM                     BM                     BM
                              BM-HT                BM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
MB/sec copy     33827.50 (   0.00%)    33654.32 (  -0.51%)    33683.34 (  -0.43%)    33819.30 (  -0.02%)    33830.88 (   0.01%)    33731.02 (  -0.29%)    33573.76 (  -0.75%)    33292.76 (  -1.58%)
MB/sec scale    22762.02 (   0.00%)    22524.00 (  -1.05%)    22416.54 (  -1.52%)    22444.16 (  -1.40%)    22652.56 (  -0.48%)    22462.80 (  -1.31%)    22461.90 (  -1.32%)    22670.84 (  -0.40%)
MB/sec add      26141.76 (   0.00%)    26241.42 (   0.38%)    26559.40 (   1.60%)    26365.36 (   0.86%)    26607.10 (   1.78%)    26384.50 (   0.93%)    26117.78 (  -0.09%)    26192.12 (   0.19%)
MB/sec triad    26522.46 (   0.00%)    26555.26 (   0.12%)    26499.62 (  -0.09%)    26373.26 (  -0.56%)    26667.32 (   0.55%)    26642.70 (   0.45%)    26505.38 (  -0.06%)    26409.60 (  -0.43%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                              VM-HT                VM-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
MB/sec copy     34559.32 (   0.00%)    34153.30 (  -1.17%)    34236.64 (  -0.93%)    33724.38 (  -2.42%)    33535.60 (  -2.96%)    33534.10 (  -2.97%)    33469.70 (  -3.15%)    33873.18 (  -1.99%)
MB/sec scale    22556.18 (   0.00%)    22834.88 (   1.24%)    22733.12 (   0.78%)    23010.46 (   2.01%)    22480.60 (  -0.34%)    22552.94 (  -0.01%)    22756.50 (   0.89%)    22434.96 (  -0.54%)
MB/sec add      26209.70 (   0.00%)    26640.08 (   1.64%)    26692.54 (   1.84%)    26747.40 (   2.05%)    26358.20 (   0.57%)    26353.50 (   0.55%)    26686.62 (   1.82%)    26256.50 (   0.18%)
MB/sec triad    26521.80 (   0.00%)    26490.26 (  -0.12%)    26598.66 (   0.29%)    26466.30 (  -0.21%)    26560.48 (   0.15%)    26496.30 (  -0.10%)    26609.10 (   0.33%)    26450.68 (  -0.27%)
                                  v                      v                     VM                     VM                     VM                     VM                     VM                     VM
                           VM-v4-HT             VM-v4-noHT                  v4-HT                v4-noHT              v4-csc-HT     v4-csc_stallfix-HT          v4-csc_tim-HT     v4-csc_vruntime-HT
MB/sec copy     32257.48 (   0.00%)    32504.18 (   0.76%)    32375.66 (   0.37%)    32261.98 (   0.01%)    31940.84 (  -0.98%)    32070.88 (  -0.58%)    31926.80 (  -1.03%)    31882.18 (  -1.16%)
MB/sec scale    19806.46 (   0.00%)    20281.18 (   2.40%)    20266.80 (   2.32%)    20075.46 (   1.36%)    19847.66 (   0.21%)    20119.00 (   1.58%)    19899.84 (   0.47%)    20060.48 (   1.28%)
MB/sec add      22178.58 (   0.00%)    22426.92 (   1.12%)    22185.54 (   0.03%)    22153.52 (  -0.11%)    21975.80 (  -0.91%)    22097.72 (  -0.36%)    21827.66 (  -1.58%)    22068.04 (  -0.50%)
MB/sec triad    22149.10 (   0.00%)    22200.54 (   0.23%)    22142.10 (  -0.03%)    21933.04 (  -0.98%)    21898.50 (  -1.13%)    22160.64 (   0.05%)    22003.40 (  -0.66%)    21951.16 (  -0.89%)
                                  v                      v                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2                   VMx2
                            VMx2-HT              VMx2-noHT                     HT                   noHT                 csc-HT        csc_stallfix-HT             csc_tim-HT        csc_vruntime-HT
MB/sec copy     33514.96 (   0.00%)    24740.70 ( -26.18%)    30410.96 (  -9.26%)    22157.24 ( -33.89%)    29552.60 ( -11.82%)    29374.78 ( -12.35%)    28717.38 ( -14.31%)    29143.88 ( -13.04%)
MB/sec scale    22605.74 (   0.00%)    15473.56 ( -31.55%)    19051.76 ( -15.72%)    15278.64 ( -32.41%)    19246.98 ( -14.86%)    19081.04 ( -15.59%)    18747.60 ( -17.07%)    18776.02 ( -16.94%)
MB/sec add      26249.56 (   0.00%)    18559.92 ( -29.29%)    21143.90 ( -19.45%)    18664.30 ( -28.90%)    21236.00 ( -19.10%)    21067.40 ( -19.74%)    20878.78 ( -20.46%)    21266.92 ( -18.98%)
MB/sec triad    26290.16 (   0.00%)    19274.10 ( -26.69%)    20573.62 ( -21.74%)    17631.52 ( -32.93%)    21066.94 ( -19.87%)    20975.04 ( -20.22%)    20944.56 ( -20.33%)    20942.18 ( -20.34%)

So, STREAM, at least in this configuration, is not really sensitive to
HyperThreading (as could have been expected). In fact, in most cases,
both when run on baremetal and in VMs, HT and noHT results are pretty
much the same. When core scheduling is used, things do not look bad at
all to me: results are, most of the time, only marginally worse.

Do check, however, the overloaded case. There, disabling HT has quite a
big impact, and core scheduling does a rather good job in restoring
good performance.

From the overhead point of view, the situation does not look too bad
either. In fact, in the first three groups of measurements, the
overhead introduced by having the core scheduling patches in is
acceptable (there are actually cases where they seem to do more good
than harm! :-P).
However, when the system is overloaded, despite there not being any
tagged task, numbers look pretty bad. It seems that, for instance, of
the 13.04% performance drop between v-VMx2-HT and VMx2-csc_vruntime-HT,
9.26% comes from overhead (as that much is already there in VMx2-HT)!!
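
FWIW, that split can be double checked directly from the 'MB/sec copy'
row of the VMx2 table above (variable names below are mine, values are
taken verbatim from the table):

  # Quick sanity check of the overhead split in the overloaded (VMx2) case.
  v_vmx2_ht        = 33514.96   # v-VMx2-HT: no coresched patches applied
  vmx2_ht          = 30410.96   # VMx2-HT: patches applied, nothing tagged
  vmx2_vruntime_ht = 29143.88   # VMx2-csc_vruntime-HT: patches applied and used

  overhead = (vmx2_ht - v_vmx2_ht) / v_vmx2_ht * 100.0            # ~ -9.26%
  total    = (vmx2_vruntime_ht - v_vmx2_ht) / v_vmx2_ht * 100.0   # ~ -13.04%
  print(overhead, total)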

Something to investigate better, I guess...


Appendix

* v-BM-HT      : no coresched patch applied, baremetal, HyperThreading enabled
* v-BM-noHT    : no coresched patch applied, baremetal, Hyperthreading disabled
* v-VM-HT      : no coresched patch applied, 8 vCPUs VM, HyperThreading enabled
* v-VM-noHT    : no coresched patch applied, 8 vCPUs VM, Hyperthreading disabled
* v-VM-v4-HT   : no coresched patch applied, 4 vCPUs VM, HyperThreading enabled
* v-VM-v4-noHT : no coresched patch applied, 4 vCPUs VM, Hyperthreading disabled
* v-VMx2-HT    : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* v-VMx2-noHT  : no coresched patch applied, 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled

* BM-HT              : baremetal, HyperThreading enabled
* BM-noHT            : baremetal, Hyperthreading disabled
* BM-csc-HT          : baremetal, coresched-v3 (Hyperthreading enabled, of course)
* BM-csc_stallfix-HT : baremetal, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* BM-csc_tim-HT      : baremetal, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* BM-csc_vruntime-HT : baremetal, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-HT              : 8 vCPUs VM, HyperThreading enabled
* VM-noHT            : 8 vCPUs VM, Hyperthreading disabled
* VM-csc-HT          : 8 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-csc_stallfix-HT : 8 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-csc_tim-HT      : 8 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-csc_vruntime-HT : 8 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VM-v4-HT              : 4 vCPUs VM, HyperThreading enabled
* VM-v4-noHT            : 4 vCPUs VM, Hyperthreading disabled
* VM-v4-csc-HT          : 4 vCPUs VM, coresched-v3 (Hyperthreading enabled, of course)
* VM-v4-csc_stallfix-HT : 4 vCPUs VM, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VM-v4-csc_tim-HT      : 4 vCPUs VM, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VM-v4-csc_vruntime-HT : 4 vCPUs VM, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

* VMx2-HT              : 8 vCPUs VM + 600% stress overhead, HyperThreading enabled
* VMx2-noHT            : 8 vCPUs VM + 600% stress overhead, Hyperthreading disabled
* VMx2-csc-HT          : 8 vCPUs VM + 600% stress overhead, coresched-v3 (Hyperthreading enabled, of course)
* VMx2-csc_stallfix-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + stallfix (Hyperthreading enabled, of course)
* VMx2-csc_tim-HT      : 8 vCPUs VM + 600% stress overhead, coresched-v3 + Tim's patches (Hyperthreading enabled, of course)
* VMx2-csc_vruntime-HT : 8 vCPUs VM + 600% stress overhead, coresched-v3 + vruntime patches (Hyperthreading enabled, of course)

-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

