Message-ID: <BYAPR05MB483975D437F293A40BEF3189A6EC9@BYAPR05MB4839.namprd05.prod.outlook.com>
Date:   Fri, 30 Jul 2021 12:27:26 +0000
From:   Abdul Anshad Azeez <aazees@...are.com>
To:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     "peterz@...radead.org" <peterz@...radead.org>,
        "valentin.schneider@....com" <valentin.schneider@....com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        Rajender M <manir@...are.com>,
        Rahul Gopakumar <gopakumarr@...are.com>
Subject: [Linux Kernel 5.13 GA] ESXi Performance regression

As part of VMware's performance regression testing for Linux kernel
upstream releases, we evaluated the performance of Linux kernel 5.13
against the 5.12 release. Our evaluation revealed performance
regressions of up to 3x in ESXi Compute workloads and up to 40% in
ESXi Networking workloads.

After bisecting between kernel 5.13 and 5.12, we identified the root
cause as a scheduler-related commit from Peter Zijlstra:
8a99b6833c884fa0e7919030d93fecedc69fc625 ("sched: Move SCHED_DEBUG
sysctl to debugfs"). The issue appears to have arisen because this
commit changed the default value of "sched_wakeup_granularity_ns";
more details are below.

Impacted test case details:

1. Compute:
- VM Config - RHEL 8.1 - 1VM with 8vCPU & 16G Memory
- Benchmark - kernel compile
- Measures time taken to compile Linux kernel source code (Linux
kernel version used - 4.9.24)
- make -j <2x vCPU count> - this uses all available CPU threads to
achieve 100% CPU utilization (sketched below)
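For reference, the compile benchmark amounts to something like the
following (a sketch; the config target and the use of nproc are our
assumptions, not part of the formal test description):

  # build a 4.9.24 tree with 2x as many jobs as vCPUs
  cd linux-4.9.24
  make defconfig
  time make -j"$((2 * $(nproc)))"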

2. Networking:
- VM Config - RHEL 8.1 - 1VM with 8vCPU & 16G Memory and 8VM with
4vCPU & 8G Memory
- Benchmark - Netperf
- Netperf TCP_STREAM RECV small (8K socket & 256B message,
TCP_NODELAY set) packets - Throughput (1VM)
- Netperf UDP_STREAM RECV (256K socket & 256B message) - Packet rate
(8VM); both are sketched below
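Roughly, these map to netperf invocations along the following lines
(a sketch; $VM_IP is a placeholder, and the exact option spellings
should be checked against the netperf version in use):

  # TCP_STREAM: 8K socket buffers, 256B messages, TCP_NODELAY (-D)
  netperf -H $VM_IP -t TCP_STREAM -- -s 8192 -S 8192 -m 256 -D
  # UDP_STREAM: 256K socket buffers, 256B messages
  netperf -H $VM_IP -t UDP_STREAM -- -s 262144 -S 262144 -m 256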

From our testing, the overall results indicate that the
above-mentioned commit introduced a performance regression in the
kernel compile workload for the Compute area; in Networking, the
test cases with high packet rates were impacted.

Peter Zijlstra's commit moves the scheduler tunables to the debugfs
filesystem. On taking a closer look, the values of two such tunables
differ before and after the above-mentioned commit.
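For reference, the same tunable can be read from its old and new
locations roughly as follows (assuming debugfs is mounted at
/sys/kernel/debug; the file names may vary by kernel version):

  # pre-5.13: sysctl under /proc/sys/kernel
  cat /proc/sys/kernel/sched_wakeup_granularity_ns
  # 5.13+: debugfs files created by the commit
  cat /sys/kernel/debug/sched/wakeup_granularity_ns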

1. Before:
sched_min_granularity_ns    - 10000000 (10ms)
sched_wakeup_granularity_ns - 15000000 (15ms)

2. After:
sched_min_granularity_ns    - 3000000 (3ms)
sched_wakeup_granularity_ns - 4000000 (4ms)

With further experiments, we confirmed that the value of
"sched_wakeup_granularity_ns" is what drives these performance
regressions: on setting "sched_wakeup_granularity_ns" back to
"15000000" on top of Peter Zijlstra's commit, we are able to regain
the lost performance in our Compute & Networking workloads.
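At runtime the equivalent change can be applied through debugfs,
e.g. (assuming the debugfs path noted above):

  # restore the pre-5.13 default of 15ms
  echo 15000000 > /sys/kernel/debug/sched/wakeup_granularity_ns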

Further, we also collected guest scheduling stats (during the kernel
compile workload) and observed more involuntary switches forced by
the scheduler when "sched_wakeup_granularity_ns" is set to
"4000000".

1. "sched_wakeup_granularity_ns = 4000000" (3 iterations):
nr_involuntary_switches : 3
nr_involuntary_switches : 2
nr_involuntary_switches : 2

2. "sched_wakeup_granularity_ns = 15000000" (3 iterations):
nr_involuntary_switches : 0
nr_involuntary_switches : 0
nr_involuntary_switches : 0
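The stats above come from the per-task scheduler statistics; one way
to read them inside the guest (assuming CONFIG_SCHED_DEBUG and that
$PID is the task of interest):

  grep nr_involuntary_switches /proc/$PID/sched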

So, we believe that decreasing the value of
"sched_wakeup_granularity_ns" causes the running processes to be
preempted more often, and this impacts the CPU-bound tasks - the
kernel compile & netperf high packet rate workloads. This is
consistent with the CFS wakeup-preemption check
(wakeup_preempt_entity() in kernel/sched/fair.c), where a waking
task preempts the current task only when its vruntime lead exceeds
the wakeup granularity, so a smaller granularity permits preemption
more often.

Also, since the Linux 5.14-rc3 kernel was recently released, we
repeated the same experiments on 5.14-rc3 and observed the same
regressions in both areas (Compute & Networking).

We wanted to understand the reason behind the change in the default
values of the above two scheduler tunables. Since changing
"sched_wakeup_granularity_ns" from 15ms to 4ms forces more
involuntary switches, which in turn introduces the performance
regression, can it be changed back to 15ms?

Abdul Anshad Azeez
Performance Engineering
VMware, Inc.
