Message-ID: <CACLgkEYHqmN-Kf93PNcWKJd5FKtwENxLOk7e7Z-VcxcDfXFdOA@mail.gmail.com>
Date: Fri, 17 May 2024 18:14:19 +0530
From: Krishna Kumar <krikku@...il.com>
To: netdev@...r.kernel.org
Cc: "Krishna Kumar (Engineering)" <krishna.ku@...pkart.com>, vasudeva.sk@...pkart.com
Subject: Large TCP_RR performance drop between 5.x and 6.x kernel?
Dear maintainers, developers, community,
We are using Debian (Bullseye) across multiple Flipkart data centers in
India, with many thousands of servers (running QEMU/KVM). Recently, while
testing an upgrade to Debian 12, we ran into a ~27% performance regression
in VMs running on the upgraded hypervisors. The VMs run Debian 12 in both
cases.
Hypervisor configuration:
- 144-core Intel Xeon Ice Lake (8352V), 750 GB memory, Intel E810 NIC; IRQ
affinity and XPS are set (see the sketch after this list).
- The system has two bridges: one for hypervisor connectivity and one for
the VM (one VF attached to each bridge). Exactly one VM is powered on,
connected to the second bridge.
- The kernel command line for both Debian 11 and Debian 12 is similar to
the following (except for the vmlinuz* version):
BOOT_IMAGE=<path-to-vmlinuz> root=UUID=<uuid> ro rootflags=<xyz>
loglevel=7 intel_iommu=on iommu=pt quiet crashkernel=512M
- The VM is pinned to CPUs 1-12 of the host; nothing else runs on the host:
NUMA node0 CPU(s): 0-35,72-107
NUMA node1 CPU(s): 36-71,108-143
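For completeness, the host-side IRQ/XPS and pinning setup looks roughly like
the sketch below (the interface name "ens1f0" and the domain name "vm1" are
placeholders, not our exact values):

  # Sketch only: "ens1f0" (host VF) and "vm1" (libvirt domain) are
  # placeholder names; 0x1ffe is the mask for host CPUs 1-12.
  IFACE=ens1f0

  # XPS: allow CPUs 1-12 to transmit on each TX queue.
  for txq in /sys/class/net/$IFACE/queues/tx-*; do
      echo 1ffe > $txq/xps_cpus
  done

  # IRQ affinity: round-robin the NIC interrupts over CPUs 1-12.
  cpu=1
  for irq in $(grep $IFACE /proc/interrupts | cut -d: -f1); do
      echo $cpu > /proc/irq/$irq/smp_affinity_list
      cpu=$((cpu % 12 + 1))
  done

  # Pin the 12 guest vCPUs 1:1 onto host CPUs 1-12.
  for v in $(seq 0 11); do
      virsh vcpupin vm1 $v $((v + 1))
  done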
VM configuration:
- 12 cores, 120 GB, MQ virtio-net (queue setup sketched below); runs on
NUMA socket #0 (no HT) of the host. Always runs Debian 12.
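Multiqueue is enabled inside the guest roughly as below ("eth0" is a
placeholder for the guest interface name):

  # Sketch: enable all 12 virtio-net queue pairs in the guest
  # ("eth0" is a placeholder for the guest interface name).
  ethtool -L eth0 combined 12
  ethtool -l eth0   # verify current/max channel counts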
We test VM performance on the same bare-metal server, first running
Debian 11 (5.10.216 kernel, QEMU 5.2) and then upgraded to Debian 12
(6.1.90 kernel, QEMU 7.2). The VM runs Debian 12 in both cases.
VM test: 32 TCP_RR processes for 60 seconds against another bare-metal server:
Debian 11: 440K transactions/sec
Debian 12: 320K transactions/sec (~27% performance drop)
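For reference, the test is driven roughly along these lines (a sketch
assuming netperf; the peer address is a placeholder, and the parsed field
position may vary with the netperf version):

  # Sketch: 32 parallel TCP_RR instances, 60 seconds each, against the
  # peer bare-metal server (10.0.0.2 is a placeholder address).
  PEER=10.0.0.2
  for i in $(seq 1 32); do
      # -P 0 suppresses headers; field 6 of the result line is the
      # transaction rate (position may vary by netperf version).
      netperf -P 0 -H $PEER -t TCP_RR -l 60 | awk 'NR==1 {print $6}' > rr.$i &
  done
  wait
  # Aggregate transactions/sec across all 32 processes.
  awk '{sum += $1} END {print sum}' rr.*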
Packets are well spread across the TX/RX queues of the guest virtio-net
device and of the VF on the physical host.
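We checked the spread roughly as follows ("eth0" and "ens1f0" are
placeholder device names):

  # In the guest: per-queue packet counters on the virtio-net device.
  ethtool -S eth0 | grep -E '(rx|tx)_queue_[0-9]+_packets'

  # On the host: interrupt distribution across the VF's queues.
  grep ens1f0 /proc/interrupts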
We suspect some setting is causing this degradation, but we have not been
able to figure out what it is (we make no changes to the system after
upgrading the distribution).
We ran the following tests, expecting the degradation to come from either
the newer kernel or the newer QEMU version:
Kernel     QEMU     Rate   Notes
6.0.2      7.2      320K   Downgrade kernel to lowest 6.x, same QEMU.
5.10.216   5.2/7.2  440K   Downgrade kernel to Debian 11, any QEMU.

From here on, do a manual "bisect" from 5.10.x towards 6.0.2, without
changing QEMU, to identify where it degraded:

5.19.11    7.2      310K   Highest 5.x kernel version.
5.16.18    7.2      330K   Go down a version.
5.15.15    7.2      443K   Good number, start going forward.
5.16.7     7.2      300K   Bad.
5.15.158   7.2      448K   Good; now go forward to find the closest version
                           that degrades.
5.16.1     7.2      340K   Degrades, go back a little.
5.16.0     7.2      336K   First version after 5.15.158; still degraded.
There is a huge set of changes between 5.15.158 and 5.16.0, so at this point
we are blocked: the last good kernel, 5.15.158, performs well at 448K, while
the next version, 5.16.0, drops to 336K.
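One option we still see is a proper git bisect of mainline between v5.15 and
v5.16 (5.15.158 being v5.15 plus stable backports), roughly:

  # Sketch: bisect mainline between the last good and first bad bases,
  # rebuilding the hypervisor kernel and re-running the 32-process
  # TCP_RR test at every step.
  git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  git bisect start
  git bisect bad v5.16      # ~336K trans/s
  git bisect good v5.15     # assuming the v5.15 base matches 5.15.158's 448K
  # build + install + reboot + test at each step, then mark:
  #   git bisect good   (rate stays ~448K)
  #   git bisect bad    (rate drops)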
Any help on what we could look at next to identify the reason for this drop,
or on whether a setting/config change is needed, would be much appreciated.
Thanks,
- KK