Message-ID: <10437.1674060181@famine>
Date: Wed, 18 Jan 2023 08:43:01 -0800
From: Jay Vosburgh <jay.vosburgh@...onical.com>
To: Ian Kumlien <ian.kumlien@...il.com>
cc: Linux Kernel Network Developers <netdev@...r.kernel.org>,
andy@...yhouse.net, vfalico@...il.com
Subject: Re: Expected performance w/ bonding
Ian Kumlien <ian.kumlien@...il.com> wrote:
>Hi,
>
>I was doing some tests with some of the bigger AMD machines, both
>using PCIE-4 Mellanox connectx-5 nics
>
>They have 2x100gbit links to the same switches (running in VLT - ie as
>"one"), but iperf3 seems to hit a limit at 27gbit max...
>(with 10 threads in parallel) but generally somewhere at 25gbit - so
>my question is if there is a limit at ~25gbit for bonding
>using 802.3ad and layer2+3 hashing.
>
>It's a little bit difficult to do proper measurements since most
>systems are in production - but i'm kinda running out of clues =)
>
>If anyone has any ideas, it would be very interesting to see if they would help.
If by "bigger AMD machines" you mean Rome or similar, then what
you may be seeing are the effects of (a) NUMA, and (b) the AMD CCX cache
architecture (in which a small-ish number of CPUs share an L3 cache).
We've seen similar effects on these systems with similar configurations,
particularly on versions with fewer CPUs per CCX (as I recall, one
variant has 4 per CCX).
For testing purposes, you can pin the iperf tasks to CPUs in the
same CCX as one another, and on the same NUMA node as the network
device. If your bond utilizes interfaces on separate NUMA nodes, there
may be additional randomness in the results, as data may or may not
cross a NUMA boundary depending on the flow hash. For testing, this can
be worked around by disabling one interface in the bond (i.e., a bond
with just one active interface), and ensuring the iperf tasks are pinned
to the correct NUMA node.
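	As a rough sketch of that pinning (interface name, NUMA node
number, and CPU IDs below are placeholders for illustration, not taken
from your systems):

```shell
# NUMA node the NIC sits on (here the interface name is assumed):
cat /sys/class/net/enp65s0f0/device/numa_node

# CPUs belonging to that node (suppose it reported node 1):
lscpu -p=CPU,NODE | awk -F, '$2 == 1 {print $1}'

# CPUs sharing one L3 cache (i.e., one CCX) with a given CPU:
cat /sys/devices/system/cpu/cpu16/cache/index3/shared_cpu_list

# Pin server and client to CPUs within that CCX, either with
# taskset or iperf3's own -A affinity option:
taskset -c 16-19 iperf3 -s
iperf3 -c <server> -P 10 -A 16
```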
There is a mechanism in bonding to do flow -> queue -> interface
assignments (described in Documentation/networking/bonding.rst), but
it's nontrivial, and still needs the processes to be resident on the
same NUMA node (and on the AMD systems, also within the same CCX
domain).
-J
---
-Jay Vosburgh, jay.vosburgh@...onical.com