Message-ID: <20240227180619.7e908ac4@kernel.org>
Date: Tue, 27 Feb 2024 18:06:19 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Jiri Pirko <jiri@...nulli.us>
Cc: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>, Greg Kroah-Hartman
<gregkh@...uxfoundation.org>, Tariq Toukan <ttoukan.linux@...il.com>, Saeed
Mahameed <saeed@...nel.org>, "David S. Miller" <davem@...emloft.net>, Paolo
Abeni <pabeni@...hat.com>, Eric Dumazet <edumazet@...gle.com>, Saeed
Mahameed <saeedm@...dia.com>, netdev@...r.kernel.org, Tariq Toukan
<tariqt@...dia.com>, Gal Pressman <gal@...dia.com>, Leon Romanovsky
<leonro@...dia.com>, jay.vosburgh@...onical.com
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description
for multi-pf netdev
On Fri, 23 Feb 2024 10:36:25 +0100 Jiri Pirko wrote:
> >> It's really a special type of bonding of two netdevs. Like you'd bond
> >> two ports to get twice the bandwidth. With the twist that the balancing
> >> is done on NUMA proximity, rather than traffic hash.
> >>
> >> Well, plus, the major twist that it's all done magically "for you"
> >> in the vendor driver, and the two "lower" devices are not visible.
> >> You only see the resulting bond.
> >>
> >> I personally think that the magic hides as many problems as it
> >> introduces and we'd be better off creating two separate netdevs.
> >> And then a new type of "device bond" on top. Small win that
> >> the "new device bond on top" can be shared code across vendors.
> >
> >Yes. We have been exploring a small extension to the bonding driver to enable a
> >single NUMA-aware multi-threaded application to efficiently utilize multiple
> >NICs across NUMA nodes.
>
> Bonding was my immediate response when we discussed this internally for
> the first time. But I had to eventually admit it is probably not that
> suitable in this case, here's why:
> 1) there are not 2 physical ports, only one.
Right, sorry, the number of PFs matches the number of ports for each bus.
But it's not necessarily a deal breaker - it's similar to a multi-host
device. There we also have multiple netdevs and PCIe links, they just go
to different hosts rather than different NUMA nodes on one host.
> 2) it is basically a matter of device layout/provisioning whether this
> feature should be enabled, not user configuration.
We can still auto-instantiate it, not a deal breaker.
I'm not sure you're right in that assumption, tho. At Meta, we support
container sizes ranging from a few CPUs to multiple NUMA nodes. Each NUMA
node may have its own NIC, and the orchestration needs to stitch and
un-stitch NICs depending on whether the cores were allocated to small
containers or a huge one.
So it would be _easier_ to deal with multiple netdevs. The orchestration
layer already understands the netdev <> NUMA mapping; it does not understand
multi-NUMA netdevs, or how to match up queues to nodes.
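FWIW, a rough userspace sketch of the mapping I mean - plain sysfs
(/sys/class/net/<dev>/device/numa_node), nothing from this series, and
the helper names are just made up for illustration:

# Rough sketch: group netdevs by the NUMA node of their PCI parent,
# using the standard sysfs attribute. Helper names are illustrative only.
import os

SYSFS_NET = "/sys/class/net"

def netdev_numa_node(dev):
    """NUMA node of the netdev's parent device, or -1 if unknown/virtual."""
    try:
        with open(f"{SYSFS_NET}/{dev}/device/numa_node") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1  # virtual device, or no PCI parent

def netdevs_by_node():
    """Map NUMA node -> list of netdevs attached closest to it."""
    nodes = {}
    for dev in sorted(os.listdir(SYSFS_NET)):
        nodes.setdefault(netdev_numa_node(dev), []).append(dev)
    return nodes

if __name__ == "__main__":
    for node, devs in sorted(netdevs_by_node().items()):
        print(f"node {node}: {', '.join(devs)}")

That per-netdev node is all the current tooling needs; with a single
multi-NUMA netdev we'd have to grow a per-queue equivalent of it.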
> 3) other subsystems like RDMA would benefit from the same feature, so this
> is not netdev specific in general.
Yes, looks RDMA-centric. RDMA being infamously bonding-challenged.
Anyway, back to the initial question - from Greg's reply I'm guessing
there's no precedent for doing such things in the device model either.
So we're on our own.