Message-ID: <ZdhoBOKc40DeVCfG@nanopsycho>
Date: Fri, 23 Feb 2024 10:40:20 +0100
From: Jiri Pirko <jiri@...nulli.us>
To: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
Cc: Jay Vosburgh <jay.vosburgh@...onical.com>,
Jakub Kicinski <kuba@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Tariq Toukan <ttoukan.linux@...il.com>,
Saeed Mahameed <saeed@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Paolo Abeni <pabeni@...hat.com>, Eric Dumazet <edumazet@...gle.com>,
Saeed Mahameed <saeedm@...dia.com>, netdev@...r.kernel.org,
Tariq Toukan <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>,
Leon Romanovsky <leonro@...dia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description
for multi-pf netdev
Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@...el.com wrote:
>
>
>On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>> Samudrala, Sridhar <sridhar.samudrala@...el.com> wrote:
>> > On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> > > On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>> > > > On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>> > > > > Greg, we have a feature here where a single device of class net has
>> > > > > multiple "bus parents". We used to have one attr under class net
>> > > > > (device) which is a link to the bus parent. Now we either need to add
>> > > > > more or not bother with the linking of the whole device. Is there any
>> > > > > precedent / preference for solving this from the device model
>> > > > > perspective?
>> > > >
>> > > > How, logically, can a netdevice be controlled properly from 2 parent
>> > > > devices on two different busses? How is that even possible from a
>> > > > physical point-of-view? What exact bus types are involved here?
>> > > Two PCIe buses, two endpoints, two networking ports. It's one piece
>> >
>> > Isn't it only 1 networking port with multiple PFs?
>> >
>> > > of silicon, tho, so the "slices" can talk to each other internally.
>> > > The NVRAM configuration tells both endpoints that the user wants
>> > > them "bonded", when the PCI drivers probe they "find each other"
>> > > using some cookie or DSN or whatnot. And once they did, they spawn
>> > > a single netdev.
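
For illustration only, the "find each other via a shared DSN, then spawn a
single netdev" step could be sketched roughly as below. This is not the mlx5
implementation; the mpf_* names and mpf_register_single_netdev() are invented
for this sketch.

#include <linux/pci.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct mpf_pair {
	struct list_head list;
	u64 dsn;		/* shared cookie, e.g. the PCI Device Serial Number */
	struct pci_dev *pf[2];
	int npfs;
};

static LIST_HEAD(mpf_pairs);
static DEFINE_MUTEX(mpf_lock);

/* Hypothetical helper: registers one combined netdev on top of both PFs. */
void mpf_register_single_netdev(struct mpf_pair *pair);

static int mpf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	u64 dsn = pci_get_dsn(pdev);	/* same value on both endpoints */
	struct mpf_pair *pair;

	mutex_lock(&mpf_lock);
	list_for_each_entry(pair, &mpf_pairs, list)
		if (pair->dsn == dsn)
			goto found;

	/* First PF of the pair to probe: remember it and wait for the peer. */
	pair = kzalloc(sizeof(*pair), GFP_KERNEL);
	if (!pair) {
		mutex_unlock(&mpf_lock);
		return -ENOMEM;
	}
	pair->dsn = dsn;
	list_add(&pair->list, &mpf_pairs);
found:
	pair->pf[pair->npfs++] = pdev;
	if (pair->npfs == 2)
		mpf_register_single_netdev(pair);	/* both halves present */
	mutex_unlock(&mpf_lock);
	return 0;
}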
>> > >
>> > > > This "shouldn't" be possible as in the end, it's usually a PCI device
>> > > > handling this all, right?
>> > > It's really a special type of bonding of two netdevs. Like you'd bond
>> > > two ports to get twice the bandwidth. With the twist that the balancing
>> > > is done on NUMA proximity, rather than traffic hash.
>> > > Well, plus, the major twist that it's all done magically "for you"
>> > > in the vendor driver, and the two "lower" devices are not visible.
>> > > You only see the resulting bond.
>> > > I personally think that the magic hides as many problems as it
>> > > introduces and we'd be better off creating two separate netdevs.
>> > > And then a new type of "device bond" on top. Small win that
>> > > the "new device bond on top" can be shared code across vendors.
>> >
>> > Yes. We have been exploring a small extension to the bonding driver to
>> > enable a single NUMA-aware multi-threaded application to efficiently
>> > utilize multiple NICs across NUMA nodes.
>>
>> Is this referring to something like the multi-pf under
>> discussion, or just generically with two arbitrary network devices
>> installed one each per NUMA node?
>
>Normal network devices, one per NUMA node.
>
>>
>> > Here is an early version of a patch we have been trying that seems to be
>> > working well.
>> >
>> > =========================================================================
>> > bonding: select tx device based on rx device of a flow
>> >
>> > If napi_id is cached in the sk associated with skb, use the
>> > device associated with napi_id as the transmit device.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
>> >
>> > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> > index 7a7d584f378a..77e3bf6c4502 100644
>> > --- a/drivers/net/bonding/bond_main.c
>> > +++ b/drivers/net/bonding/bond_main.c
>> > @@ -5146,6 +5146,30 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>> > unsigned int count;
>> > u32 hash;
>> >
>> > + if (skb->sk) {
>> > + int napi_id = skb->sk->sk_napi_id;
>> > + struct net_device *dev;
>> > + int idx;
>> > +
>> > + rcu_read_lock();
>> > + dev = dev_get_by_napi_id(napi_id);
>> > + rcu_read_unlock();
>> > +
>> > + if (!dev)
>> > + goto hash;
>> > +
>> > + count = slaves ? READ_ONCE(slaves->count) : 0;
>> > + if (unlikely(!count))
>> > + return NULL;
>> > +
>> > + for (idx = 0; idx < count; idx++) {
>> > + slave = slaves->arr[idx];
>> > + if (slave->dev->ifindex == dev->ifindex)
>> > + return slave;
>> > + }
>> > + }
>> > +
>> > +hash:
>> > hash = bond_xmit_hash(bond, skb);
>> > count = slaves ? READ_ONCE(slaves->count) : 0;
>> > if (unlikely(!count))
>> > =========================================================================
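
As a side note, the lookup in the patch depends on sk->sk_napi_id being
populated on the receive path, which assumes CONFIG_NET_RX_BUSY_POLL is
enabled. An application can read the same value for one of its flows via
SO_INCOMING_NAPI_ID; a minimal userspace sketch:

#include <stdio.h>
#include <sys/socket.h>

/* Print the napi id recorded for a connected or accepted socket. */
static void print_flow_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) == 0)
		printf("flow serviced by napi id %u\n", napi_id);
}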
>> >
>> > If we make this a configurable bonding option, would it be an
>> > acceptable solution to accelerate NUMA-aware apps?
>>
>> Assuming for the moment this is for "regular" network devices
>> installed one per NUMA node, why do this in bonding instead of at a
>> higher layer (multiple subnets or ECMP, for example)?
>>
>> Is the intent here that the bond would aggregate its interfaces
>> via LACP with the peer being some kind of cross-chassis link aggregation
>> (MLAG, et al)?
No.
>
>Yes, a basic LACP bonding setup. There could be multiple peers connecting to
>the server via a switch providing LACP-based link aggregation. No cross-chassis
>MLAG.
LACP does not make any sense when you have only a single physical port.
I believe that applies to the ECMP suggestion above as well.