Message-ID: <ef0270c5-3128-40d8-933e-f9dbeddf5961@intel.com>
Date: Fri, 23 Feb 2024 17:56:52 -0600
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Jiri Pirko <jiri@...nulli.us>
CC: Jay Vosburgh <jay.vosburgh@...onical.com>, Jakub Kicinski
	<kuba@...nel.org>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>, "Tariq
 Toukan" <ttoukan.linux@...il.com>, Saeed Mahameed <saeed@...nel.org>, "David
 S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>, "Eric
 Dumazet" <edumazet@...gle.com>, Saeed Mahameed <saeedm@...dia.com>,
	<netdev@...r.kernel.org>, Tariq Toukan <tariqt@...dia.com>, Gal Pressman
	<gal@...dia.com>, Leon Romanovsky <leonro@...dia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description
 for multi-pf netdev



On 2/23/2024 3:40 AM, Jiri Pirko wrote:
> Fri, Feb 23, 2024 at 06:00:40AM CET, sridhar.samudrala@...el.com wrote:
>>
>>
>> On 2/22/2024 8:05 PM, Jay Vosburgh wrote:
>>> Samudrala, Sridhar <sridhar.samudrala@...el.com> wrote:
>>>> On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>>>>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>>>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>>>>> Greg, we have a feature here where a single device of class net has
>>>>>>> multiple "bus parents". We used to have one attr under class net
>>>>>>> (device) which is a link to the bus parent. Now we either need to add
>>>>>>> more or not bother with the linking of the whole device. Is there any
>>>>>>> precedent / preference for solving this from the device model
>>>>>>> perspective?
>>>>>>
>>>>>> How, logically, can a netdevice be controlled properly from 2 parent
>>>>>> devices on two different busses?  How is that even possible from a
>>>>>> physical point-of-view?  What exact bus types are involved here?
>>>>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>>>>
>>>> Isn't it only 1 networking port with multiple PFs?
>>>>
>>>>> of silicon, tho, so the "slices" can talk to each other internally.
>>>>> The NVRAM configuration tells both endpoints that the user wants
>>>>> them "bonded", when the PCI drivers probe they "find each other"
>>>>> using some cookie or DSN or whatnot. And once they did, they spawn
>>>>> a single netdev.
>>>>>
>>>>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>>>>> handling this all, right?
>>>>> It's really a special type of bonding of two netdevs. Like you'd bond
>>>>> two ports to get twice the bandwidth. With the twist that the balancing
>>>>> is done on NUMA proximity, rather than traffic hash.
>>>>> Well, plus, the major twist that it's all done magically "for you"
>>>>> in the vendor driver, and the two "lower" devices are not visible.
>>>>> You only see the resulting bond.
>>>>> I personally think that the magic hides as many problems as it
>>>>> introduces and we'd be better off creating two separate netdevs.
>>>>> And then a new type of "device bond" on top. Small win that
>>>>> the "new device bond on top" can be shared code across vendors.
>>>>
>>>> Yes. We have been exploring a small extension to bonding driver to enable
>>>> a single numa-aware multi-threaded application to efficiently utilize
>>>> multiple NICs across numa nodes.
>>>
>>> 	Is this referring to something like the multi-pf under
>>> discussion, or just generically with two arbitrary network devices
>>> installed one each per NUMA node?
>>
>> Normal network devices one per NUMA node
>>
>>>
>>>> Here is an early version of a patch we have been trying and seems to be
>>>> working well.
>>>>
>>>> =========================================================================
>>>> bonding: select tx device based on rx device of a flow
>>>>
>>>> If napi_id is cached in the sk associated with skb, use the
>>>> device associated with napi_id as the transmit device.
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
>>>>
>>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>> index 7a7d584f378a..77e3bf6c4502 100644
>>>> --- a/drivers/net/bonding/bond_main.c
>>>> +++ b/drivers/net/bonding/bond_main.c
>>>> @@ -5146,6 +5146,30 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
>>>>          unsigned int count;
>>>>          u32 hash;
>>>>
>>>> +       if (skb->sk) {
>>>> +               int napi_id = skb->sk->sk_napi_id;
>>>> +               struct net_device *dev;
>>>> +               int idx;
>>>> +
>>>> +               rcu_read_lock();
>>>> +               dev = dev_get_by_napi_id(napi_id);
>>>> +               rcu_read_unlock();
>>>> +
>>>> +               if (!dev)
>>>> +                       goto hash;
>>>> +
>>>> +               count = slaves ? READ_ONCE(slaves->count) : 0;
>>>> +               if (unlikely(!count))
>>>> +                       return NULL;
>>>> +
>>>> +               for (idx = 0; idx < count; idx++) {
>>>> +                       slave = slaves->arr[idx];
>>>> +                       if (slave->dev->ifindex == dev->ifindex)
>>>> +                               return slave;
>>>> +               }
>>>> +       }
>>>> +
>>>> +hash:
>>>>          hash = bond_xmit_hash(bond, skb);
>>>>          count = slaves ? READ_ONCE(slaves->count) : 0;
>>>>          if (unlikely(!count))
>>>> =========================================================================
>>>>
>>>> If we make this a configurable bonding option, would it be an
>>>> acceptable solution to accelerate NUMA-aware apps?
>>>
>>> 	Assuming for the moment this is for "regular" network devices
>>> installed one per NUMA node, why do this in bonding instead of at a
>>> higher layer (multiple subnets or ECMP, for example)?
>>>
>>> 	Is the intent here that the bond would aggregate its interfaces
>>> via LACP with the peer being some kind of cross-chassis link aggregation
>>> (MLAG, et al)?
> 
> No.
> 
>>
>> Yes, a basic LACP bonding setup. There could be multiple peers connecting to
>> the server via a switch providing LACP-based link aggregation. No cross-chassis
>> MLAG.
> 
> LACP does not make any sense, when you have only a single physical port.
> That applies to ECMP mentioned above too I believe.

I meant the setup with 2 regular NICs, one per NUMA node, not the
multi-PF single-port setup.
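
To make the selection logic of the quoted patch easier to follow, here is a
minimal user-space sketch of it. It is not the kernel code: struct model_slave,
struct model_bond, struct napi_map_entry and both functions are hypothetical
stand-ins, the lookup table replaces dev_get_by_napi_id(), and a napi_id of 0
is assumed to mean "no NAPI ID cached in the socket". The fallback mirrors the
existing hash % count choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct model_slave {
	int ifindex;                      /* interface index of the lower NIC */
};

struct model_bond {
	const struct model_slave *slaves; /* active slaves of the bond */
	size_t count;
};

/* Assumed lookup table: which device owns a given NAPI instance. */
struct napi_map_entry {
	unsigned int napi_id;
	int ifindex;
};

/* Stand-in for dev_get_by_napi_id(): returns the ifindex, or -1 if unknown. */
static int ifindex_from_napi_id(const struct napi_map_entry *map, size_t len,
				unsigned int napi_id)
{
	for (size_t i = 0; i < len; i++)
		if (map[i].napi_id == napi_id)
			return map[i].ifindex;
	return -1;
}

/*
 * Prefer the slave the flow was received on; otherwise fall back to the
 * flow hash. Returns NULL when the bond has no usable slaves.
 */
static const struct model_slave *
select_tx_slave(const struct model_bond *bond,
		const struct napi_map_entry *map, size_t map_len,
		unsigned int sk_napi_id, uint32_t flow_hash)
{
	assert(bond != NULL);

	if (bond->count == 0)
		return NULL;

	if (sk_napi_id != 0) {            /* assumption: 0 means "not cached" */
		int rx_ifindex = ifindex_from_napi_id(map, map_len, sk_napi_id);

		if (rx_ifindex >= 0) {
			for (size_t i = 0; i < bond->count; i++)
				if (bond->slaves[i].ifindex == rx_ifindex)
					return &bond->slaves[i];
		}
	}

	/* RX device unknown or not part of this bond: hash-based choice. */
	return &bond->slaves[flow_hash % bond->count];
}
```

With two slaves (ifindex 10 and 20), a socket whose cached NAPI ID maps to
ifindex 20 always transmits on that slave regardless of the hash, while a
socket with no cached NAPI ID, or one whose RX device is outside the bond,
is balanced by hash as before.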
