Message-ID: <16217.1708653901@famine>
Date: Thu, 22 Feb 2024 18:05:01 -0800
From: Jay Vosburgh <jay.vosburgh@...onical.com>
To: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
cc: Jakub Kicinski <kuba@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Tariq Toukan <ttoukan.linux@...il.com>,
Saeed Mahameed <saeed@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Paolo Abeni <pabeni@...hat.com>, Eric Dumazet <edumazet@...gle.com>,
Saeed Mahameed <saeedm@...dia.com>, netdev@...r.kernel.org,
Tariq Toukan <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>,
Leon Romanovsky <leonro@...dia.com>
Subject: Re: [net-next V3 15/15] Documentation: networking: Add description for multi-pf netdev

Samudrala, Sridhar <sridhar.samudrala@...el.com> wrote:
>On 2/22/2024 5:00 PM, Jakub Kicinski wrote:
>> On Thu, 22 Feb 2024 08:51:36 +0100 Greg Kroah-Hartman wrote:
>>> On Tue, Feb 20, 2024 at 05:33:09PM -0800, Jakub Kicinski wrote:
>>>> Greg, we have a feature here where a single device of class net has
>>>> multiple "bus parents". We used to have one attr under class net
>>>> (device) which is a link to the bus parent. Now we either need to add
>>>> more or not bother with the linking of the whole device. Is there any
>>>> precedent / preference for solving this from the device model
>>>> perspective?
>>>
>>> How, logically, can a netdevice be controlled properly from 2 parent
>>> devices on two different busses? How is that even possible from a
>>> physical point-of-view? What exact bus types are involved here?
>> Two PCIe buses, two endpoints, two networking ports. It's one piece
>
>Isn't it only 1 networking port with multiple PFs?
>
>> of silicon, tho, so the "slices" can talk to each other internally.
>> The NVRAM configuration tells both endpoints that the user wants
>> them "bonded", when the PCI drivers probe they "find each other"
>> using some cookie or DSN or whatnot. And once they did, they spawn
>> a single netdev.
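
For illustration, my rough mental model of the probe-time pairing
described above, sketched in C below. pci_get_dsn() is the real PCI
helper for reading the Device Serial Number, but mpf_registry,
mpf_pair and mpf_create_netdev() are hypothetical names purely to
show the shape, not anything from the actual series:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/slab.h>

static LIST_HEAD(mpf_registry);		/* PFs waiting for a sibling */
static DEFINE_MUTEX(mpf_lock);

struct mpf_pair {
	struct list_head list;
	u64 dsn;			/* DSN shared by both endpoints */
	struct pci_dev *pf[2];
};

static int mpf_create_netdev(struct mpf_pair *p);	/* hypothetical */

static int mpf_probe_one(struct pci_dev *pdev)
{
	u64 dsn = pci_get_dsn(pdev);
	struct mpf_pair *p;

	if (!dsn)
		return -ENODEV;	/* no Device Serial Number capability */

	mutex_lock(&mpf_lock);
	list_for_each_entry(p, &mpf_registry, list) {
		if (p->dsn == dsn) {
			/* Sibling already probed: spawn the single netdev. */
			p->pf[1] = pdev;
			mutex_unlock(&mpf_lock);
			return mpf_create_netdev(p);
		}
	}

	/* First endpoint to probe; park it until the sibling shows up. */
	p = kzalloc(sizeof(*p), GFP_KERNEL);
	if (!p) {
		mutex_unlock(&mpf_lock);
		return -ENOMEM;
	}
	p->dsn = dsn;
	p->pf[0] = pdev;
	list_add(&p->list, &mpf_registry);
	mutex_unlock(&mpf_lock);
	return 0;
}
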
>>
>>> This "shouldn't" be possible as in the end, it's usually a PCI device
>>> handling this all, right?
>> It's really a special type of bonding of two netdevs. Like you'd bond
>> two ports to get twice the bandwidth. With the twist that the balancing
>> is done on NUMA proximity, rather than traffic hash.
>> Well, plus, the major twist that it's all done magically "for you"
>> in the vendor driver, and the two "lower" devices are not visible.
>> You only see the resulting bond.
>> I personally think that the magic hides as many problems as it
>> introduces and we'd be better off creating two separate netdevs.
>> And then a new type of "device bond" on top. Small win that
>> the "new device bond on top" can be shared code across vendors.
>
>Yes. We have been exploring a small extension to the bonding driver to
>enable a single NUMA-aware multi-threaded application to efficiently
>utilize multiple NICs across NUMA nodes.

Is this referring to something like the multi-pf under
discussion, or just generically to two arbitrary network devices
installed one per NUMA node?

>Here is an early version of a patch we have been trying and seems to be
>working well.
>
>=========================================================================
>bonding: select tx device based on rx device of a flow
>
>If a napi_id is cached in the sk associated with the skb, use the
>device associated with that napi_id as the transmit device.
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@...el.com>
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 7a7d584f378a..77e3bf6c4502 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -5146,6 +5146,35 @@ static struct slave *bond_xmit_3ad_xor_slave_get(struct bonding *bond,
> 	unsigned int count;
> 	u32 hash;
>
>+	if (skb->sk) {
>+		unsigned int napi_id = READ_ONCE(skb->sk->sk_napi_id);
>+		struct net_device *dev;
>+		int rx_ifindex = 0;
>+		int idx;
>+
>+		/* dev is only valid under RCU; copy out the ifindex */
>+		rcu_read_lock();
>+		dev = dev_get_by_napi_id(napi_id);
>+		if (dev)
>+			rx_ifindex = dev->ifindex;
>+		rcu_read_unlock();
>+
>+		if (!rx_ifindex)
>+			goto hash;
>+
>+		count = slaves ? READ_ONCE(slaves->count) : 0;
>+		if (unlikely(!count))
>+			return NULL;
>+
>+		/* prefer the slave that received this flow */
>+		for (idx = 0; idx < count; idx++) {
>+			slave = slaves->arr[idx];
>+			if (slave->dev->ifindex == rx_ifindex)
>+				return slave;
>+		}
>+	}
>+
>+hash:
> 	hash = bond_xmit_hash(bond, skb);
> 	count = slaves ? READ_ONCE(slaves->count) : 0;
> 	if (unlikely(!count))
>=========================================================================
>
>If we make this a configurable bonding option, would it be an
>acceptable solution to accelerate NUMA-aware apps?
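
Mechanically, I'd expect a new option to slot into bond_options.c
along the lines of the sketch below. Note that BOND_OPT_NAPI_TX_MAP,
the "napi_tx_map" name, and the bond->params.napi_tx_map field are
invented here purely to show the shape; the xmit-path hunk above
would then check that params field before attempting the napi_id
lookup.

static const struct bond_opt_value bond_napi_tx_map_tbl[] = {
	{ "off", 0, BOND_VALFLAG_DEFAULT },
	{ "on",  1, 0 },
	{ NULL,  -1, 0 },
};

static int bond_option_napi_tx_map_set(struct bonding *bond,
				       const struct bond_opt_value *newval)
{
	bond->params.napi_tx_map = newval->value;	/* new params field */
	return 0;
}

	/* entry in the bond_opts[] table */
	[BOND_OPT_NAPI_TX_MAP] = {
		.id = BOND_OPT_NAPI_TX_MAP,
		.name = "napi_tx_map",
		.desc = "Select tx slave by the rx napi_id cached in the sk",
		.values = bond_napi_tx_map_tbl,
		.set = bond_option_napi_tx_map_set
	},
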
Assuming for the moment this is for "regular" network devices
installed one per NUMA node, why do this in bonding instead of at a
higher layer (multiple subnets or ECMP, for example)?
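
For instance, with one NIC per NUMA node on separate subnets, plain
ECMP routing already spreads flows across both devices with no bond
involved at all (addresses below are just documentation-range
placeholders):

ip route add default \
	nexthop via 192.0.2.1 dev eth0 weight 1 \
	nexthop via 198.51.100.1 dev eth1 weight 1
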
Is the intent here that the bond would aggregate its interfaces
via LACP with the peer being some kind of cross-chassis link aggregation
(MLAG, et al)?

Given that sk_napi_id seems to be associated with
CONFIG_NET_RX_BUSY_POLL, am I correct in presuming the target
applications are DPDK-style busy poll packet processors?
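
(For reference, sk_napi_id is also what SO_INCOMING_NAPI_ID reports to
userspace, so a busy-poll application can see the same value the patch
consults; a minimal sketch, assuming kernel headers new enough to
define SO_INCOMING_NAPI_ID:)

#include <stdio.h>
#include <sys/socket.h>

/* Print the napi_id the kernel last recorded on this socket; this is
 * the same sk->sk_napi_id the patch above uses to pick the tx slave.
 */
static void print_incoming_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
		       &napi_id, &len) == 0)
		printf("fd %d: last rx napi_id %u\n", fd, napi_id);
}
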
-J

---
-Jay Vosburgh, jay.vosburgh@...onical.com