Message-ID: <228ecfb6-d5cb-403b-aecf-7c1181aa45ce@gmail.com>
Date: Tue, 5 Mar 2024 22:12:23 +0200
From: Tariq Toukan <ttoukan.linux@...il.com>
To: Przemek Kitszel <przemyslaw.kitszel@...el.com>,
 Saeed Mahameed <saeed@...nel.org>, "David S. Miller" <davem@...emloft.net>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Eric Dumazet <edumazet@...gle.com>
Cc: Saeed Mahameed <saeedm@...dia.com>, netdev@...r.kernel.org,
 Tariq Toukan <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>,
 Leon Romanovsky <leonro@...dia.com>, sridhar.samudrala@...el.com,
 Jay Vosburgh <jay.vosburgh@...onical.com>, Jiri Pirko <jiri@...dia.com>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: [net-next V4 15/15] Documentation: networking: Add description
 for multi-pf netdev



On 04/03/2024 14:03, Przemek Kitszel wrote:
> On 3/2/24 08:22, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@...dia.com>
>>
>> Add documentation for the multi-pf netdev feature.
>> Describe the mlx5 implementation and design decisions.
>>
>> Signed-off-by: Tariq Toukan <tariqt@...dia.com>
>> Signed-off-by: Saeed Mahameed <saeedm@...dia.com>
>> ---
>>   Documentation/networking/index.rst           |   1 +
>>   Documentation/networking/multi-pf-netdev.rst | 177 +++++++++++++++++++
>>   2 files changed, 178 insertions(+)
>>   create mode 100644 Documentation/networking/multi-pf-netdev.rst
>>
>> diff --git a/Documentation/networking/index.rst 
>> b/Documentation/networking/index.rst
>> index 69f3d6dcd9fd..473d72c36d61 100644
>> --- a/Documentation/networking/index.rst
>> +++ b/Documentation/networking/index.rst
>> @@ -74,6 +74,7 @@ Contents:
>>      mpls-sysctl
>>      mptcp-sysctl
>>      multiqueue
>> +   multi-pf-netdev
>>      napi
>>      net_cachelines/index
>>      netconsole
>> diff --git a/Documentation/networking/multi-pf-netdev.rst 
>> b/Documentation/networking/multi-pf-netdev.rst
>> new file mode 100644
>> index 000000000000..f6f782374b71
>> --- /dev/null
>> +++ b/Documentation/networking/multi-pf-netdev.rst
>> @@ -0,0 +1,177 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. include:: <isonum.txt>
>> +
>> +===============
>> +Multi-PF Netdev
>> +===============
>> +
>> +Contents
>> +========
>> +
>> +- `Background`_
>> +- `Overview`_
>> +- `mlx5 implementation`_
>> +- `Channels distribution`_
>> +- `Observability`_
>> +- `Steering`_
>> +- `Mutually exclusive features`_
> 
> this document describes mlx5 details mostly, and I would expect to find
> them in a mlx5.rst file instead of vendor-agnostic doc
> 

It was originally under
Documentation/networking/device_drivers/ethernet/mellanox/mlx5/
We moved it here, with the needed changes, as requested.

See:
https://lore.kernel.org/all/20240209222738.4cf9f25b@kernel.org/

>> +
>> +Background
>> +==========
>> +
>> +The advanced Multi-PF NIC technology enables several CPUs within a 
>> multi-socket server to
> 
> please remove the `advanced` word
> 
>> +connect directly to the network, each through its own dedicated PCIe 
>> interface. Through either a
>> +connection harness that splits the PCIe lanes between two cards or by 
>> bifurcating a PCIe slot for a
>> +single card. This results in eliminating the network traffic 
>> traversing over the internal bus
>> +between the sockets, significantly reducing overhead and latency, in 
>> addition to reducing CPU
>> +utilization and increasing network throughput.
>> +
>> +Overview
>> +========
>> +
>> +The feature adds support for combining multiple PFs of the same port 
>> in a Multi-PF environment under
>> +one netdev instance. It is implemented in the netdev layer. 
>> Lower-layer instances like pci func,
>> +sysfs entry, devlink) are kept separate.
>> +Passing traffic through different devices belonging to different NUMA 
>> sockets saves cross-numa
> 
> please consider spelling out NUMA as always capitalized
> 
>> +traffic and allows apps running on the same netdev from different 
>> numas to still feel a sense of
>> +proximity to the device and achieve improved performance.
>> +
>> +mlx5 implementation
>> +===================
>> +
>> +Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs 
>> together which belong to the same
>> +NIC and has the socket-direct property enabled, once all PFS are 
>> probed, we create a single netdev
> 
> s/PFS/PFs/
> 
>> +to represent all of them, symmetrically, we destroy the netdev 
>> whenever any of the PFs is removed.
>> +
>> +The netdev network channels are distributed between all devices, a 
>> proper configuration would utilize
>> +the correct close numa node when working on a certain app/cpu.
> 
> CPU
> 
>> +
>> +We pick one PF to be a primary (leader), and it fills a special role. 
>> The other devices
>> +(secondaries) are disconnected from the network at the chip level 
>> (set to silent mode). In silent
>> +mode, no south <-> north traffic flowing directly through a secondary 
>> PF. It needs the assistance of
>> +the leader PF (east <-> west traffic) to function. All RX/TX traffic 
>> is steered through the primary
> 
> Rx, Tx (whole document)
> 
>> +to/from the secondaries.
>> +
>> +Currently, we limit the support to PFs only, and up to two PFs 
>> (sockets).
>> +
>> +Channels distribution
>> +=====================
>> +
>> +We distribute the channels between the different PFs to achieve local 
>> NUMA node performance
>> +on multiple NUMA nodes.
>> +
>> +Each combined channel works against one specific PF, creating all its 
>> datapath queues against it. We
>> +distribute channels to PFs in a round-robin policy.
>> +
>> +::
>> +
>> +        Example for 2 PFs and 5 channels:
>> +        +--------+--------+
>> +        | ch idx | PF idx |
>> +        +--------+--------+
>> +        |    0   |    0   |
>> +        |    1   |    1   |
>> +        |    2   |    0   |
>> +        |    3   |    1   |
>> +        |    4   |    0   |
>> +        +--------+--------+
>> +
>> +
>> +We prefer this round-robin distribution policy over another suggested 
>> intuitive distribution, in
>> +which we first distribute one half of the channels to PF0 and then 
>> the second half to PF1.
> 
> Please rephrase to describe current state (which makes sense over what
> was suggested), instead of addressing feedback (that could be kept in
> cover letter if you really want).
> 
> And again, the wording "we" clearly indicates that this section, as
> future ones, is mlx specific.
> 
>> +
>> +The reason we prefer round-robin is, it is less influenced by changes 
>> in the number of channels. The
>> +mapping between a channel index and a PF is fixed, no matter how many 
>> channels the user configures.
>> +As the channel stats are persistent across channel's closure, 
>> changing the mapping every single time
>> +would turn the accumulative stats less representing of the channel's 
>> history.
>> +
>> +This is achieved by using the correct core device instance (mdev) in 
>> each channel, instead of them
>> +all using the same instance under "priv->mdev".
>> +
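
To illustrate the stability argument in the quoted text, here is a rough
Python sketch. It is not driver code: the function names are made up for
this example, and the "split halves" policy is only my reading of the
alternative distribution mentioned above. It compares which channels
change PF when the channel count grows from 4 to 8.

```python
def round_robin_pf(ch_idx, num_pfs=2):
    # The channel index alone determines the PF; the channel count plays no role.
    return ch_idx % num_pfs


def split_halves_pf(ch_idx, num_channels, num_pfs=2):
    # Alternative policy: first block of channels to PF0, the rest to PF1.
    per_pf = -(-num_channels // num_pfs)  # ceiling division
    return ch_idx // per_pf


def moved_channels(policy_before, policy_after, num_common):
    # Channels present in both configurations whose PF assignment changed.
    return [ch for ch in range(num_common)
            if policy_before(ch) != policy_after(ch)]


# Same mapping as the "2 PFs and 5 channels" table in the document.
rr_table = [round_robin_pf(ch) for ch in range(5)]

# Grow from 4 to 8 channels and check channels 0..3 under each policy.
rr_moved = moved_channels(round_robin_pf, round_robin_pf, 4)
split_moved = moved_channels(
    lambda ch: split_halves_pf(ch, 4),
    lambda ch: split_halves_pf(ch, 8),
    4,
)
print(rr_table, rr_moved, split_moved)
```

With round-robin no existing channel moves, while with the split policy
channels 2 and 3 switch from PF1 to PF0, which is what would blur the
accumulated per-channel stats.
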
>> +Observability
>> +=============
>> +The relation between PF, irq, napi, and queue can be observed via 
>> netlink spec:
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml 
>> --dump queue-get --json='{"ifindex": 13}'
>> +[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
>> + {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml 
>> --dump napi-get --json='{"ifindex": 13}'
>> +[{'id': 543, 'ifindex': 13, 'irq': 42},
>> + {'id': 542, 'ifindex': 13, 'irq': 41},
>> + {'id': 541, 'ifindex': 13, 'irq': 40},
>> + {'id': 540, 'ifindex': 13, 'irq': 39},
>> + {'id': 539, 'ifindex': 13, 'irq': 36}]
>> +
>> +Here you can clearly observe our channels distribution policy:
>> +
>> +$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
>> +/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
>> +/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
>> +/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
>> +/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
>> +/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
>> +
>> +Steering
>> +========
>> +Secondary PFs are set to "silent" mode, meaning they are disconnected 
>> from the network.
>> +
>> +In RX, the steering tables belong to the primary PF only, and it is 
>> its role to distribute incoming
>> +traffic to other PFs, via cross-vhca steering capabilities. Nothing 
>> special about the RSS table
>> +content, except that it needs a capable device to point to the 
>> receive queues of a different PF.
> 
> I guess you cannot enable multi-PF on an incapable device, so is there
> anything noteworthy in the last sentence?
> 

I was asked in earlier versions of the patchset to elaborate on this.

The sentence describes what the RSS table looks like on a capable device.
Maybe I should rephrase it to emphasize the point.

It is not obvious that we still maintain a single RSS table, just like
non-multi-PF netdevs do. Preserving this (over other, more complex
alternatives) is what is noteworthy here.
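
As a conceptual illustration only (plain Python, not mlx5 code), the
single-table idea could be pictured as below. The RQ tuple, the table
size of 8 and the "entry i -> queue i mod nqueues" fill are assumptions
made for the example; the only point is that one ordinary-looking table
ends up referencing receive queues that live on both PFs.

```python
from collections import namedtuple

# Hypothetical stand-in for a receive queue; 'pf' is the PF that owns it.
RQ = namedtuple("RQ", ["channel", "pf"])


def build_rqs(num_channels, num_pfs=2):
    # One RQ per channel, created against the channel's round-robin PF.
    return [RQ(channel=ch, pf=ch % num_pfs) for ch in range(num_channels)]


def build_indirection_table(rqs, table_size):
    # A single table, filled as if all queues belonged to one device.
    return [rqs[i % len(rqs)] for i in range(table_size)]


rqs = build_rqs(5)
table = build_indirection_table(rqs, 8)

table_channels = [rq.channel for rq in table]
pfs_referenced = sorted({rq.pf for rq in table})
print(table_channels, pfs_referenced)
```

The table content is unremarkable; what needs device support is that
some of its entries resolve to queues owned by a different PF than the
one holding the table.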

>> +
>> +In TX, the primary PF creates a new TX flow table, which is aliased 
>> by the secondaries, so they can
>> +go out to the network through it.
>> +
>> +In addition, we set default XPS configuration that, based on the cpu, 
>> selects an SQ belonging to the
>> +PF on the same node as the cpu.
>> +
>> +XPS default config example:
>> +
>> +NUMA node(s):          2
>> +NUMA node0 CPU(s):     0-11
>> +NUMA node1 CPU(s):     12-23
>> +
>> +PF0 on node0, PF1 on node1.
>> +
>> +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>> +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>> +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>> +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>> +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>> +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>> +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>> +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>> +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>> +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>> +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>> +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>> +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>> +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>> +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>> +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>> +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>> +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>> +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>> +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>> +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>> +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>> +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>> +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
>> +
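
For readers who want to check the masks above, here is a small Python
sketch that reproduces the pattern of this particular example. It
assumes what the example states (2 PFs, PF0 on node0 with CPUs 0-11,
PF1 on node1 with CPUs 12-23, round-robin queue-to-PF mapping) and the
helper name is made up; it is not how the driver computes the defaults.

```python
def default_xps_mask(txq, num_pfs=2, cpus_per_node=12, width=6):
    pf = txq % num_pfs            # round-robin queue-to-PF mapping
    nth_on_pf = txq // num_pfs    # n-th queue of this PF -> n-th CPU of its node
    # Assumes PF n sits on NUMA node n and nodes own contiguous CPU ranges.
    cpu = pf * cpus_per_node + nth_on_pf
    return format(1 << cpu, "0{}x".format(width))


masks = [default_xps_mask(q) for q in range(24)]
for q, mask in enumerate(masks):
    print("tx-{}: {}".format(q, mask))
```

Each mask has exactly one bit set, and odd-numbered queues (PF1) only
ever select CPUs 12-23, matching the listing above.
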
>> +Mutually exclusive features
>> +===========================
>> +
>> +The nature of Multi-PF, where different channels work with different 
>> PFs, conflicts with
>> +stateful features where the state is maintained in one of the PFs.
>> +For example, in the TLS device-offload feature, special context 
>> objects are created per connection
>> +and maintained in the PF.  Transitioning between different RQs/SQs 
>> would break the feature. Hence,
>> +we disable this combination for now.
> 
> From the reading I will know what the feature is at the user level.
> 
> After splitting most of the doc out into mlx5 file, and fixing the minor
> typos, feel free to add my:
> 
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@...el.com>
> 

Thanks.
