Message-ID: <228ecfb6-d5cb-403b-aecf-7c1181aa45ce@gmail.com>
Date: Tue, 5 Mar 2024 22:12:23 +0200
From: Tariq Toukan <ttoukan.linux@...il.com>
To: Przemek Kitszel <przemyslaw.kitszel@...el.com>,
Saeed Mahameed <saeed@...nel.org>, "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Eric Dumazet <edumazet@...gle.com>
Cc: Saeed Mahameed <saeedm@...dia.com>, netdev@...r.kernel.org,
Tariq Toukan <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>,
Leon Romanovsky <leonro@...dia.com>, sridhar.samudrala@...el.com,
Jay Vosburgh <jay.vosburgh@...onical.com>, Jiri Pirko <jiri@...dia.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: [net-next V4 15/15] Documentation: networking: Add description
for multi-pf netdev
On 04/03/2024 14:03, Przemek Kitszel wrote:
> On 3/2/24 08:22, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@...dia.com>
>>
>> Add documentation for the multi-pf netdev feature.
>> Describe the mlx5 implementation and design decisions.
>>
>> Signed-off-by: Tariq Toukan <tariqt@...dia.com>
>> Signed-off-by: Saeed Mahameed <saeedm@...dia.com>
>> ---
>> Documentation/networking/index.rst | 1 +
>> Documentation/networking/multi-pf-netdev.rst | 177 +++++++++++++++++++
>> 2 files changed, 178 insertions(+)
>> create mode 100644 Documentation/networking/multi-pf-netdev.rst
>>
>> diff --git a/Documentation/networking/index.rst
>> b/Documentation/networking/index.rst
>> index 69f3d6dcd9fd..473d72c36d61 100644
>> --- a/Documentation/networking/index.rst
>> +++ b/Documentation/networking/index.rst
>> @@ -74,6 +74,7 @@ Contents:
>> mpls-sysctl
>> mptcp-sysctl
>> multiqueue
>> + multi-pf-netdev
>> napi
>> net_cachelines/index
>> netconsole
>> diff --git a/Documentation/networking/multi-pf-netdev.rst
>> b/Documentation/networking/multi-pf-netdev.rst
>> new file mode 100644
>> index 000000000000..f6f782374b71
>> --- /dev/null
>> +++ b/Documentation/networking/multi-pf-netdev.rst
>> @@ -0,0 +1,177 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. include:: <isonum.txt>
>> +
>> +===============
>> +Multi-PF Netdev
>> +===============
>> +
>> +Contents
>> +========
>> +
>> +- `Background`_
>> +- `Overview`_
>> +- `mlx5 implementation`_
>> +- `Channels distribution`_
>> +- `Observability`_
>> +- `Steering`_
>> +- `Mutually exclusive features`_
>
> this document describes mlx5 details mostly, and I would expect to find
> them in a mlx5.rst file instead of vendor-agnostic doc
>
It was originally under
Documentation/networking/device_drivers/ethernet/mellanox/mlx5/
We moved it here with the needed changes per request.
See:
https://lore.kernel.org/all/20240209222738.4cf9f25b@kernel.org/
>> +
>> +Background
>> +==========
>> +
>> +The advanced Multi-PF NIC technology enables several CPUs within a
>> multi-socket server to
>
> please remove the `advanced` word
>
>> +connect directly to the network, each through its own dedicated PCIe
>> interface, either via a
>> +connection harness that splits the PCIe lanes between two cards or by
>> bifurcating a PCIe slot for a
>> +single card. This eliminates the network traffic
>> traversing the internal bus
>> +between the sockets, significantly reducing overhead and latency, in
>> addition to reducing CPU
>> +utilization and increasing network throughput.
>> +
>> +Overview
>> +========
>> +
>> +The feature adds support for combining multiple PFs of the same port
>> in a Multi-PF environment under
>> +one netdev instance. It is implemented in the netdev layer.
>> Lower-layer instances (pci func,
>> +sysfs entry, devlink) are kept separate.
>> +Passing traffic through different devices belonging to different NUMA
>> sockets saves cross-numa
>
> please consider spelling out NUMA as always capitalized
>
>> +traffic and allows apps running on the same netdev from different
>> numas to still feel a sense of
>> +proximity to the device and achieve improved performance.
>> +
>> +mlx5 implementation
>> +===================
>> +
>> +Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs
>> together that belong to the same
>> +NIC and have the socket-direct property enabled. Once all PFS are
>> probed, we create a single netdev
>
> s/PFS/PFs/
>
>> +to represent all of them. Symmetrically, we destroy the netdev
>> whenever any of the PFs is removed.
>> +
>> +The netdev network channels are distributed between all devices; a
>> proper configuration would utilize
>> +the closest numa node when working on a certain app/cpu.
>
> CPU
>
>> +
>> +We pick one PF to be a primary (leader), and it fills a special role.
>> The other devices
>> +(secondaries) are disconnected from the network at the chip level
>> (set to silent mode). In silent
>> +mode, no south <-> north traffic flows directly through a secondary
>> PF. It needs the assistance of
>> +the leader PF (east <-> west traffic) to function. All RX/TX traffic
>> is steered through the primary
>
> Rx, Tx (whole document)
>
>> +to/from the secondaries.
>> +
>> +Currently, we limit the support to PFs only, and up to two PFs
>> (sockets).
>> +
>> +Channels distribution
>> +=====================
>> +
>> +We distribute the channels between the different PFs to achieve local
>> NUMA node performance
>> +on multiple NUMA nodes.
>> +
>> +Each combined channel works against one specific PF, creating all its
>> datapath queues against it. We
>> +distribute channels to PFs in a round-robin policy.
>> +
>> +::
>> +
>> + Example for 2 PFs and 5 channels:
>> + +--------+--------+
>> + | ch idx | PF idx |
>> + +--------+--------+
>> + | 0 | 0 |
>> + | 1 | 1 |
>> + | 2 | 0 |
>> + | 3 | 1 |
>> + | 4 | 0 |
>> + +--------+--------+
>> +
>> +
>> +We prefer this round-robin distribution policy over another suggested
>> intuitive distribution, in
>> +which we first distribute one half of the channels to PF0 and then
>> the second half to PF1.
>
> Please rephrase to describe current state (which makes sense over what
> was suggested), instead of addressing feedback (that could be kept in
> cover letter if you really want).
>
> And again, the wording "we" clearly indicates that this section, as
> future ones, is mlx specific.
>
>> +
>> +The reason we prefer round-robin is that it is less influenced by changes
>> in the number of channels. The
>> +mapping between a channel index and a PF is fixed, no matter how many
>> channels the user configures.
>> +As the channel stats are persistent across a channel's closure,
>> changing the mapping every single time
>> +would make the accumulated stats less representative of the channel's
>> history.
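
To illustrate the point above: with round-robin, the PF serving a channel
is purely a function of the channel index, so existing channels keep their
PF (and meaningful accumulated stats) when the channel count changes. A
quick sketch (illustrative Python only, not driver code):

```python
# Round-robin channel -> PF mapping, as described in the patch above.
# Illustrative sketch only, not mlx5 code.
def channel_to_pf(ch_idx: int, num_pfs: int = 2) -> int:
    """PF index serving a given combined channel."""
    return ch_idx % num_pfs

# 5 channels, 2 PFs -- matches the table in the patch:
mapping_5 = [channel_to_pf(i) for i in range(5)]

# Growing to 8 channels leaves channels 0..4 on the same PFs,
# so their accumulated per-channel stats stay meaningful:
mapping_8 = [channel_to_pf(i) for i in range(8)]
assert mapping_8[:5] == mapping_5
```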
>> +
>> +This is achieved by using the correct core device instance (mdev) in
>> each channel, instead of them
>> +all using the same instance under "priv->mdev".
>> +
>> +Observability
>> +=============
>> +The relation between PF, irq, napi, and queue can be observed via
>> the netlink spec:
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml
>> --dump queue-get --json='{"ifindex": 13}'
>> +[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
>> + {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
>> + {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
>> + {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
>> + {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
>> + {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
>> +
>> +$ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml
>> --dump napi-get --json='{"ifindex": 13}'
>> +[{'id': 543, 'ifindex': 13, 'irq': 42},
>> + {'id': 542, 'ifindex': 13, 'irq': 41},
>> + {'id': 541, 'ifindex': 13, 'irq': 40},
>> + {'id': 540, 'ifindex': 13, 'irq': 39},
>> + {'id': 539, 'ifindex': 13, 'irq': 36}]
>> +
>> +Here you can clearly observe our channels distribution policy:
>> +
>> +$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
>> +/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
>> +/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
>> +/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
>> +/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
>> +/proc/irq/42/mlx5_comp3@pci:0000:08:00.0
>> +
>> +Steering
>> +========
>> +Secondary PFs are set to "silent" mode, meaning they are disconnected
>> from the network.
>> +
>> +In RX, the steering tables belong to the primary PF only, and it is
>> its role to distribute incoming
>> +traffic to other PFs, via cross-vhca steering capabilities. Nothing
>> special about the RSS table
>> +content, except that it needs a capable device to point to the
>> receive queues of a different PF.
>
> I guess you cannot enable the multi-pf for incapable device, so there is
> anything noteworthy in last sentence?
>
I was asked in earlier patchsets to elaborate on this.
It tells "how" an RSS table looks on a capable device.
Maybe I should re-phrase to emphasize the point.
It is not obvious that we still maintain a single RSS table, just as
non-multi-PF netdevs do. Preserving this (over other, more complex
alternatives) is what is noteworthy here.
>> +
>> +In TX, the primary PF creates a new TX flow table, which is aliased
>> by the secondaries, so they can
>> +go out to the network through it.
>> +
>> +In addition, we set a default XPS configuration that, based on the cpu,
>> selects an SQ belonging to the
>> +PF on the same node as the cpu.
>> +
>> +XPS default config example:
>> +
>> +NUMA node(s): 2
>> +NUMA node0 CPU(s): 0-11
>> +NUMA node1 CPU(s): 12-23
>> +
>> +PF0 on node0, PF1 on node1.
>> +
>> +- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
>> +- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
>> +- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
>> +- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
>> +- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
>> +- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
>> +- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
>> +- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
>> +- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
>> +- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
>> +- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
>> +- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
>> +- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
>> +- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
>> +- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
>> +- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
>> +- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
>> +- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
>> +- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
>> +- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
>> +- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
>> +- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
>> +- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
>> +- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
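
The masks above are hex CPU bitmaps with a single bit set; decoding them
shows even SQs pinned to node0 CPUs (0-11) and odd SQs to node1 CPUs
(12-23). A small sketch (illustrative only, decoding the first few masks):

```python
# Decode a few of the xps_cpus masks quoted above. Each mask is a hex
# CPU bitmap with one bit set; the set bit is the CPU the Tx queue is
# pinned to. Illustrative sketch, not driver code.
masks = ["000001", "001000", "000002", "002000", "000004", "004000"]

# Index of the single set bit -> CPU number for each SQ:
cpus = [int(m, 16).bit_length() - 1 for m in masks]

# Even SQ indices land on node0 (CPUs 0-11), odd ones on node1 (12-23):
for q, cpu in enumerate(cpus):
    node = 0 if cpu <= 11 else 1
    print(f"tx-{q}: CPU {cpu} (node{node})")
```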
>> +
>> +Mutually exclusive features
>> +===========================
>> +
>> +The nature of Multi-PF, where different channels work with different
>> PFs, conflicts with
>> +stateful features where the state is maintained in one of the PFs.
>> +For example, in the TLS device-offload feature, special context
>> objects are created per connection
>> +and maintained in the PF. Transitioning between different RQs/SQs
>> would break the feature. Hence,
>> +we disable this combination for now.
>
> From the reading I will know what the feature is at the user level.
>
> After splitting most of the doc out into mlx5 file, and fixing the minor
> typos, feel free to add my:
>
> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@...el.com>
>
Thanks.