netdev - Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and other) Representors

Message-ID: <20220805184359.5c55ca0d@kernel.org>
Date:   Fri, 5 Aug 2022 18:43:59 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     <ecree@...inx.com>
Cc:     <netdev@...r.kernel.org>, <davem@...emloft.net>,
        <pabeni@...hat.com>, <edumazet@...gle.com>, <corbet@....net>,
        <linux-doc@...r.kernel.org>, Edward Cree <ecree.xilinx@...il.com>,
        <linux-net-drivers@....com>,
        Jacob Keller <jacob.e.keller@...el.com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>,
        Michael Chan <michael.chan@...adcom.com>,
        Andy Gospodarek <andy@...yhouse.net>,
        Saeed Mahameed <saeed@...nel.org>,
        Jiri Pirko <jiri@...nulli.us>,
        Shannon Nelson <snelson@...sando.io>,
        Simon Horman <simon.horman@...igine.com>,
        Alexander Duyck <alexander.duyck@...il.com>
Subject: Re: [RFC PATCH net-next] docs: net: add an explanation of VF (and
 other) Representors

On Fri, 5 Aug 2022 17:58:50 +0100 ecree@...inx.com wrote:
> From: Edward Cree <ecree.xilinx@...il.com>
> 
> There's no clear explanation of what VF Representors are for, their
>  semantics, etc., outside of vendor docs and random conference slides.
> Add a document explaining Representors and defining what drivers that
>  implement them are expected to do.
> 
> Signed-off-by: Edward Cree <ecree.xilinx@...il.com>
> ---
> This documents representors as I understand them, but I suspect others
>  (including other vendors) might disagree (particularly with the "what
>  functions should have a rep" section).  I'm hoping that through review
>  of this doc we can converge on a consensus.

Thanks for doing this, we need to CC people tho. Otherwise they won't
pay attention. (adding semi-non-exhaustively those I have in my address
book)

> +=============================
> +Network Function Representors
> +=============================
> +
> +This document describes the semantics and usage of representor netdevices, as
> +used to control internal switching on SmartNICs.  For the closely-related port
> +representors on physical (multi-port) switches, see
> +:ref:`Documentation/networking/switchdev.rst <switchdev>`.
> +
> +Motivation
> +----------
> +
> +Since the mid-2010s, network cards have started offering more complex
> +virtualisation capabilities than the legacy SR-IOV approach (with its simple
> +MAC/VLAN-based switching model) can support.  This led to a desire to offload
> +software-defined networks (such as OpenVSwitch) to these NICs to specify the
> +network connectivity of each function.  The resulting designs are variously
> +called SmartNICs or DPUs.
> +
> +Network function representors provide the mechanism by which network functions
> +on an internal switch are managed. They are used both to configure the
> +corresponding function ('representee') and to handle slow-path traffic to and
> +from the representee for which no fast-path switching rule is matched.

I think we should just describe how those netdevs bring SR-IOV
forwarding into the Linux networking stack. This section reads too much
like it's a hack rather than an obvious choice. Perhaps:

The representors bring the standard Linux networking stack to IOV
functions. Just as each port of a Linux-controlled switch has a
separate netdev, each virtual function has one. When the system boots,
and before any offload is configured, all packets from the virtual
functions appear in the networking stack of the PF via the representors.
The PF can thus always communicate freely with the virtual functions.
The PF can configure standard Linux forwarding between representors,
the uplink or any other netdev (routing, bridging, TC classifiers).

> +That is, a representor is both a control plane object (representing the function
> +in administrative commands) and a data plane object (one end of a virtual pipe).
> +As a virtual link endpoint, the representor can be configured like any other
> +netdevice; in some cases (e.g. link state) the representee will follow the
> +representor's configuration, while in others there are separate APIs to
> +configure the representee.
> +
> +What does a representor do?
> +---------------------------
> +
> +A representor has three main rôles.
> +
> +1. It is used to configure the representee's virtual MAC, e.g. link up/down,
> +   MTU, etc.  For instance, bringing the representor administratively UP should
> +   cause the representee to see a link up / carrier on event.

I presume you're trying to start a discussion here, rather than stating
the existing behavior. Or the "virtual MAC" means something else than I
think it means?

> +2. It provides the slow path for traffic which does not hit any offloaded
> +   fast-path rules in the virtual switch.  Packets transmitted on the
> +   representor netdevice should be delivered to the representee; packets
> +   transmitted to the representee which fail to match any switching rule should
> +   be received on the representor netdevice.  (That is, there is a virtual pipe
> +   connecting the representor to the representee, similar in concept to a veth
> +   pair.)
> +
> +   This allows software switch implementations (such as OpenVSwitch or a Linux
> +   bridge) to forward packets between representees and the rest of the network.
> +3. It acts as a handle by which switching rules (such as TC filters) can refer
> +   to the representee, allowing these rules to be offloaded.
> +
> +The combination of 2) and 3) means that the behaviour (apart from performance)
> +should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
> +on a VF representor applies in software to packets received on that representor
> +netdevice, while in hardware offload it would apply to packets transmitted by
> +the representee VF.  Conversely, a mirred egress redirect to a VF representor
> +corresponds in hardware to delivery directly to the representee VF.
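
To make the quoted "virtual pipe" semantics concrete, here is a minimal
sketch of the behaviour described in roles 2) and 3). It is a toy model, not
driver code: the ``ToySwitch`` class, its method names and the ``-rep``
naming suffix are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToySwitch:
    """Toy model of a NIC-internal switch (hypothetical, illustration only)."""

    # Offloaded fast-path rules: source port name -> destination port name.
    fast_path: dict = field(default_factory=dict)

    def from_representee(self, vf: str, packet: bytes) -> tuple:
        """Packet transmitted by a representee: return (delivery point, packet)."""
        if vf in self.fast_path:
            # An offloaded rule matched: hardware forwards directly.
            return (self.fast_path[vf], packet)
        # No rule matched: the packet surfaces on the representor (slow path).
        return (vf + "-rep", packet)

    def xmit_on_representor(self, rep: str, packet: bytes) -> tuple:
        """Packet transmitted on a representor is delivered to its representee."""
        if not rep.endswith("-rep"):
            raise ValueError("not a representor name: " + rep)
        return (rep[: -len("-rep")], packet)


switch = ToySwitch()
miss = switch.from_representee("vf0", b"ping")      # lands on "vf0-rep"
switch.fast_path["vf0"] = "uplink"                  # rule gets offloaded
hit = switch.from_representee("vf0", b"ping")       # now goes to "uplink"
```

In this model the final destination is the same whether software forwards the
packet from ``vf0-rep`` to the uplink or the offloaded rule does it directly,
which is the equivalence the quoted text asks for.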
> +
> +What functions should have a representor?
> +-----------------------------------------
> +
> +Essentially, for each virtual port on the device's internal switch, there
> +should be a representor.
> +The only exceptions are the management PF (whose port is used for traffic to
> +and from all other representors) 

AFAIK there's no "management PF" in the Linux model.

> and perhaps the physical network port (for
> +which the management PF may act as a kind of port representor.  Devices that
> +combine multiple physical ports and SR-IOV capability may need to have port
> +representors in addition to PF/VF representors).

That doesn't generalize well. If we just say that all uplinks and PFs
should have a repr we don't have to make exceptions for all the cases
where that's the case.

> +Thus, the following should all have representors:
> +
> + - VFs belonging to the management PF.

management PF -> /dev/null

> + - Other PFs on the PCIe controller, and any VFs belonging to them.

What is "the PCIe controller" here? I presume you've seen the
devlink-port doc.

> + - PFs and VFs on other PCIe controllers on the device (e.g. for any embedded
> +   System-on-Chip within the SmartNIC).
> + - PFs and VFs with other personalities, including network block devices (such
> +   as a vDPA virtio-blk PF backed by remote/distributed storage).

IDK how you can configure block forwarding (which is DMAs of command
+ data blocks, not packets AFAIU) with networking concepts.
I've not used the storage functions tho, so I could be wrong.

> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
> +   their own port on the switch (as opposed to using their parent PF's port).
> + - Any accelerators or plugins on the device whose interface to the network is
> +   through a virtual switch port, even if they do not have a corresponding PCIe
> +   PF or VF.
> +
> +This allows the entire switching behaviour of the NIC to be controlled through
> +representor TC rules.
> +
> +An example of a PCIe function that should *not* have a representor is, on an
> +FPGA-based NIC, a PF which is only used to deploy a new bitstream to the FPGA,
> +and which cannot create RX and TX queues.

What's the thinking here? We're letting everyone add their own
exceptions to the doc?

>  Since such a PF does not have network
> +access through the internal switch, not even indirectly via a distributed
> +storage endpoint, there is no switch virtual port for the representor to
> +configure or to be the other end of the virtual pipe.

Does it have a netdev?

> +How are representors created?
> +-----------------------------
> +
> +The driver instance attached to the management PF should enumerate the virtual
> +ports on the switch, and for each representee, create a pure-software netdevice
> +which has some form of in-kernel reference to the PF's own netdevice or driver
> +private data (``netdev_priv()``).
> +If switch ports can dynamically appear/disappear, the PF driver should create
> +and destroy representors appropriately.
> +The operations of the representor netdevice will generally involve acting
> +through the management PF.  For example, ``ndo_start_xmit()`` might send the
> +packet, specially marked for delivery to the representee, through a TX queue
> +attached to the management PF.

IDK how common that is, RDMA NICs will likely do the "dedicated queue
per repr" thing since they pretend to have infinite queues.

> +How are representors identified?
> +--------------------------------
> +
> +The representor netdevice should *not* directly refer to a PCIe device (e.g.
> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
> +representee or of the management PF.

Do we know how many existing ones do? 

> +Instead, it should implement the ``ndo_get_port_parent_id()`` and
> +``ndo_get_phys_port_name()`` netdevice ops (corresponding to the
> +``phys_switch_id`` and ``phys_port_name`` sysfs nodes).
> +``ndo_get_port_parent_id()`` should return a string identical to that returned
> +by the management PF's ``ndo_get_phys_port_id()`` (typically the MAC address of
> +the physical port), while ``ndo_get_phys_port_name()`` should return a string
> +describing the representee's relation to the management PF.
> +
> +For instance, if the management PF has a ``phys_port_name`` of ``p0`` (physical
> +port 0), then the representor for the third VF on the second PF should typically
> +be ``p0pf1vf2`` (i.e. "port 0, PF 1, VF 2").  More generally, the
> +``phys_port_name`` for a PCIe function should be the concatenation of one or
> +more of:
> +
> + - ``p<N>``, physical port number *N*.
> + - ``if<N>``, PCIe controller number *N*.  The semantics of these numbers are
> +   vendor-defined, and controller 0 need not correspond to the controller on
> +   which the management PF resides.

/me checks in horror if this is already upstream

> + - ``pf<N>``, PCIe physical function index *N*.
> + - ``vf<N>``, PCIe virtual function index *N*.
> + - ``sf<N>``, Subfunction index *N*.

Yeah, nah... implement devlink port, please. This is done by the core,
you shouldn't have to document this.
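
For readers following the naming discussion, here is a minimal sketch of how
the grammar quoted above (``p<N>``, ``if<N>``, ``pf<N>``, ``vf<N>``,
``sf<N>``) decomposes a ``phys_port_name`` string. It only models the
grammar as written in the quoted patch; the function name
``parse_phys_port_name`` is hypothetical, and nothing here describes what
the devlink core actually emits.

```python
import re

# One component of the quoted grammar: a prefix followed by a decimal index.
_COMPONENT = re.compile(r"(pf|vf|sf|if|p)(\d+)")


def parse_phys_port_name(name: str) -> list:
    """Split a name such as 'p0pf1vf2' into (prefix, index) pairs."""
    parts = []
    pos = 0
    while pos < len(name):
        match = _COMPONENT.match(name, pos)
        if match is None:
            raise ValueError("unrecognised component at offset %d in %r" % (pos, name))
        parts.append((match.group(1), int(match.group(2))))
        pos = match.end()
    if not parts:
        raise ValueError("empty phys_port_name")
    return parts


components = parse_phys_port_name("p0pf1vf2")
# components == [("p", 0), ("pf", 1), ("vf", 2)], i.e. "port 0, PF 1, VF 2"
```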

> +It is expected that userland will use this information (e.g. through udev rules)
> +to construct an appropriately informative name or alias for the netdevice.  For
> +instance if the management PF is ``eth4`` then our representor with a
> +``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
> +
> +There are as yet no established conventions for naming representors which do not
> +correspond to PCIe functions (e.g. accelerators and plugins).
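
The renaming example quoted above (``eth4`` plus ``p0pf1vf2`` giving
``eth4pf1vf2rep``) can be sketched as a small string transformation. This is
only an illustration of that one example, assuming the policy is "drop the
leading physical-port component, append the rest to the PF's name, add a
``rep`` suffix"; the function name ``representor_alias`` is hypothetical and
a real deployment would more likely express this as a udev rule.

```python
import re


def representor_alias(pf_netdev: str, phys_port_name: str) -> str:
    """Build an alias like 'eth4pf1vf2rep' from 'eth4' and 'p0pf1vf2'."""
    # Drop a leading 'p<N>' (physical port); 'pf<N>' is untouched because
    # the pattern requires a digit immediately after the 'p'.
    relation = re.sub(r"^p\d+", "", phys_port_name)
    return pf_netdev + relation + "rep"


alias = representor_alias("eth4", "p0pf1vf2")   # "eth4pf1vf2rep"
```
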
> +
> +How do representors interact with TC rules?
> +-------------------------------------------
> +
> +Any TC rule on a representor applies (in software TC) to packets received by
> +that representor netdevice.  Thus, if the delivery part of the rule corresponds
> +to another port on the virtual switch, the driver may choose to offload it to
> +hardware, applying it to packets transmitted by the representee.
> +
> +Similarly, since a TC mirred egress action targeting the representor would (in
> +software) send the packet through the representor (and thus indirectly deliver
> +it to the representee), hardware offload should interpret this as delivery to
> +the representee.
> +
> +As a simple example, if ``eth0`` is the management PF's netdevice and ``eth1``
> +is a VF representor, the following rules::
> +
> +    tc filter add dev eth1 parent ffff: protocol ipv4 flower \
> +        action mirred egress redirect dev eth0
> +    tc filter add dev eth0 parent ffff: protocol ipv4 flower \
> +        action mirred egress mirror dev eth1
> +
> +would mean that all IPv4 packets from the VF are sent out the physical port, and
> +all IPv4 packets received on the physical port are delivered to the VF in
> +addition to the management PF.
> +
> +Of course the rules can (if supported by the NIC) include packet-modifying
> +actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
> +
> +Tunnel encapsulation and decapsulation are rather more complicated, as they
> +involve a third netdevice (a tunnel netdev operating in metadata mode, such as
> +a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
> +require an IP address to be bound to the underlay device (e.g. management PF or
> +port representor).  TC rules such as::
> +
> +    tc filter add dev eth1 parent ffff: flower \
> +        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
> +                              dst_port 4789 \
> +        action mirred egress redirect dev vxlan0
> +    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
> +        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
> +        action tunnel_key unset action mirred egress redirect dev eth1
> +
> +where ``LOCAL_IP`` is an IP address bound to ``eth0``, and ``REMOTE_IP`` is
> +another IP address on the same subnet, mean that packets sent by the VF should
> +be VxLAN encapsulated and sent out the physical port (the driver has to deduce
> +this by a route lookup of ``LOCAL_IP`` leading to ``eth0``, and also perform an
> +ARP/neighbour table lookup to find the MAC addresses to use in the outer
> +Ethernet frame), while UDP packets received on the physical port with UDP port
> +4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``, decapsulated
> +and forwarded to the VF.
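
As an illustration of the decapsulation match described in the quoted text,
here is a minimal sketch that checks the UDP destination port and extracts
the 24-bit network identifier from the 8-byte VXLAN header (layout per
RFC 7348). It models only the match, not the offload itself; the function
names ``vxlan_vni`` and ``should_decap`` are hypothetical.

```python
import struct

VXLAN_UDP_PORT = 4789          # port used in the quoted TC rules
VXLAN_FLAG_VNI_VALID = 0x08    # "I" flag: the VNI field is valid


def vxlan_vni(udp_payload: bytes):
    """Return the VNI from a VXLAN header, or None if it is absent or invalid."""
    if len(udp_payload) < 8:
        return None
    # 1 byte flags, 3 reserved bytes, then 24-bit VNI plus 1 reserved byte.
    flags, vni_word = struct.unpack("!B3xI", udp_payload[:8])
    if not flags & VXLAN_FLAG_VNI_VALID:
        return None
    return vni_word >> 8


def should_decap(udp_dst_port: int, udp_payload: bytes, wanted_vni: int) -> bool:
    """True if the packet matches the decap rule for wanted_vni."""
    return udp_dst_port == VXLAN_UDP_PORT and vxlan_vni(udp_payload) == wanted_vni


header = bytes([VXLAN_FLAG_VNI_VALID, 0, 0, 0]) + (42 << 8).to_bytes(4, "big")
matched = should_decap(4789, header + b"inner frame", 42)   # True
```
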
> +
> +If this all seems complicated, just remember the 'golden rule' of TC offload:
> +the hardware should ensure the same final results as if the packets were
> +processed through the slow path, traversed software TC and were transmitted or
> +received through the representor netdevices.
> +
> +Configuring the representee's MAC
> +---------------------------------
> +
> +The representee's link state is controlled through the representor.  Setting the
> +representor administratively UP or DOWN should cause carrier ON or OFF at the
> +representee.
> +
> +Setting an MTU on the representor should cause that same MTU to be reported to
> +the representee.
> +(On hardware that allows configuring separate and distinct MTU and MRU values,
> +the representor MTU should correspond to the representee's MRU and vice-versa.)

Why worry about that?

> +Currently there is no way to use the representor to set the station permanent
> +MAC address of the representee; other methods available to do this include:
> +
> + - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
> + - devlink port function (see **devlink-port(8)** and
> +   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
> diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
> index f1f4e6a85a29..21e80c8e661b 100644
> --- a/Documentation/networking/switchdev.rst
> +++ b/Documentation/networking/switchdev.rst
> @@ -1,5 +1,6 @@
>  .. SPDX-License-Identifier: GPL-2.0
>  .. include:: <isonum.txt>
> +.. _switchdev:
>  
>  ===============================================
>  Ethernet switch device driver model (switchdev)