netdev - RE: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and other) Representors

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <PH0PR12MB548104B7D8CC614060B0DD98DC6C9@PH0PR12MB5481.namprd12.prod.outlook.com>
Date:   Fri, 19 Aug 2022 18:59:37 +0000
From:   Parav Pandit <parav@...dia.com>
To:     Edward Cree <ecree.xilinx@...il.com>,
        "ecree@...inx.com" <ecree@...inx.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-net-drivers@....com" <linux-net-drivers@....com>
CC:     "davem@...emloft.net" <davem@...emloft.net>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "corbet@....net" <corbet@....net>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "jacob.e.keller@...el.com" <jacob.e.keller@...el.com>,
        "jesse.brandeburg@...el.com" <jesse.brandeburg@...el.com>,
        "michael.chan@...adcom.com" <michael.chan@...adcom.com>,
        "andy@...yhouse.net" <andy@...yhouse.net>,
        "saeed@...nel.org" <saeed@...nel.org>,
        "jiri@...nulli.us" <jiri@...nulli.us>,
        "snelson@...sando.io" <snelson@...sando.io>,
        "simon.horman@...igine.com" <simon.horman@...igine.com>,
        "alexander.duyck@...il.com" <alexander.duyck@...il.com>,
        "rdunlap@...radead.org" <rdunlap@...radead.org>
Subject: RE: [RFC PATCH v2 net-next] docs: net: add an explanation of VF (and
 other) Representors


> From: Edward Cree <ecree.xilinx@...il.com>
> Sent: Friday, August 19, 2022 12:36 PM
> 
> On 18/08/2022 17:44, Parav Pandit wrote:
> > A _whole_ network function is represented today using a. netdevice
> > represents representee's network port
> >
> > b. devlink port function for function management
> 
> So, at the moment I'm just trying to document the current consensus,  but
> where I plan to go _after_ this doc is building the case that  devlink port as it
> exists today mixes in too much networking  configuration that really belongs
> in the representor.  
Representor netdevice has its own mac address that can be used to build switching/routing services.

> The example  that motivated this for me is that setting
> the MAC address of the  representee is currently a devlink port function
> operation, but  this has nothing to do with the PCIe function and everything
> to do  with the network port, so logically it should be an operation on  the
> representor.  (I intend to develop a patch making it such, once  we're all on
> the same page.)

Devlink port function is the "function representor".
Life cycle and attributes of this function are controlled using this representor object.

We always wanted a "function representor" out of the devlink port but unfortunately that didn’t happen in past.

A mac address is set to the PCI function. It is a L2 address. Hence it is defined as hw_addr and not as mac.

I didn’t fully understand "network port" above. It is the network port of the function, right?

> 
> I think a general rule is — would this operation make sense on a non-
> networking SR-IOV device?  If not, then it shouldn't be in devlink  port.  E.g.
> why is port splitting a devlink port operation and not  an operation on the
> port representor netdev?
> 
It is applicable to networking non-SR-IOV device such as physical ports, PFs, SFs.

But I resonate with you, would a "devlink function object" as independent entity make sense?
Absolutely yes to me.
Should this devlink function object to be linked to a devlink port and netdevice representor? Yes.

In such case should we take devlink out of the net/devlink.c so that non networking functions can also enjoy?
May be yes.
Good discussion to have.

> > s/master PF/switchdev function
> 
> switchdev function might actually be the best name suggestion yet.
> I like it.
> 
> > Please add text that,
> > Packets transmitted by the representee and when they are not offloaded,
> such packets are delivered to the port representor netdevice.
> 
> That's exactly what
> >> packets
> >> +   transmitted to the representee which fail to match any switching
> >> + rule
> >> should
> >> +   be received on the representor netdevice.
> says.  (Although my choice of preposition — 'to', rather than 'by'
>  — was less than clear.)
> 
Yeah.
'to' is confusing. 
'by' is clear that it is sent by the representee.

> >> +What functions should have a representor?
> >> +-----------------------------------------
> >> +
> >> +Essentially, for each virtual port on the device's internal switch,
> >
> > ^^^^ You probably wanted to say master PF internal switch here.
> >
> > Better to word it as, each virtual port of a switchdev, there should be...
> 
> Hmm idk, I feel like "switchdev" has the connotation of "the software  object
> inside the kernel representing the switch" rather than "the  switch itself".
> 
True, but at kernel level, such device is exposed as switchdev object.
So referring to kernel object looks more aligned with the rest documentation.

> >> + - Other PFs on the local PCIe controller, and any VFs belonging to them.
> > Local and/or external PCIe controllers.
> That's literally the next bullet point.
> 
> >> + - PFs and VFs on other PCIe controllers on the device (e.g. for any
> >> embedded
> >> +   System-on-Chip within the SmartNIC).
> Do I need to use the word "external" to make it more obvious?
> 
'other' is vague.
Yes, external makes it obvious that reader can refer to in other part of the kernel documentation.

> >> + - PFs and VFs with other personalities, including network block
> >> + devices
> >> (such
> >> +   as a vDPA virtio-blk PF backed by remote/distributed storage), if their
> >> +   network access is implemented through a virtual switch port.
> >> +   Note that such functions can require a representor despite the
> >> representee
> >> +   not having a netdev.
> > This looks a big undertaking to represent them via "netdevice".
> > Mostly they cannot be well represented by the netdevice.
> 
> The netdevice isn't supposed to represent the vDPA block device.  Rather  it
> represents the switch port that the block device is using.
> 
Above text was written under a question " What functions should have a representor".
So should a vdpa function need a (switch) representor? Ans = no.
Should a vdpa function need a backend/control representor? Ans = yes.
This document is about (switch) representor. Hence it is better to avoid this confusion.

> > In some cases, such vDPA devices are affiliated to the switchdev, but they
> use one or multiple of its ports.
> 
> If the block device uses multiple switch ports, then it should have  multiple
> representors, one for each port, so that each switch port can  be configured
> in the standard way.
> 
Sure. There is a big difference between 'using representors' vs 'having representors'.
You wanted to say that they use switch representors.

If you want to say a 'function representor' than it doesn’t really belong to this .rst documentation.
It is the discussion that’s happening right now. :)

> Configuration of the block device itself is of course through separate
> interfaces which are common to non-switchdev virtual block devices.
> 
> >> + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they
> have
> >> +   their own port on the switch (as opposed to using their parent PF's
> port).
> > Not sure why the text has _if_ for SF and not for the VF.
> > Do you see a SF device in the kernel that doesn't have their own port, due
> to which there is _if_ added?
> 
> This document is meant to cover situations that vendors are likely to  find
> themselves in, not just those that have already been encountered.
> It is plausible, at least to me, that a vendor might decide to implement
> subfunctions at a filtering rather than a switching level (i.e. it's  just a bundle
> of queue pairs and you use something like ethtool NFC to  direct traffic to it).
> And if that happens, I don't want them to read  my doc and (wrongly) think
> that they still need reprs for such SFs.
That would be a very different abstraction. It is not a function.
A user might be better of with netdev with bunch of queues.
There were similar macvlan acceleration in past, if not mistaken in the Intel driver.

> (The corresponding situation is far less likely to arise for VFs,  because there's
> a clear understanding across the industry that VFs  should look to their
> consumer like self-contained network devices,  which implies transparent
> switching.)
We don’t see SFs being any different in the kernel as some day when platform is ready SF will have to have self-contained too to map to the guest VM.
A subfunction may not be a network device at all, even though today in Linux kernel we see it only for network function.

> 
> >> +How are representors created?
> >> +-----------------------------
> >> +
> >> +The driver instance attached to the master PF should enumerate the
> >> +virtual ports on the switch, and for each representee, create a
> >> +pure-software netdevice which has some form of in-kernel reference
> >> +to the PF's own netdevice or driver private data (``netdev_priv()``).
> > Today a user can create new virtual ports. Hence, these port represnetors
> and function representors are created dynamically without enumeration.
> > Please add text describing both ways.
> 
> Again, this is addressed in the next sentence after you quoted:
> >> +If switch ports can dynamically appear/disappear, the PF driver
> >> +should create and destroy representors appropriately.
> 
For the reader 'appear/disappear' misses the details if it can be created or not.
Today all the three ways of port addition to switch is done in the kernel.
1. enumerate and add
2. appear through event -> add/delete
3. created by user -> add/delete

So I suggest we write it as,

The driver instance attached to the switchdev should create representor netdevices for the each function whose traffic flow through the switchdev.


> > For mlx5 case a representor netdevice has real queue from which tx/rx
> DMA happens from the device to/from network.
> > It is not entirely pure software per say.
> > Hence, "pure-software" is misleading. Please drop that word.
> 
> The rep dev doesn't own the BAR.  Everything it has it gets from  the PF.
> That's why it shouldn't SET_NETDEV_DEV, which is what I  mean by "pure-
> software".
> 
It accesses the switchdev PF, including the BAR and interrupts and hw queues.
It is no different than uplink or physical port representor.
Setting netdevice parent is not decided based on what type of wire they talk to.

> >> +The operations of the representor netdevice will generally involve
> >> +acting through the master PF.  For example, ``ndo_start_xmit()``
> >> +might send the packet through a hardware TX queue attached to the
> >> +master PF, with either packet metadata or queue configuration
> >> +marking it for delivery
> >> to the representee.
> > Sharing/not sharing TX and RX queue among representor netdevices is not
> yet well established.
> 
> But in either case the hw TXQ will have been created out of the  PF's BAR(s)
> (there's no other PCIe function/aperture to poke at  the hardware from),
> that's what I mean by "attached to".  If you  have a clearer way to word that
> I'm all ears.
> 
I think we should just way,

Representor netdevices access the switchdev device resources and queues they are attached to.

> >> +
> >> +How are representors identified?
> >> +--------------------------------
> >> +
> >> +The representor netdevice should *not* directly refer to a PCIe device
> (e.g.
> >> +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of
> >> +the representee or of the master PF.
> > This isn't true.
> > Representor netdevices are connected to the switchdev device PCI
> function.
> 
> In some but not all existing drivers.
> Note that I said "should not", not "does not".
> 
'Should not' is a guidance to new drivers.
And for end users it will be nightmare to maintain per vendor udev plumbing in systemd.


> > Without linking to PCI device, udev scriptology needs to grep among
> thousands of netdevices and its very inefficient.
> 
> It's a control plane operation, is efficiency really a prime  concern?  
Yes. In production systems that we use, users create these functions at msec granularity and udev scripts find it hard to walk through.

>If so,
> surely the right thing is to give  /sys/class/net/$REP_DEV/ a suitably-named
> symlink to  /sys/class/net/$SWITCH_DEV, performing the same role as the
> phys_switch_id matching without the global search, rather than  a
> semantically-invalid PCIe device link.
> 
This plumbing is already done through /sys/class/net/REP_NETDEV/device pointing to the parent device uniformly regardless of whom they represent.
Be it a physical port or the virtual port.

> >> +There are as yet no established conventions for naming representors
> >> +which do not correspond to PCIe functions (e.g. accelerators and
> plugins).
> > Netdevice represents the networking port of the function.
> 
> No, it represents any networking port on the switch.  Whether  that has a
> PCIe function or not.
Sure, in context of the subject of this email I said function. :)
It can be physical port of other types of ports of switchev.

> (The doc title, "Network Function Representors", is deliberately  phrased to
> be interpretable as about "network functions" in the  sense of NFV, rather
> than "networking PCIe functions".  
Do we have non networking PCI functions through representors in the kernel today?
If not, better not to generalize them.
When they appear, a new devlink port flavour will show up and it will be perfect to extend this doc to capture that.

> An  entire network function (in the NFV
> sense) could be implemented  in hardware, in which case it would have a
> switch port and thus  representor, but no PCIe function — it terminates
> traffic inside  the device rather than sending it over the PCIe bus to a driver
> in the host or a VM.)

This is very interesting new type of network function.
Do you mean a MMIO kind of function instead of PCI function?
If so, yeah, it will need a switch representation and new devlink port flavour type.