Message-ID: <20240410075745.4637c537@kernel.org>
Date: Wed, 10 Apr 2024 07:57:45 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
 Jiri Pirko <jiri@...nulli.us>, Madhu Chittim <madhu.chittim@...el.com>,
 Sridhar Samudrala <sridhar.samudrala@...el.com>
Subject: Re: [RFC] HW TX Rate Limiting Driver API

On Wed, 10 Apr 2024 10:33:50 +0200 Paolo Abeni wrote:
> On Tue, 2024-04-09 at 15:32 -0700, Jakub Kicinski wrote:
> > On Fri, 5 Apr 2024 11:23:13 +0100 Simon Horman wrote:  
> > > /**
> > >  * enum shaper_lookup_mode - Lookup method used to access a shaper
> > >  * @SHAPER_LOOKUP_BY_PORT: The root shaper for the whole H/W, @id is unused
> > >  * @SHAPER_LOOKUP_BY_NETDEV: The main shaper for the given network device,
> > >  *                           @id is unused
> > >  * @SHAPER_LOOKUP_BY_VF: @id is a virtual function number.
> > >  * @SHAPER_LOOKUP_BY_QUEUE: @id is a queue identifier.
> > >  * @SHAPER_LOOKUP_BY_TREE_ID: @id is the unique shaper identifier inside the
> > >  *                            shaper hierarchy as in shaper_info.id
> > >  *
> > >  * SHAPER_LOOKUP_BY_PORT, SHAPER_LOOKUP_BY_VF and SHAPER_LOOKUP_BY_TREE_ID
> > >  * are only available on PF devices, usually inside the host/hypervisor.
> > >  * SHAPER_LOOKUP_BY_NETDEV is available on both PF and VF devices, but on
> > >  * VFs only if they are privileged.  
> > 
> > The privileged part is an implementation limitation. Limiting oneself
> > or one's own subqueues should not require elevated privileges, in
> > principle, right?  
> 
> The reference to privileged functions is here to try to ensure proper
> isolation when required.
> 
> E.g. let's suppose the admin on the host wants to restrict/limit
> the B/W for a given VF (the VF itself, not the representor! See below
> WRT shaper_lookup_mode) to some rate; he/she likely additionally
> intends to prevent the guest from relaxing that setting by configuring
> a higher rate on the guest device.

1) a representor is just a control path manifestation of the VF;
   any offload on the representor is for the VF - this is how forwarding
   works, and this is how "switchdev qdisc offload" works

2) it's a hierarchy, so we can delegate one layer to the VF and the layer
   under that to the eswitch manager. The VF can set its limit to inf,
   but the eswitch layer should still enforce its own limit.
   The driver implementation may constrain the number of layers,
   delegation or form of the hierarchy, sure, but as I said, that's an
   implementation limitation (which reminds me -- remember to add extack
   to the "write" ops, e.g. as sketched below)

> > > enum shaper_lookup_mode {
> > >     SHAPER_LOOKUP_BY_PORT,
> > >     SHAPER_LOOKUP_BY_NETDEV,
> > >     SHAPER_LOOKUP_BY_VF,
> > >     SHAPER_LOOKUP_BY_QUEUE,
> > >     SHAPER_LOOKUP_BY_TREE_ID,  
> > 
> > Two questions.
> > 
> > 1) are these looking up real nodes or some special kind of node which
> > can't have settings assigned directly? 
> > IOW if I want to rate limit 
> > the port do I get + set the port object or create a node above it and
> > link it up?  
> 
> There is no concept of such a special kind of node. Probably the above
> enum needs a better/clearer definition of each element.
> How to reach a specific configuration for the port shaper depends on
> the NIC defaults - whatever hierarchy it provides/creates at
> initialization time. 
> 
> The NIC/firmware can either provide a default shaper for the port level
> - in such a case the user/admin just needs to set it. Otherwise the
> user/admin will have to create the shaper and link it.

What's the default limit for a port? Line rate?
I understand that the scheduler can't be "disabled", but that doesn't mean
we must expose "noop" schedulers as if the user had created them.

Let me step back. The goal, AFAIU, was to create an internal API into
which we can render existing APIs. We can combine settings from mqprio,
sysfs, etc. and create the desired abstract hierarchy to present to
the driver. That's doable; a rough sketch of what I mean follows below.
Transforming an arbitrary pre-existing driver hierarchy to achieve what
the uAPI wants.. would be an NP-hard problem, no?
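
Roughly like this - everything here except shaper_info.id is invented
by me for illustration (bw_max, parent_id and the helpers are not in
the RFC):

	/* core-built abstract node for one queue, rendered from
	 * mqprio's per-TC max_rate; field names are made up */
	struct shaper_info q_node = {
		.id        = alloc_tree_id(),	/* invented allocator */
		.parent_id = netdev_node_id,	/* invented field */
		.bw_max    = mqprio_max_rate[tc],
	};
	/* the core would then hand the assembled hierarchy to the driver */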

> I guess the first option should be the more common/expected one.
> 
> This proposal allows both cases.
> 
> > Or do those special nodes not exit implicitly (from the example it
> > seems like they do?)  

s/exit/exist/
 
> Could you please re-phrase the above?

Basically, whether a dump of the hierarchy is empty at the start.

> > 2) the objects themselves seem rather different. I'm guessing we need
> > port/netdev/vf because either some users don't want to use switchdev
> > (vf = repr netdev) or drivers don't implement it "correctly" (port !=
> > netdev?!)?  
> 
> Yes, the nodes inside the hierarchy can be linked to very different
> objects. The different lookup modes are there just to provide easy
> access to the relevant objects.
> 
> Likely a much better description is needed: 'port' here is really the
> cable plug level, 'netdev' refers to the Linux network device. There
> could be multiple netdevs for the same port, as 'netdev' could refer
> to either a PF or a VF. Finally, VF is really the virtual function
> device, not the representor, so that the host can configure/limit the
> guest tx rate.
> 
> The same shaper can be reached/looked-up with different ids.
> 
> e.g. the device level shaper for a virtual function can be reached
> with:
> 
> - SHAPER_LOOKUP_BY_TREE_ID + unique tree id (every node is reachable
> this way) from any host device in the same hierarchy
> - SHAPER_LOOKUP_BY_VF + virtual function id, from the host PF device
> - SHAPER_LOOKUP_BY_NETDEV, from the guest VF device
> 
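To make sure I read that right, in terms of the get op quoted below,
all three of these should return the very same node (ids made up):

	ops->get(vf_netdev, SHAPER_LOOKUP_BY_NETDEV, 0, &info, extack);
	ops->get(pf_netdev, SHAPER_LOOKUP_BY_VF, vf_id, &info, extack);
	ops->get(pf_netdev, SHAPER_LOOKUP_BY_TREE_ID, info.id, &info, extack);
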
> > Also feels like some of these are root nodes, some are leaf nodes?  
> 
> There is a single root node (the port's parent), possibly many internal
> nodes (port, netdev, vf, possibly more intermediate levels depending on
> the actual configuration [e.g. the firmware or the admin could create
> 'device groups' or 'queue groups']) and likely many leaf nodes (queue
> level).
> 
> My personal take is that from an API point of view differentiating
> between leaves and internal nodes makes the API more complex with no
> practical advantage for the API users.

If you allow them to be configured.

What if we consider the netdev/queue nodes as "exit points" of the tree,
to which a layer of actual scheduling nodes can be attached, rather
than having them do any scheduling themselves? Sketched below.
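
In code it could look something like this (all names invented here):

	/* exit points anchor the tree to a queue or netdev but carry
	 * no scheduling settings of their own */
	enum shaper_node_kind {
		SHAPER_NODE_SCHED,	/* RR/SP/rate limit, configurable */
		SHAPER_NODE_EXIT,	/* queue/netdev attach point, read-only */
	};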

> > > 	int (*get)(struct net_device *dev,
> > > 		   enum shaper_lookup_mode lookup_mode, u32 id,
> > > 		   struct shaper_info *shaper,
> > > 		   struct netlink_ext_ack *extack);  
> > 
> > How about we store the hierarchy in the core?
> > Assume the core has the source of truth, so there's no need for a get?  
> 
> One design choice was _not_ duplicating the hierarchy in the core: the
> full hierarchy is maintained only by the NIC/firmware. The NIC/firmware
> can perform some changes "automatically" e.g. when adding or deleting
> VFs or queues it will likely create/delete the paired shaper. The
> synchronization looks cumbersome and error prone.

The core only needs to maintain the hierarchy of what's in its domain of
control.

> The hierarchy could be exposed to both the host and the guests; I guess
> only the host core could be the source of truth, right?

I guess you're saying that the VF configuration may be implemented by asking
the PF driver to perform the actions? Perhaps we can let the driver allocate
and maintain its own hierarchy, but yes, we don't gain much from holding
it in the core.

This is the picture in my mind:

PF / eswitch domain

              Q0 ]-> [50G] | RR | >-[ PF netdev = pf repr ] - |
              Q1 ]-> [50G] |    |                             |
                                                              |
-----------------------------------------                     |
VF1 domain                               |                    |
                                         |                    |
     Q0 ]-> | RR | - [35G] >-[ VF netdev x vf repr ]-> [50G]- | RR | >- port
     Q1 ]-> |    |                       |                    |
                                         |                    |
-----------------------------------------                     |
                                                              |
-------------------------------------------                   |
VF2 domain                                 |                  |
                                           |                  |
     Q0 ]-> [100G] -> |0 SP | >-[ VF net.. x vf r ]-> [50G] - |
     Q1 ]-> [200G] -> |1    |              |                  |
-------------------------------------------

In this contrived example VF1 limited itself to 35G.
VF2 limited its queues to 100G and 200G (ignored, since the eswitch limit
is lower) and set strict priority between the queues.
The PF limits each of its queues, and each VF, to 50G; there is no rate
limit on the port.

"x" means we cross domains, "=" purely splices one hierarchy with another.

The hierarchy for netdevs always starts with a queue and ends in a netdev.
The hierarchy for the eswitch has netdevs at both ends (the hierarchy is
shared by all netdevs with the same switchdev id).

If the eswitch implementation is not capable of having a proper repr for
the PF, the PF queues feed directly into the port.

The final RR node may be implicit (if the hierarchy has loose ends, they
are assumed to RR at the last possible point before egress).
