Message-ID: <de5bc3a7180fdc42a58df56fd5527c4955fd0978.camel@redhat.com>
Date: Thu, 11 Apr 2024 17:58:59 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Simon Horman <horms@...nel.org>, netdev@...r.kernel.org, Jiri Pirko
 <jiri@...nulli.us>, Madhu Chittim <madhu.chittim@...el.com>, Sridhar
 Samudrala <sridhar.samudrala@...el.com>
Subject: Re: [RFC] HW TX Rate Limiting Driver API

On Wed, 2024-04-10 at 07:57 -0700, Jakub Kicinski wrote:
> On Wed, 10 Apr 2024 10:33:50 +0200 Paolo Abeni wrote:
> > 
> > The reference to privileged functions is here to try to ensure proper
> > isolation when required.
> > 
> > E.g. let's suppose the admin on the host wants to restrict/limit
> > the B/W for a given VF (the VF itself, not the representor! See below
> > WRT shaper_lookup_mode) to some rate; he/she likely also intends to
> > prevent the guest from relaxing that setting by configuring such a
> > rate on the guest device.
> 
> 1) representor is just a control path manifestation of the VF
>    any offload on the representor is for the VF - this is how forwarding
>    works, this is how "switchdev qdisc offload" works
> 
> 2) it's a hierarchy, we can delegate one layer to the VF and the layer
>    under that to the eswitch manager. The VF can set its limit to inf
>    but the eswitch layer should still enforce its limit.
>    The driver implementation may constrain the number of layers,
>    delegation or form of the hierarchy, sure, but as I said, that's an
>    implementation limitation (which reminds me -- remember to add extack 
>    to the "write" ops)
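[For illustration only, a purely hypothetical example of what such a
"write" op carrying extack could look like, mirroring the get()
signature quoted further down; none of this is in the RFC as posted:

	int (*set)(struct net_device *dev,
		   enum shaper_lookup_mode lookup_mode, u32 id,
		   const struct shaper_info *shaper,
		   struct netlink_ext_ack *extack);
]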
> 
> > > > enum shaper_lookup_mode {
> > > >     SHAPER_LOOKUP_BY_PORT,
> > > >     SHAPER_LOOKUP_BY_NETDEV,
> > > >     SHAPER_LOOKUP_BY_VF,
> > > >     SHAPER_LOOKUP_BY_QUEUE,
> > > >     SHAPER_LOOKUP_BY_TREE_ID,  
> > > 
> > > Two questions.
> > > 
> > > 1) are these looking up real nodes or some special kind of node which
> > > can't have settings assigned directly? 
> > > IOW if I want to rate limit 
> > > the port do I get + set the port object or create a node above it and
> > > link it up?  
> > 
> > There is no concept of such a special kind of node. Probably the above
> > enum needs a better/clearer definition of each element.
> > How to reach a specific configuration for the port shaper depends on
> > the NIC defaults - whatever hierarchy it provides/creates at
> > initialization time. 
> > 
> > The NIC/firmware can either provide a default shaper for the port level
> > - in that case the user/admin just needs to set it. Otherwise the
> > user/admin will have to create the shaper and link it.
> 
> What's a default limit for a port? Line rate?
> I understand that the scheduler can't be "disabled", but that doesn't mean
> we must expose "noop" schedulers as if the user had created them.
> 
> Let me step back. The goal, AFAIU, was to create an internal API into
> which we can render existing APIs. We can combine settings from mqprio
> and sysfs etc. and create a desired abstract hierarchy to present to 
> the driver. That's doable.
> Transforming an arbitrary pre-existing driver hierarchy to achieve what
> the uAPI wants... would be an NP-hard problem, no?
> 
> > I guess the first option should be more common/the expected one.
> > 
> > This proposal allows both cases.
> > 
> > > Or do those special nodes not exit implicitly (from the example it
> > > seems like they do?)  
> 
> s/exit/exist/
>  
> > Could you please re-phrase the above?
> 
> Basically, whether a dump of the hierarchy is empty at the start.
> 
> > > 2) the objects themselves seem rather different. I'm guessing we need
> > > port/netdev/vf because either some users don't want to use switchdev
> > > (vf = repr netdev) or drivers don't implement it "correctly" (port !=
> > > netdev?!)?  
> > 
> > Yes, the nodes inside the hierarchy can be linked to very different
> > objects. The different lookup modes are there just to provide easy
> > access to the relevant objects.
> > 
> > Likely a much better description is needed: 'port' here is really the
> > cable plug level, 'netdev' refers to the Linux network device. There
> > could be multiple netdevs for the same port, as 'netdev' could refer
> > to either a PF or a VF. Finally, VF is really the virtual function
> > device, not the representor, so that the host can configure/limit the
> > guest tx rate.
> > 
> > The same shaper can be reached/looked-up with different ids.
> > 
> > e.g. the device level shaper for a virtual function can be reached
> > with:
> > 
> > - SHAPER_LOOKUP_BY_TREE_ID + unique tree id (every node is reachable
> > this way) from any host device in the same hierarchy
> > - SHAPER_LOOKUP_BY_VF + virtual function id, from the host PF device
> > - SHAPER_LOOKUP_BY_NETDEV, from the guest VF device
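[For illustration only, the three lookups above could translate into
get() calls roughly like this; the enum values and the get() signature
are from the RFC, everything else - the variable names, the pf_ops/vf_ops
handles standing for the shaper ops of the host PF and guest VF devices,
and the assumption that the id is unused for the netdev-level lookup - is
hypothetical:

	struct shaper_info info;
	struct netlink_ext_ack extack = {};
	int err;

	/* By unique tree id, usable from any host device in the hierarchy. */
	err = pf_ops->get(pf_dev, SHAPER_LOOKUP_BY_TREE_ID, tree_id,
			  &info, &extack);

	/* By virtual function id, from the host PF device. */
	err = pf_ops->get(pf_dev, SHAPER_LOOKUP_BY_VF, vf_id,
			  &info, &extack);

	/* From the guest VF device itself (assuming no id is needed for
	 * the netdev-level lookup).
	 */
	err = vf_ops->get(vf_dev, SHAPER_LOOKUP_BY_NETDEV, 0,
			  &info, &extack);
]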
> > 
> > > Also feels like some of these are root nodes, some are leaf nodes?  
> > 
> > There is a single root node (the port's parent), possibly many internal
> > nodes (port, netdev, vf, possibly more intermediate levels depending on
> > the actual configuration [e.g. the firmware or the admin could create
> > 'device groups' or 'queue groups']) and likely many leaf nodes (queue
> > level).
> > 
> > My personal take is that from an API point of view differentiating
> > between leaves and internal nodes makes the API more complex with no
> > practical advantage for the API users.
> 
> If you allow them to be configured.
> 
> What if we consider netdev/queue node as "exit points" of the tree, 
> to which a layer of actual scheduling nodes can be attached, rather 
> than doing scheduling by themselves?
> 
> > > > 	int (*get)(struct net_device *dev,
> > > > 		   enum shaper_lookup_mode lookup_mode, u32 id,
> > > >                    struct shaper_info *shaper, struct netlink_ext_ack *extack);  
> > > 
> > > How about we store the hierarchy in the core?
> > > Assume core has the source of truth, no need to get?  
> > 
> > One design choice was _not_ duplicating the hierarchy in the core: the
> > full hierarchy is maintained only by the NIC/firmware. The NIC/firmware
> > can perform some changes "automatically" e.g. when adding or deleting
> > VFs or queues it will likely create/delete the paired shaper. The
> > synchronization looks cumbersome and error prone.
> 
> The core only needs to maintain the hierarchy of what's in its domain of
> control.
> 
> > The hierarchy could be exposed to both host and guests; I guess only
> > the host core could be the source of truth, right?
> 
> I guess you're saying that the VF configuration may be implemented by asking 
> the PF driver to perform the actions? Perhaps we can let the driver allocate
> and maintain its own hierarchy but yes, we don't have much gain from holding
> that in the core.
> 
> This is the picture in my mind:
> 
> PF / eswitch domain
> 
>               Q0 ]-> [50G] | RR | >-[ PF netdev = pf repr ] - |
>               Q1 ]-> [50G] |    |                             |
>                                                               |
> -----------------------------------------                     |
> VF1 domain                               |                    |
>                                          |                    |
>      Q0 ]-> | RR | - [35G] >-[ VF netdev x vf repr ]-> [50G]- | RR | >- port
>      Q1 ]-> |    |                       |                    |
>                                          |                    |
> -----------------------------------------                     |
>                                                               |
> -------------------------------------------                   |
> VF2 domain                                 |                  |
>                                            |                  |
>      Q0 ]-> [100G] -> |0 SP | >-[ VF net.. x vf r ]-> [50G] - |
>      Q1 ]-> [200G] -> |1    |              |                  |
> -------------------------------------------
> 
> In this contrived example we have VF1 which limited itself to 35G.
> VF2 limited each queue to 100G and 200G (ignored, eswitch limit is lower)
> and set strict priority between queues.
> PF limits each of its queues, and VFs to 50G, no rate limit on the port.
> 
> "x" means we cross domains, "=" purely splices one hierarchy with another.
> 
> The hierarchy for netdevs always starts with a queue and ends in a netdev.
> The hierarchy for eswitch has just netdevs at each end (hierarchy is
> shared by all netdevs with the same switchdev id).
> 
> If the eswitch implementation is not capable of having a proper repr for PFs
> the PF queues feed directly into the port.
> 
> The final RR node may be implicit (if the hierarchy has loose ends, they are
> assumed to RR at the last possible point before egress).

Let me try to wrap up all the changes suggested above:

- We need to clearly define the initial/default status (possibly no b/w
limits and all the objects on the same level doing RR)

- The hierarchy controlled by the API should show only
non-default/user-configured nodes

- We need to drop the references to privileged VFs.

- The core should maintain the full status of the user-provided
configuration changes (say, the 'delta' hierarchy; rough sketch below)
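To make the last point a bit more concrete, something like the following
is roughly what I have in mind for the core-side state; all names here
are purely illustrative and not part of the proposal:

	/* Core only tracks the user-configured "deltas"; the full
	 * hierarchy stays in the NIC/firmware.
	 */
	struct shaper_delta {
		struct list_head	list;
		enum shaper_lookup_mode	lookup_mode;
		u32			id;	/* meaning depends on lookup_mode */
		struct shaper_info	info;	/* user-requested settings */
	};

	/* e.g. hung off the relevant struct net_device */
	struct shaper_delta_list {
		struct list_head	deltas;
	};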

Am I missing something?

Also, it's not 110% clear to me what the implication of the following is:

> consider netdev/queue node as "exit points" of the tree, 
> to which a layer of actual scheduling nodes can be attached

could you please rephrase a bit?

I have the feeling that the points above should not require significant
changes to the API defined here, mainly clearer documentation, but
I'll have a better look.

Thanks!

Paolo 

