netdev - Re: [RFC] HW TX Rate Limiting Driver API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fc300335885790c759e565cb6712e78777da6ca9.camel@redhat.com>
Date: Fri, 05 Apr 2024 18:25:17 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: John Fastabend <john.fastabend@...il.com>, Simon Horman
 <horms@...nel.org>,  netdev@...r.kernel.org
Cc: Jakub Kicinski <kuba@...nel.org>, Jiri Pirko <jiri@...nulli.us>, Madhu
 Chittim <madhu.chittim@...el.com>, Sridhar Samudrala
 <sridhar.samudrala@...el.com>, Jamal Hadi Salim <jhs@...atatu.com>
Subject: Re: [RFC] HW TX Rate Limiting Driver API

On Fri, 2024-04-05 at 07:34 -0700, John Fastabend wrote:
> Simon Horman wrote:
> > This is follow-up to the ongoing discussion started by Intel to extend the
> > support for TX shaping H/W offload [1].
> > 
> > The goal is allowing the user-space to configure TX shaping offload on a
> > per-queue basis with min guaranteed B/W, max B/W limit and burst size on a
> > VF device.
> > 
> > 
> > In the past few months several different solutions were attempted and
> > discussed, without finding a perfect fit:
> > 
> > - devlink_rate APIs are not appropriate for to control TX shaping on netdevs
> > - No existing TC qdisc offload covers the required feature set
> > - HTB does not allow direct queue configuration
> > - MQPRIO imposes constraint on the maximum number of TX queues
> 
> iirc the TX queue limitation was somewhat artifical. Probably we could
> extend it.

Indeed MQPRIO was one of the preferred candidates, but removing the
limits is blocked by uAPI constraints.

Even ignoring the uAPI problem, it will requires some offload change to
get full support of the features required here.

In practice that will be quite similar to creating a new offload type,
and will not be 'future proof'.

Since here we are adding a new offload type, the idea is to try to
cover the capabilities of exiting (and possibly foreseeable) H/W to
avoid being in the same situation some time from now.

Being 'future proof' is one of the requirements (sorry, not stated in
this thread yet) that materialized in the past discussion.

> > - TBF does not support max B/W limit
> > - ndo_set_tx_maxrate() only controls the max B/W limit
> > 
> > A new H/W offload API is needed, but offload API proliferation should be
> > avoided.
> 
> Did you consider the dcbnl_rtnl_ops? These have the advantage of being
> implemented in intel, mlx, amd, broadcom and a few more drivers. There
> is an IEEE spec as well fwiw.
> 
> This has a getmaxrate, setmaxrate API. Adding a getminrate and setminrate
> would be relatively straightforward. The spec defines how to do WRR and
> NICs support it.

It would be similar to ndo_set_tx_maxrate(). Extending it would be
simple, and is implemented by some different vendors already.

But we will face the same points mentioned above: will be quite similar
to creating a new offload type, and will not be 'future proof'.

> > The following proposal intends to cover the above specified requirement and
> > provide a possible base to unify all the shaping offload APIs mentioned above.
> > 
> > The following only defines the in-kernel interface between the core and
> > drivers. The intention is to expose the feature to user-space via Netlink.
> > Hopefully the latter part should be straight-forward after agreement
> > on the in-kernel interface.
> > 
> > All feedback and comment is more then welcome!
> > 
> > [1] https://lore.kernel.org/netdev/20230808015734.1060525-1-wenjun1.wu@intel.com/
> > 
> > Regards,
> > Simon with much assistance from Paolo
> 
> Cool to see some hw offloads.

:)

The planning phase was less cool :)

> > --- 
> > /* SPDX-License-Identifier: GPL-2.0-or-later */
> > 
> > #ifndef _NET_SHAPER_H_
> > #define _NET_SHAPER_H_
> > 
> > /**
> >  * enum shaper_metric - the metric of the shaper
> >  * @SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
> >  * @SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> >  */
> > enum shaper_metric {
> > 	SHAPER_METRIC_PPS;
> > 	SHAPER_METRIC_BPS;
> > };
> 
> Interesting hw has a PPS limiter?

AFAIK at least the intel ice driver has this kind of support. I've been
told this is a requirement to access the telco market. I guess other
vendors will follow - assuming they don't have it already.

> 
> > 
> > #define SHAPER_ROOT_ID 0
> > #define SHAPER_NONE_ID UINT_MAX
> > 
> > /**
> >  * struct shaper_info - represent a node of the shaper hierarchy
> >  * @id: Unique identifier inside the shaper tree.
> >  * @parent_id: ID of parent shaper, or SHAPER_NONE_ID if the shaper has
> >  *             no parent. Only the root shaper has no parent.
> >  * @metric: Specify if the bw limits refers to PPS or BPS
> >  * @bw_min: Minimum guaranteed rate for this shaper
> >  * @bw_max: Maximum peak bw allowed for this shaper
> >  * @burst: Maximum burst for the peek rate of this shaper
> >  * @priority: Scheduling priority for this shaper
> >  * @weight: Scheduling weight for this shaper
> >  *
> >  * The full shaper hierarchy is maintained only by the
> >  * NIC driver (or firmware), possibly in a NIC-specific format
> >  * and/or in H/W tables.
> 
> Is the hw actually implementing hierarchy? For some reason I
> imagined different scopes, PF, VF, QueueSet, Queue. And if it
> has more hierarchy maybe FW is just translating this? IF that
> is the case perhaps instead of hierarchy we expose a
> 
>   enum hw_rate_limit_scope scope
> 
> and
> 
>   enum hw_rate_limit_scope {
>      SHAPER_LOOKUP_BY_PORT,
>      SHAPER_LOOKUP_BY_NETDEV,
>      SHAPER_LOOKUP_BY_VF,
>      SHAPER_LOOKUP_BY_QUEUE_SET,
>      SHAPER_LOOKUP_BY_QUEUE,
>   }
> 
> My preference is almost always to push logic out of firmware
> and into OS if possible.

I think this actually overlap with is described below? this API will
allow access by the above 'scopes' (except by QUEUE_SET, but we can add
it). See shaper_lookup_mode below.

> 
> >  * The kernel uses this representation and the shaper_ops to
> >  * access, traverse, and update it.
> >  */
> > struct shaper_info {
> > 	/* The following fields allow the full traversal of the whole
> > 	 * hierarchy.
> > 	 */
> > 	u32 id;
> > 	u32 parent_id;
> > 
> > 	/* The following fields define the behavior of the shaper. */
> > 	enum shaper_metric metric;
> > 	u64 bw_min;
> > 	u64 bw_max;
> > 	u32 burst;
> 
> Any details on how burst is implemented in HW?

I guess I could describe some specific H/W implementation, but would
that be too restrictive? what about just:

	maximum bw_max burst in bytes

or the like?

> 
> > 	u32 priority;
> 
> What is priority?

It's the strict scheduling priority, within the relevant group/set. 

> 
> > 	u32 weight;
> 
> Weight to me is a reference to a WRR algorithm? Is there some other
> notion here?

Yes, as in WRR.

The general idea is to try to expose as much features as possible from
the existing H/W.

[...]

> > 	/* set - Update the specified shaper, if it exists
> > 	 * @dev: Netdevice to operate on.
> > 	 * @lookup_mode: How to perform the shaper lookup
> > 	 * @id: ID of the specified shaper,
> > 	 *      relative to the specified @access_type.
> > 	 * @shaper: Configuration of shaper.
> > 	 * @extack: Netlink extended ACK for reporting errors.
> > 	 *
> > 	 * Configure the parameters of @shaper according to values supplied
> > 	 * in the following fields:
> > 	 * * @shaper.metric
> > 	 * * @shaper.bw_min
> > 	 * * @shaper.bw_max
> > 	 * * @shaper.burst
> > 	 * * @shaper.priority
> > 	 * * @shaper.weight
> > 	 * Values supplied in other fields of @shaper must be zero and,
> > 	 * other than verifying that, are ignored.
> > 	 *
> > 	 * Return:
> > 	 * * %0 - Success
> > 	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > 	 *		    or core for any reason. @extack should be set
> > 	 *		    to text describing the reason.
> > 	 * * Other negative error values on failure.
> > 	 */
> > 	int (*set)(struct net_device *dev,
> > 		   enum shaper_lookup_mode lookup_mode, u32 id,
> > 		   const struct shaper_info *shaper,
> > 		   struct netlink_ext_ack *extack);
> > 
> > 	/* Move - change the parent id of the specified shaper
> > 	 * @dev: netdevice to operate on.
> > 	 * @lookup_mode: how to perform the shaper lookup
> > 	 * @id: ID of the specified shaper,
> > 	 *                      relative to the specified @access_mode.
> > 	 * @new_parent_id: new ID of the parent shapers,
> > 	 *                      always relative to the SHAPER_LOOKUP_BY_TREE_ID
> > 	 *                      lookup mode
> > 	 * @extack: Netlink extended ACK for reporting errors.
> > 	 *
> > 	 * Move the specified shaper in the hierarchy replacing its
> > 	 * current parent shaper with @new_parent_id
> > 	 *
> > 	 * Return:
> > 	 * * %0 - Success
> > 	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > 	 *		    or core for any reason. @extack should be set
> > 	 *		    to text describing the reason.
> > 	 * * Other negative error values on failure.
> > 	 */
> 
> Some heavy firmware or onchip CPU managing this move operation? Is this
> a reset operation as well? Is there a need for an atomic move like this
> in initial API? Maybe start with just set and push down an entire config
> in one hit. If hw can really move things around dynamically I think it
> would make sense though.

This 'move' ops was a last minute addendum. It's used by the example
below, useful in not trivial scenario. Reading the datasheet 'ice'
supports it, and I think even mlx5 - guessing on devlink_rate api.

It can be dropped or added later, without it the queue group example
below will need some other additional op.

I guess h/w not supporting it (or posing some kind of constraint) could
error out with some descriptive extack.

> > 	int (*move)(struct net_device *dev,
> > 		    enum shaper_lookup_mode lookup_mode, u32 id,
> > 		    u32 new_parent_id, struct netlink_ext_ack *extack);
> > 
> > 	/* add - Add a shaper inside the shaper hierarchy
> > 	 * @dev: netdevice to operate on.
> > 	 * @shaper: configuration of shaper.
> > 	 * @extack: Netlink extended ACK for reporting errors.
> > 	 *
> > 	 * @shaper.id must be set to SHAPER_NONE_ID as
> > 	 * the id for the shaper will be automatically allocated.
> > 	 * @shaper.parent_id determines where inside the shaper's tree
> > 	 * this node is inserted.
> > 	 *
> > 	 * Return:
> > 	 * * non-negative shaper id on success
> > 	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > 	 *		    or core for any reason. @extack should be set
> > 	 *		    to text describing the reason.
> > 	 * * Other negative error values on failure.
> > 	 *
> > 	 * Examples or reasons this operation may fail include:
> > 	 * * H/W resources limits.
> > 	 * * The parent is a ‘leaf’ node - attached to a queue.
> > 	 * * Can’t respect the requested bw limits.
> > 	 */
> > 	int (*add)(struct net_device *dev, const struct shaper_info *shaper,
> > 		   struct netlink_ext_ack *extack);
> > 
> > 	/* delete - Add a shaper inside the shaper hierarchy
> > 	 * @dev: netdevice to operate on.
> > 	 * @lookup_mode: how to perform the shaper lookup
> > 	 * @id: ID of the specified shaper,
> > 	 *                      relative to the specified @access_type.
> > 	 * @shaper: Object to return the deleted shaper configuration.
> > 	 *              Ignored if NULL.
> > 	 * @extack: Netlink extended ACK for reporting errors.
> > 	 *
> > 	 * Return:
> > 	 * * %0 - Success
> > 	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > 	 *		    or core for any reason. @extack should be set
> > 	 *		    to text describing the reason.
> > 	 * * Other negative error values on failure.
> > 	 */
> > 	int (*delete)(struct net_device *dev,
> > 		      enum shaper_lookup_mode lookup_mode,
> > 		      u32 id, struct shaper_info *shaper,
> > 		      struct netlink_ext_ack *extack);
> 
> One thought I have about exposing hierarchy like this is user will need 
> to have nic user manual in hand to navigate hardware limitation I presume.
> If hw is this flexible than lets do something. But, this is why I started
> to think more about scopes (nic, pf, vf, queueSet, queue) than arbitrary
> hierarchy. Perhaps this is going to target some DPU though with lots of
> flexibility here.

I fear we were too terse describing 'enum shaper_lookup_mode
lookup_mode/id access'. I think, it actually allows the scope lookup
you are looking for. And even the full hierarchy access - if the H/W
support it. 

My understanding is that at least intel and mellanox already support
this kind of features, and likely marvell.

> > };
> > 
> > /*
> >  * Examples:
> >  * - set shaping on a given queue
> >  *   struct shaper_info info = { // fill this };
> >  *   dev->shaper_ops->set(dev, SHAPER_LOOKUP_BY_QUEUE, queue_id, &info, NULL);
> >  *
> >  * - create a queue group with a queue group shaping limits.
> >  *   Assuming the following topology already exists:
> >  *                    < netdev shaper >
> >  *                      /           \
> >  *         <queue 0 shaper> . . .  <queue N shaper>
> >  *
> >  *   struct shaper_info pinfo, ginfo;
> >  *   dev->shaper_ops->get(dev, SHAPER_LOOKUP_BY_NETDEV, 0, &pinfo);
> >  *
> >  *   ginfo.parent_id = pinfo.id;
> >  *   // fill-in other shaper params...
> >  *   new_node_id = dev->shaper_ops->add(dev, &ginfo);
> >  *
> >  *   // now topology is:
> >  *   //                  <    netdev shaper    >
> >  *   //                  /            |        \
> >  *   //                 /             |        <newly created shaper>
> >  *   //                /              |
> >  *   // <queue 0 shaper> . . . <queue N shaper>
> 
> I think we need a queue set shaper as part of this for larger bw limits.

I'm sorry, I'm unsure I understand the above, could you please re-
phrase?

In this example the 'newly created shaper' is a 'queue set' shaper.


Thanks for the feedback!

Paolo