Message-ID: <d9831d0c940a7b77419abe7c7330e822bbfd1cfb.camel@nvidia.com>
Date: Thu, 6 Mar 2025 14:03:54 +0000
From: Cosmin Ratiu <cratiu@...dia.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC: "horms@...nel.org" <horms@...nel.org>, "andrew+netdev@...n.ch"
<andrew+netdev@...n.ch>, "davem@...emloft.net" <davem@...emloft.net>, Tariq
Toukan <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>, "jiri@...nulli.us"
<jiri@...nulli.us>, Leon Romanovsky <leonro@...dia.com>,
"edumazet@...gle.com" <edumazet@...gle.com>, "kuba@...nel.org"
<kuba@...nel.org>, Saeed Mahameed <saeedm@...dia.com>, Carolina Jubran
<cjubran@...dia.com>, "pabeni@...hat.com" <pabeni@...hat.com>
Subject: net-shapers plan
Hello,
This (long) email presents a plan agreed with Simon and Paolo for
extending net-shapers with use cases currently serviced by devlink-
rate. The goal is to get net-shapers to feature parity with devlink-
rate so that the amount of code dedicated to traffic shaping in the
kernel could eventually be reduced significantly.
This is in response to Jakub's concerns raised in [3] and [4].
Context
-------
devlink-rate ([1]) can control traffic shaping for a VF / VF group and
is currently implemented by the Intel ice and NVIDIA mlx5 drivers. It
operates either on devlink ports (for VF rates) or on devlink objects
(for group rates). Rate objects are owned by the devlink object.
net-shapers ([2]) is a recently added API for shaping traffic for a
netdev tx queue / queue group / entire netdev. It is more granular than
devlink-rate but cannot currently control shaping for groups of
netdevs. It operates on netdev handles and stores the shaping hierarchy
in the netdevice.
[3] & [4] add support to devlink-rate for traffic-class shaping, that
is, configuring the shaping hierarchy in hardware to control how much
bandwidth each traffic class gets. The question is how to represent
traffic classes in net-shapers.
In [5], Jiri expressed a desire to eventually convert devlink-rate to
net-shapers.
Finally, in [6] I sent an update outlining a snapshot of discussions
that took place trying to figure things out.
Putting these pieces together, the following plan takes shape.
Plan, in short
--------------
1. Extend the net-shapers hierarchy with the ability to define 8 traffic
class roots for a net device instead of a single root like today. There
is no need for a new scope; the NETDEV scope with a different id to
differentiate TCs should be enough.
This is needed to allow backpressure from the hierarchy to the txq
level and proper TC selection.
The goal is to have either a hierarchy like today, with one
netdev-level root containing nodes and txqs as leaves, or a TC-enabled
hierarchy with 8 roots (one for each traffic class), again with nodes
and txqs as leaves.
2. Extend the semantics of NET_SHAPER_SCOPE_NODE to be able to group
multiple netdevs, similar to devlink-rate nodes.
3. Add a new DEVLINK binding type for the hierarchy, to be able to
represent netdev groups. That part of the hierarchy would be stored in
the devlink object instead of the netdev. This allows separation
between the VM and the hypervisor parts of the hierarchy.
These together should make net-shapers a strict superset of devlink-
rate and would allow the devlink-rate implementation to be converted to
net-shapers. It also allows traffic shaping to be operated
independently from a VM (limited to its own VF/netdev) and from the
hypervisor (which can rate limit traffic classes and groups of VFs,
like devlink-rate).
Plan, in detail
---------------
1. Packet classification
It is outside the scope of net-shapers, but it's worth talking about
it.
Packet classification is done based on one of:
a. The DSCP value carried in the TOS field of the IP header,
b. The VLAN priority in the VLAN header (known as PCP), or
c. Arbitrary rules based on DPI (not standard, but possible).
Classification means labeling a packet with a traffic class based on
the above rules, then treating packets with different traffic classes
differently during tx processing.
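As a rough illustration of (a) and (b), the sketch below pulls the DSCP
value out of the TOS byte and the PCP value out of the VLAN TCI, then
maps each to a traffic class. The helper names and both mapping policies
are made up for this example; real mappings are configured per device
and are not part of net-shapers.

```python
# Illustrative only: helper names and mapping policies are assumptions.

def dscp_from_tos(tos: int) -> int:
    """DSCP is the upper 6 bits of the 8-bit TOS byte."""
    return (tos & 0xFF) >> 2

def pcp_from_tci(tci: int) -> int:
    """PCP is the upper 3 bits of the 16-bit VLAN TCI."""
    return (tci & 0xFFFF) >> 13

def tc_from_dscp(dscp: int) -> int:
    """Hypothetical policy: use the top 3 bits of DSCP as the TC."""
    return dscp >> 3

def tc_from_pcp(pcp: int) -> int:
    """Hypothetical policy: map PCP 1:1 to a TC."""
    return pcp

# TOS 0xB8 carries DSCP 46, which this policy places in TC 5.
print(dscp_from_tos(0xB8), tc_from_dscp(dscp_from_tos(0xB8)))  # 46 5
# TCI 0xE064 carries PCP 7 (and VLAN ID 100).
print(pcp_from_tci(0xE064), tc_from_pcp(pcp_from_tci(0xE064)))  # 7 7
```

Either policy yields a TC in the range 0..7, which is all the later tx
stages need to know about the packet.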
The first moment when classification matters is when choosing a txq.
Since the goal is to be able to treat different traffic classes
differently, it is necessary to have each txq output only a single
traffic class. If that condition doesn't hold, a txq sending a mixture
of traffic classes might suffer from head-of-line blocking. Imagine,
for example, a txq on which low-volume, high-priority TC 7 traffic is
sent alongside high-volume, low-priority TC 0 traffic.
Backpressure on TC 0 from further up the shaping hierarchy would only
be able to manifest itself by blocking the entire txq, affecting both
traffic classes.
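The head-of-line blocking problem can be shown with a toy model. Here a
txq is a FIFO of packets, each packet is represented only by its TC
number, and backpressure is modeled as a set of blocked TCs. None of
this corresponds to real kernel or driver code; it is only a sketch of
the reasoning above.

```python
from collections import deque

def drain(txq, blocked_tcs):
    """Dequeue in FIFO order until the head packet belongs to a blocked TC."""
    sent = []
    while txq and txq[0] not in blocked_tcs:
        sent.append(txq.popleft())
    return sent

# Shared txq: TC 7 packets are stuck behind a blocked TC 0 packet.
shared = deque([0, 7, 0, 7])
sent_shared = drain(shared, blocked_tcs={0})

# One txq per TC: blocking TC 0 leaves the TC 7 txq unaffected.
txq_tc0 = deque([0, 0])
txq_tc7 = deque([7, 7])
sent_split = drain(txq_tc0, {0}) + drain(txq_tc7, {0})

print(sent_shared)  # []
print(sent_split)   # [7, 7]
```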
It is not important which entity (kernel or hw) classifies packets as
long as the condition that a given txq only sends traffic for a single
traffic class holds.
2. New net-shapers netdev TC roots
A new netdev TC root would therefore logically identify a disjoint
subset of txqs that service that TC. The union of all 8 roots would
encompass all device txqs.
The primary reason to define separate roots for each TC is that
backpressure from the hierarchy on one of the traffic classes must not
affect other traffic classes, meaning only txqs servicing the blocked
traffic class should be affected.
Furthermore, this cannot be done by simply grouping txqs for a given TC
with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
known to the kernel and might only be known to the driver or the NIC.
With the new roots, net-shapers can relay the intent to shape traffic
for a particular TC to the driver without having knowledge of which
txqs service a TC. The association between txqs and TCs they service
doesn't need to be known to the kernel.
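To make the division of knowledge concrete, here is a small model of
what relaying the intent could look like. The Handle type loosely
mirrors the (scope, id) handles net-shapers uses, with the NETDEV scope
id reused as the TC number as proposed above. ToyDriver, its
set_shaper() method and the txq layout are hypothetical and exist only
for this sketch.

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    NETDEV = "netdev"
    QUEUE = "queue"
    NODE = "node"

@dataclass(frozen=True)
class Handle:
    scope: Scope
    id: int  # for NETDEV scope in this sketch: the TC number, 0..7

class ToyDriver:
    """Hypothetical driver: only it knows which txqs service which TC."""

    def __init__(self, tc_to_txqs):
        self.tc_to_txqs = tc_to_txqs
        self.root_bw_max_mbps = {}

    def set_shaper(self, handle, bw_max_mbps):
        if handle.scope is not Scope.NETDEV or not 0 <= handle.id < 8:
            raise ValueError("expected one of the 8 TC roots")
        # The caller only names the TC root; the driver resolves the txqs.
        self.root_bw_max_mbps[handle.id] = bw_max_mbps
        return sorted(self.tc_to_txqs.get(handle.id, ()))

drv = ToyDriver({0: [0, 1, 2], 7: [3]})
affected = drv.set_shaper(Handle(Scope.NETDEV, 0), 1000)
print(affected)  # [0, 1, 2]
```

The caller never passes a txq list; txq 3, which services TC 7 in this
made-up layout, is untouched by the TC 0 limit.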
3. Extend NODE scope to group multiple netdevs and new DEVLINK binding
Today, all net-shapers objects are owned by a netdevice. Who should own
a net shaper that represents a group of netdevices? It needs to be a
stable object that isn't affected by group membership changes and
therefore cannot be any netdev from the group. The only sensible option
would be to pick an object corresponding to the eswitch to own such
groups, which neatly corresponds to the devlink object today.
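The ownership argument can be sketched as follows. The Devlink class and
its methods are invented for this example and do not correspond to any
existing API; the point is only that the group and its settings live in
an object that outlives any individual member.

```python
class Devlink:
    """Hypothetical stand-in for the devlink object owning group shapers."""

    def __init__(self):
        # group name -> {"bw_max_mbps": int, "members": set of netdev names}
        self.groups = {}

    def add_group(self, name, bw_max_mbps):
        self.groups[name] = {"bw_max_mbps": bw_max_mbps, "members": set()}

    def attach(self, name, netdev):
        self.groups[name]["members"].add(netdev)

    def detach(self, name, netdev):
        self.groups[name]["members"].discard(netdev)

dl = Devlink()
dl.add_group("tenant-a", 5000)
dl.attach("tenant-a", "vf0")
dl.attach("tenant-a", "vf1")

# Every member can leave and the group shaper keeps its configuration.
dl.detach("tenant-a", "vf0")
dl.detach("tenant-a", "vf1")
print(dl.groups["tenant-a"]["bw_max_mbps"])  # 5000
```

Had the group been stored in vf0 instead, removing vf0 would have taken
the group with it.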
4. VM/hypervisor considerations
There was a great deal of discussion about the split of shaping
responsibilities between the VM and the hypervisor. With devlink today,
the shaping hierarchy and traffic class bw split are decided entirely
by the hypervisor; the VMs have no influence on shaping.
But net-shapers has more precise granularity for shaping at queue
level, so perhaps there are valid use cases for allowing VMs to control
their part of the hierarchy. In the end, what we think makes sense is
this model:
VMs can control the shaping of txqs, queue groups and the VFs they own.
On top of that, the hypervisor can take the netdev root of the VM
hierarchy and plug it into its own hierarchy, imposing additional
constraints. The VM has no influence on that. So, for example, the VM
can decide that "my VF should be limited to 10Gbps", but the hypervisor
can then add another shaping node saying "that VF is limited to 1Gbps",
and the latter should be the limit.
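One way to read this example, assuming rate limits simply stack along
the path from the VF up through the hypervisor nodes, is that the
tightest limit on the path wins. The function name below is made up.

```python
def effective_limit_gbps(limits_along_path):
    """Assumption: stacked rate limits compose as the minimum on the path."""
    return min(limits_along_path)

vm_vf_limit_gbps = 10    # set by the VM on its own VF
hypervisor_limit_gbps = 1  # added by the hypervisor above the VM root

print(effective_limit_gbps([vm_vf_limit_gbps, hypervisor_limit_gbps]))  # 1
```

Under the same assumption, a VM that limits itself below the hypervisor
cap still gets its own, lower limit.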
With traffic classes, the VM can send out TC-labeled traffic on
different txqs, but the hypervisor decides whether to take the VM TC
roots and group them in an arbiter node (i.e. a shaping node
arbitrating between different traffic classes), or to group TC roots
from multiple VMs before applying arbitration settings. This is similar
to devlink-rate today. The VM itself should have no control over TC
bandwidth settings.
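As a simplified picture of what an arbiter node does, the sketch below
splits a total bandwidth between the 8 TC roots using percentage
shares. The function name and the static split are assumptions made for
illustration; actual arbitration is done in hardware and may
redistribute unused bandwidth.

```python
def split_bandwidth(total_mbps, tc_shares_percent):
    """Split total_mbps across 8 TCs by integer percentage shares."""
    if len(tc_shares_percent) != 8 or sum(tc_shares_percent) != 100:
        raise ValueError("expected 8 shares summing to 100")
    return [total_mbps * share // 100 for share in tc_shares_percent]

# Hypervisor-chosen shares for TC 0..7; the VM has no say in these.
per_tc = split_bandwidth(10_000, [20, 0, 0, 0, 0, 30, 0, 50])
print(per_tc)  # [2000, 0, 0, 0, 0, 3000, 0, 5000]
```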
Cosmin.
[1] https://man7.org/linux/man-pages/man8/devlink-rate.8.html
[2] https://lore.kernel.org/netdev/cover.1728460186.git.pabeni@redhat.com/
[3] https://lore.kernel.org/netdev/20241206181345.3eccfca4@kernel.org/
[4] https://lore.kernel.org/netdev/20250209101716.112774-1-tariqt@nvidia.com/
[5] https://lore.kernel.org/netdev/ZwP8OWtMfCH0_ikc@nanopsycho.orion/
[6] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/