Message-ID: <5fff8229-4aaa-44d0-9068-fa5d3f268345@nvidia.com>
Date: Wed, 12 Mar 2025 13:02:13 +0200
From: Carolina Jubran <cjubran@...dia.com>
To: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>,
 Cosmin Ratiu <cratiu@...dia.com>,
 "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Cc: "horms@...nel.org" <horms@...nel.org>,
 "andrew+netdev@...n.ch" <andrew+netdev@...n.ch>,
 "davem@...emloft.net" <davem@...emloft.net>, Tariq Toukan
 <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>,
 "jiri@...nulli.us" <jiri@...nulli.us>, Leon Romanovsky <leonro@...dia.com>,
 "edumazet@...gle.com" <edumazet@...gle.com>,
 "kuba@...nel.org" <kuba@...nel.org>, Saeed Mahameed <saeedm@...dia.com>,
 "pabeni@...hat.com" <pabeni@...hat.com>,
 Madhu Chittim <madhu.chittim@...el.com>, "Zaki, Ahmed" <ahmed.zaki@...el.com>
Subject: Re: net-shapers plan



On 11/03/2025 3:42, Samudrala, Sridhar wrote:
> 
> 
> On 3/6/2025 6:03 AM, Cosmin Ratiu wrote:
>> Hello,
>>
>> This (long) email presents a plan agreed with Simon and Paolo for
>> extending net-shapers with use cases currently serviced by devlink-
>> rate. The goal is to get net-shapers to feature parity with devlink-
>> rate so that the amount of code dedicated to traffic shaping in the
>> kernel could eventually be reduced significantly.
>>
>> This is in response to Jakub's concerns raised in [3] and [4].
>>
>> Context
>> -------
>> devlink-rate ([1]) can control traffic shaping for a VF / VF group and
>> is currently implemented by the Intel ice and NVIDIA mlx5 drivers. It
>> operates either on devlink ports (for VF rates) or on devlink objects
>> (for group rates). Rate objects are owned by the devlink object.
>>
>> net-shapers ([2]) is a recently added API for shaping traffic for a
>> netdev tx queue / queue group / entire netdev. It is more granular than
>> devlink-rate but cannot currently control shaping for groups of
>> netdevs. It operates with netdev handles and stores the shaping
>> hierarchy in the netdevice.
>>
>> [3] & [4] add support to devlink-rate for traffic-class shaping, which
>> is controlling the shaping hierarchy in hardware to control the
>> bandwidth allocation different traffic classes get. The question is how
>> to represent traffic classes in net-shapers.
>>
>> In [5], Jiri expressed a desire to eventually convert devlink-rate to
>> net-shapers.
>>
>> Finally, in [6] I sent an update outlining a snapshot of discussions
>> that took place trying to figure things out.
>>
>> Putting these pieces together, the following plan takes shape.
>>
>> Plan, in short
>> --------------
>> 1. Extend net-shapers hierarchy with the ability to define 8 traffic
>> class roots for a net device instead of a single root like today. There
>> is no need for a new scope, the NETDEV scope with a different id to
>> differentiate TCs should be enough.
>> This is needed to allow backpressure from the hierarchy to the txq
>> level and proper TC selection.
>>
>> The goal is to either have a hierarchy like today, with one netdev-
>> level root containing nodes and leaves being txqs or to have a TC-
>> enabled hierarchy with 8 roots (one for each traffic class), with nodes
>> and txqs as leaves.
>>
>> 2. Extend the semantics of NET_SHAPER_SCOPE_NODE to be able to group
>> multiple netdevs, similar to devlink-rate nodes.
>>
>> 3. Add a new DEVLINK binding type for the hierarchy, to be able to
>> represent netdev groups. That part of the hierarchy would be stored in
>> the devlink object instead of the netdev. This allows separation
>> between the VM and the hypervisor parts of the hierarchy.
>>
>> These together should make net-shapers a strict superset of devlink-
>> rate and would allow the devlink-rate implementation to be converted to
>> net-shapers. It allows independently operating traffic shaping from a
>> VM (limited to its own VF/netdev) and from the hypervisor (being able
>> to rate limit traffic classes and groups of VFs, like devlink-rate).
>>
>> Plan, in detail
>> ---------------
>> 1. Packet classification
>> It is outside the scope of net-shapers, but it's worth talking about
>> it.
>> Packet classification is done based on either:
>> a. TOS field in the IP header (known as DSCP) or
>> b. VLAN priority in the VLAN header (known as PCP).
>> c. Arbitrary rules based on DPI (not standard, but possible).
>>
>> Classification means labeling a packet with a traffic class based on
>> the above rules, then treating packets with different traffic classes
>> differently during tx processing.
>>
>> The first moment when classification matters is when choosing a txq.
>> Since the goal is to be able to treat different traffic classes
>> differently, it is necessary to have a txq only output a single traffic
>> class. If that condition doesn't hold, a txq sending a mixture of
>> traffic classes might suffer from head-of-line blocking. Imagine a
>> scenario with a txq on which low volume high priority TC 7 for example
>> is sent alongside high volume low priority TC 0.
>> Backpressure on TC 0 from further up the shaping hierarchy would only
>> be able to manifest itself by blocking the entire txq, affecting both
>> traffic classes.
>>
>> It is not important which entity (kernel or hw) classifies packets as
>> long as the condition that a given txq only sends traffic for a single
>> traffic class holds.
>>
>> 2. New net-shapers netdev TC roots
>> A new netdev TC root would therefore logically identify a disjoint
>> subset of txqs that service that TC. The union of all 8 roots would
>> encompass all device txqs.
> 
> Are these TC roots configured on the VF/SF netdev? OR are these on the 
> corresponding Port representor netdevs?
It depends. If the user wants to achieve TC bandwidth allocation inside
the VF, these TC roots are configured on the VF netdev. If the goal is
TC bandwidth allocation on an intermediate node that groups multiple
devlink ports, the configuration happens on the devlink port.
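
To make the TC root invariant from the plan concrete (each root covers
a disjoint subset of txqs, and the 8 roots together cover all device
txqs), here is a minimal sketch. It is only a model: check_tc_roots and
the txq numbering are hypothetical and are not part of the net-shapers
API.

```python
# Hypothetical model, not kernel code: each TC root owns a set of txq indices.
NUM_TCS = 8


def check_tc_roots(tc_to_txqs, num_txqs):
    """Return True if the 8 TC roots partition txqs 0..num_txqs-1."""
    if len(tc_to_txqs) != NUM_TCS:
        return False
    seen = set()
    for tc in range(NUM_TCS):
        txqs = tc_to_txqs.get(tc)
        if txqs is None:
            return False
        txqs = set(txqs)
        if seen & txqs:
            # A txq serving two traffic classes risks head-of-line blocking.
            return False
        seen |= txqs
    # The union of all roots must cover every device txq.
    return seen == set(range(num_txqs))


# 16 txqs, two per traffic class: a valid TC-enabled layout.
roots = {tc: {2 * tc, 2 * tc + 1} for tc in range(NUM_TCS)}

# txq 0 assigned to both TC 0 and TC 7: invalid.
mixed = dict(roots)
mixed[7] = {0, 15}
```

In this model check_tc_roots(roots, 16) holds and
check_tc_roots(mixed, 16) does not. In practice the txq-to-TC mapping
may be known only to the driver or the NIC, so a check like this would
live there rather than in the core.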

> 
>>
>> The primary reason to define separate roots for each TC is that
>> backpressure from the hierarchy on one of the traffic classes needs to
>> not affect other traffic classes, meaning only txqs servicing the
>> blocked traffic class should be affected.
>>
>> Furthermore, this cannot be done by simply grouping txqs for a given TC
>> with NET_SHAPER_SCOPE_NODE, because the TC for a txq is not always
>> known to the kernel and might only be known to the driver or the NIC.
>> With the new roots, net-shapers can relay the intent to shape traffic
>> for a particular TC to the driver without having knowledge of which
>> txqs service a TC. The association between txqs and TCs they service
>> doesn't need to be known to the kernel.
>>
>> 3. Extend NODE scope to group multiple netdevs and new DEVLINK binding
>> Today, all net-shapers objects are owned by a netdevice. Who should own
>> a net shaper that represents a group of netdevices? It needs to be a
>> stable object that isn't affected by group membership changes and
>> therefore cannot be any netdev from the group. The only sensible option
>> would be to pick an object corresponding to the eswitch to own such
>> groups, which neatly corresponds to the devlink object today.
> 
> When you are referring to grouping multiple netdevs, I am assuming these 
> are port representor netdevs. Is this correct?
> 
Not quite: what gets grouped here are devlink ports, not netdevices.

>>
>> 4. VM/hypervisor considerations
>> A great deal of discussion happened about the split of shaping
>> responsibilities between the VM and the hypervisor. With devlink today,
>> the shaping hierarchy and traffic class bw split is decided entirely by
>> the hypervisor, the VMs have no influence on shaping.
>>
>> But net-shapers has more precise granularity for shaping at queue
>> level, so perhaps there are valid use cases for allowing VMs to control
>> their part of the hierarchy. In the end, what we think makes sense is
>> this model:
>>
>> VMs can control the shaping of txqs, queue groups and the VFs they own.
>> On top of that, the hypervisor can take the netdev root of the VM
>> hierarchy and plug it into its own hierarchy, imposing additional
>> constraints. The VM has no influence on that. So for example the VM can
>> decide that "my VF should be limited to 10Gbps", but the hypervisor can
>> then add another shaping node saying "that VF is limited to 1Gbps" and
>> the latter should be the limit.
> 
> Isn't it sufficient to enable rate limit at a VF/SF's queue or queue- 
> group granularity from the VF/SF netdev? The hypervisor should be able 
> to rate limit at VF granularity.

You can do that; it depends on the requirement. If two VFs share the
same physical NIC, one VF could fully utilize the link. To prevent
that, you should limit each function from the outside (on the
hypervisor). Otherwise nothing enforces a fair split of the link
capacity between the VFs.
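
As a rough illustration of why the outside limit matters, assume the
rate a VF can reach is the minimum of the link speed and every cap on
its path to the root. effective_rate and the 25 Gbps link speed are
assumptions made up for this sketch; the 10 Gbps and 1 Gbps caps are
the ones from the example quoted above.

```python
# Hypothetical model: the tightest cap on the path to the root wins.
def effective_rate(link_gbps, caps_on_path_gbps):
    """Return the highest rate a VF can reach under the given caps."""
    return min([link_gbps, *caps_on_path_gbps])


LINK_GBPS = 25.0  # assumed link speed, for illustration only

# No cap at all: a single VF may take the whole link.
uncapped = effective_rate(LINK_GBPS, [])

# The VM limits its own VF to 10 Gbps.
vm_only = effective_rate(LINK_GBPS, [10.0])

# The hypervisor adds a 1 Gbps node on top; the VM cannot override it.
with_hypervisor = effective_rate(LINK_GBPS, [10.0, 1.0])
```

Here uncapped is 25.0, vm_only is 10.0 and with_hypervisor is 1.0. A
cap set only from inside the VF is voluntary, so fairness between VFs
has to come from the hypervisor side.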

>>
>> With traffic classes, the VM can send out tc-labeled traffic on
>> different txqs, but the hypervisor decides to take the VM TC roots and
>> group them in an arbiter node (== a shaping node arbitrating between
>> different traffic classes), or to group TC roots from multiple VMs
>> before applying arbitration settings. This is similar to devlink-rate
>> today. The VM itself should have no control over TC bandwidth settings.
> 
> 
> It is not clear if TC roots are configured by the VF driver or the PF 
> drivers supporting switchdev. Can you share an example configuration 
> with steps on how to configure hierarchical traffic shaping of VF 
> queues/TCs?
>
From the inside, the TC roots are configured by the VF driver.
From the outside, this happens on the devlink port that identifies
the eswitch port that represents the PF.
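
For the outside part, the arithmetic of an arbiter node can be sketched
as follows. This assumes the arbiter divides its rate between the 8
traffic classes by percentage shares that sum to 100, which is one
plausible policy and is not specified in this thread. split_by_tc, the
shares and the 10000 Mbps node rate are all made up for illustration.

```python
# Hypothetical model of an arbiter node on the hypervisor side.
NUM_TCS = 8


def split_by_tc(node_rate_mbps, tc_shares_pct):
    """Split an arbiter node's rate between traffic classes by percentage."""
    if len(tc_shares_pct) != NUM_TCS:
        raise ValueError("expected one share per traffic class")
    if sum(tc_shares_pct) != 100:
        raise ValueError("shares must sum to 100")
    return [node_rate_mbps * pct // 100 for pct in tc_shares_pct]


# Assumed shares: TC 0 and TC 7 get 20% each, the rest 10% each.
shares = [20, 10, 10, 10, 10, 10, 10, 20]
per_tc_mbps = split_by_tc(10000, shares)
```

With these numbers TC 0 and TC 7 each get 2000 Mbps and the other
classes get 1000 Mbps each. The VM still chooses which txq carries
which TC, but the shares themselves are set only from the hypervisor.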

>>
>> Cosmin.
>>
>> [1] https://man7.org/linux/man-pages/man8/devlink-rate.8.html
>> [2]
>> https://lore.kernel.org/netdev/cover.1728460186.git.pabeni@redhat.com/
>> [3] https://lore.kernel.org/netdev/20241206181345.3eccfca4@kernel.org/
>> [4]
>> https://lore.kernel.org/netdev/20250209101716.112774-1-tariqt@nvidia.com/
>> [5] https://lore.kernel.org/netdev/ZwP8OWtMfCH0_ikc@nanopsycho.orion/
>> [6]
>> https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@...dia.com/
>>
> 

