netdev - Re: net-shapers plan

Message-ID: <e40e5776-7c37-4b8b-853c-d4ba693a5f9d@nvidia.com>
Date: Wed, 30 Apr 2025 15:12:37 +0300
From: Carolina Jubran <cjubran@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Cosmin Ratiu <cratiu@...dia.com>,
 "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
 "horms@...nel.org" <horms@...nel.org>,
 "andrew+netdev@...n.ch" <andrew+netdev@...n.ch>,
 "davem@...emloft.net" <davem@...emloft.net>, Tariq Toukan
 <tariqt@...dia.com>, Gal Pressman <gal@...dia.com>,
 "jiri@...nulli.us" <jiri@...nulli.us>,
 "edumazet@...gle.com" <edumazet@...gle.com>,
 Saeed Mahameed <saeedm@...dia.com>, "pabeni@...hat.com" <pabeni@...hat.com>
Subject: Re: net-shapers plan



On 23/04/2025 9:50, Carolina Jubran wrote:
> 
> 
> On 14/04/2025 19:27, Jakub Kicinski wrote:
>> On Mon, 14 Apr 2025 11:27:00 +0300 Carolina Jubran wrote:
>>>> I hope you understand my concern, tho. Since you're providing the first
>>>> implementation, if the users can grow dependent on such behavior we'd
>>>> be in no position to explain later that it's just a quirk of mlx5 and
>>>> not how the API is intended to operate.
>>>
>>> Thanks for bringing this up. I want to make it clear that traffic
>>> classes must be properly matched to queues. We don’t rely on the
>>> hardware fallback behavior in mlx5. If the driver or firmware isn’t
>>> configured correctly, traffic class bandwidth control won’t work as
>>> expected — the user will suffer from constant switching of the TX queue
>>> between scheduling queues and head-of-line blocking. As a result, users
>>> shouldn’t expect reliable performance or correct bandwidth allocation.
>>> We don’t encourage configuring this without proper TX queue mapping, so
>>> users won’t grow dependent on behavior that only happens to work 
>>> without it.
>>> We tried to highlight this in the plan section discussing queue
>>> selection and head-of-line blocking: To make traffic class shaping work,
>>> we must keep traffic classes separate for each transmit queue.
>>
>> Right, my concern is more that there is no requirement for explicit
>> configuration of the queues, as long as traffic arrives silo'ed WRT
>> DSCP markings. As long as a VF sorts the traffic it does not have
>> to explicitly say (or even know) that queue A will land in TC N.
>>
> 
> Even if the VF sends DSCP marked traffic, the packet's classification 
> into a traffic class still depends on the prio-to-TC mapping set by the 
> hypervisor. Without that mapping, the hardware can't reliably classify 
> packets, and traffic may not land in the intended TC.
> 
> Overall, for traffic class separation and scheduling to work as 
> intended, the VF and hypervisor need to be in sync. The VF provides the 
> markings, but the hypervisor owns the classification logic.
> 
> The hypervisor sets up the classification mechanism; it’s up to the VFs 
> to use it correctly. Otherwise, packets will be misclassified. In a 
> virtualized setup, VFs are untrusted and don’t control classification or 
> shaping; they just select which queue to transmit on.
> 
>> BTW the classification is before all rewrites? IOW flower or any other
>> forwarding rules cannot affect scheduling?
> 
> The classification happens after forwarding actions. So yes, if the user 
> modifies DSCP or VLAN priority as part of a TC rule, that rewritten 
> value is what we use for classification and scheduling. The 
> classification reflects how the packet will look on the wire.
> 

Just to add a clarification on top of my previous reply:

The hardware does not reclassify packets. The packet's priority (from
DSCP or VLAN PCP) is mapped to a traffic class through the prio-to-TC
mapping set by the hypervisor, and that classification does not change
afterwards.
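
To make the lookup order concrete, here is a minimal sketch of the
priority-to-TC step. The table contents, the DSCP-to-priority rule
(top three DSCP bits) and all function names are assumptions chosen for
illustration; they are not mlx5 defaults or a kernel API.

```python
from typing import List

# Hypothetical prio-to-TC table owned by the hypervisor.
# Index is the packet priority (0..7), value is the traffic class.
PRIO_TO_TC: List[int] = [0, 0, 1, 1, 2, 2, 3, 3]


def prio_from_pcp(pcp: int) -> int:
    """VLAN PCP is a 3-bit field and is used directly as the priority."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be in 0..7")
    return pcp


def prio_from_dscp(dscp: int) -> int:
    """Assumed mapping for this sketch: top three bits of the 6-bit DSCP."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp >> 3


def classify(prio: int, prio_to_tc: List[int] = PRIO_TO_TC) -> int:
    """Return the traffic class for a priority; the result is final."""
    return prio_to_tc[prio]


tc_for_dscp_46 = classify(prio_from_dscp(46))
tc_for_pcp_7 = classify(prio_from_pcp(7))
```

The VF only influences the input (the marking); the table itself stays
under hypervisor control.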

What actually happens is that if the packet's traffic class differs
from the TC of the scheduler the send queue (SQ) is currently attached
to, the SQ is moved to the correct TC scheduler to maintain traffic
separation. This SQ movement does not change the packet's
classification.
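
As a rough model of that behavior (a toy sketch, not the firmware
logic; the SendQueue class and its fields are hypothetical), the SQ
follows the packet's TC while the packet's TC is only ever read:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SendQueue:
    """Toy SQ: tracks which TC scheduler it is attached to."""
    current_tc: int
    moves: int = 0

    def transmit(self, packet_tc: int) -> int:
        """Send one already-classified packet.

        If the packet's TC differs from the scheduler the SQ sits on,
        the SQ is moved. packet_tc itself is never modified.
        """
        if packet_tc != self.current_tc:
            self.current_tc = packet_tc
            self.moves += 1
        return self.current_tc


def count_moves(packet_tcs: Iterable[int], initial_tc: int) -> int:
    """Number of scheduler transitions for a stream sent on one SQ."""
    sq = SendQueue(current_tc=initial_tc)
    for tc in packet_tcs:
        sq.transmit(tc)
    return sq.moves


# One SQ carrying interleaved TC 0 / TC 1 traffic keeps switching.
mixed_moves = count_moves([0, 1, 0, 1], initial_tc=0)

# One SQ per TC never switches.
split_moves = count_moves([0, 0], initial_tc=0) + count_moves([1, 1], initial_tc=1)
```

In this model the interleaved stream causes a transition on almost
every packet, while per-TC queues cause none.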

This is necessary to avoid sending traffic through the wrong TC 
scheduler. Otherwise, packets would bypass the intended shaping 
hierarchy, and traffic isolation between classes would break.

In particular, without this queue movement, backpressure applied to one
traffic class would incorrectly stall packets from other classes,
leading to head-of-line (HOL) blocking. That is exactly the behavior we
want to prevent by keeping queues bound to a single TC.
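
The HOL effect can be illustrated with a simple FIFO model. This is an
assumption-laden sketch (a queue that stalls whenever its head packet
belongs to a paused TC), not a description of the mlx5 scheduler; the
drain() helper is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Set


def drain(queue_tcs: Iterable[int], paused: Set[int]) -> List[int]:
    """Transmit in FIFO order until the head packet's TC is paused.

    Returns the TCs of the packets that made it out.
    """
    fifo: Deque[int] = deque(queue_tcs)
    sent: List[int] = []
    while fifo and fifo[0] not in paused:
        sent.append(fifo.popleft())
    return sent


# TC 1 is under backpressure in both scenarios.
# Mixed queue: the TC 1 packet at the head stalls the TC 0 packets behind it.
mixed_sent = drain([0, 1, 0, 0], paused={1})

# Queues bound to a single TC: TC 0 traffic is unaffected.
tc0_sent = drain([0, 0, 0], paused={1})
tc1_sent = drain([1], paused={1})
```

With the mixed queue only the first TC 0 packet gets out; with
separate queues all TC 0 packets are sent and only TC 1 waits.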

So this is not a reclassification of the packet itself, but a necessary
mechanism to enforce correct scheduling and maintain class-based
isolation. Smart SQ selection improves performance by avoiding
scheduler transitions, but it is only an optimization and does not
affect classification.