Date:   Wed, 16 Dec 2020 13:47:52 +0200
From:   Maxim Mikityanskiy <maximmi@...dia.com>
To:     Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>
CC:     Maxim Mikityanskiy <maximmi@...lanox.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jiri Pirko <jiri@...nulli.us>,
        Saeed Mahameed <saeedm@...dia.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Dan Carpenter <dan.carpenter@...cle.com>,
        "Linux Kernel Network Developers" <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...dia.com>,
        Yossi Kuperman <yossiku@...dia.com>
Subject: Re: [PATCH net-next v2 2/4] sch_htb: Hierarchical QoS hardware
 offload

On 2020-12-15 18:37, Jamal Hadi Salim wrote:
> On 2020-12-14 3:30 p.m., Maxim Mikityanskiy wrote:
>> On 2020-12-14 21:35, Cong Wang wrote:
>>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy 
>>> <maximmi@...dia.com> wrote:
>>>>
>>>> On 2020-12-11 21:16, Cong Wang wrote:
>>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy 
>>>>> <maximmi@...lanox.com> wrote:
>>>>>>
> 
> 
>>>
>>> Interesting, please explain how your HTB offload still has a global rate
>>> limit and borrowing across queues?
>>
>> Sure, I will explain that.
>>
>>> I simply can't see it, all I can see
>>> is you offload HTB into each queue in ->attach(),
>>
>> In the non-offload mode, the same HTB instance would be attached to 
>> all queues. In the offload mode, HTB behaves like MQ: there is a root 
>> instance of HTB, but each queue gets a separate simple qdisc (pfifo). 
>> Only the root qdisc (HTB) gets offloaded, and when that happens, the 
>> NIC creates an object for the QoS root.
>>
>> Then all configuration changes are sent to the driver, and it issues 
>> the corresponding firmware commands to replicate the whole hierarchy 
>> in the NIC. Leaf classes correspond to queue groups (in this 
>> implementation queue groups contain only one queue, but it can be 
>> extended),
> 
> 
> FWIW, it is very valuable to be able to abstract HTB if the hardware
> can emulate it (users don't have to learn about new abstractions).

Yes, that's the reason for using an existing interface (HTB) to 
configure the feature.
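
For illustration, a minimal sketch of how the feature is driven through
the existing tc interface (eth0 and the rates are placeholders; 'offload'
is the new qdisc flag from this series):

  # Offload the root HTB qdisc to the NIC and build a small hierarchy
  tc qdisc replace dev eth0 root handle 1: htb offload
  tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit ceil 100mbit
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 30mbit ceil 100mbit
  tc class add dev eth0 parent 1:1 classid 1:11 htb rate 70mbit ceil 100mbit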

> Since you are expressing a limitation above:
> How does the user discover if they over-provisioned, i.e. the
> single-queue example above?

It comes down to CPU usage. If the core that serves the queue is busy 
sending packets 100% of the time, you need more queues. Also, if the 
user runs more than one application belonging to the same class and 
pins them to different cores, it makes sense to create more than one queue.
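
For example, a quick way to spot that (a sketch; mpstat comes from the
sysstat package, and the core-to-queue mapping depends on your IRQ
affinity):

  # Watch per-core utilization while the workload runs; a core stuck at
  # ~100% while serving a single queue suggests more queues are needed.
  mpstat -P ALL 1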

I'd like to emphasize that this is not a hard limitation. Our hardware 
and firmware support multiple queues per class. What's needed is support 
on the driver side and probably an additional parameter to tc class add 
to specify the number of queues to reserve.
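
To illustrate, such an extension could look like this (purely
hypothetical syntax; the 'queues' parameter does not exist in this
series):

  # Hypothetical: reserve 4 hardware queues for this leaf class
  tc class add dev eth0 parent 1:1 classid 1:12 htb rate 50mbit ceil 100mbit queues 4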

> If there are too many corner cases it may
> make sense to just create a new qdisc.
> 
>> and inner classes correspond to entities called TSARs.
>>
>> The information about rate limits is stored inside TSARs and queue 
>> groups. Queues know what groups they belong to, and groups and TSARs 
>> know what TSAR is their parent. A queue is picked in ndo_select_queue 
>> by looking at the classification result of clsact. So, when a packet 
>> is put onto a queue, the NIC can track the whole hierarchy and do the 
>> HTB algorithm.
>>
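
To make the classification part concrete: with the offload, filters hang
off clsact rather than HTB itself, e.g. (a sketch; the flower match and
the classid are placeholders):

  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 egress protocol ip flower ip_proto tcp dst_port 80 \
      action skbedit priority 1:10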
> 
> Same question above:
> Is there a limit to the number of classes that can be created?

Yes, the commit message of the mlx5 patch lists the limitations of our 
NICs. Basically, it's 256 leaf classes and 3 levels of hierarchy.

> IOW, if someone just created an arbitrary number of queues, do they
> get errored out if it doesn't make sense for the hardware?

The current implementation fails gracefully if the limits are exceeded: 
the tc command won't succeed, and everything rolls back to the stable 
state that existed just before that command.
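
For instance (an illustrative sketch; the exact error message depends on
the driver):

  # With the hierarchy already at the NIC's limit, one more class is rejected:
  tc class add dev eth0 parent 1:1 classid 1:300 htb rate 1mbit ceil 10mbit
  # tc exits with an error, and 'tc class show dev eth0' still reports the
  # hierarchy exactly as it was before the failed command.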

> If such limits exist, it may make sense to provide a knob to query
> (maybe ethtool)

Sounds legit, but I'm not sure what the best interface for that would 
be. Ethtool is not involved at all in this implementation, and AFAIK it 
doesn't have an existing command for similar stuff. We could hook into 
set-channels and add a new type of channel for HTB, but the semantics 
aren't very clear, because HTB queues != HTB leaf classes, and I don't 
know whether extending this interface is allowed (if so, I have more 
ideas for extending it for other purposes).
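
For context, set-channels today looks like this (real ethtool commands;
the 'htb' channel type below is only the hypothetical extension discussed
above):

  ethtool -l eth0              # query the current channel counts
  ethtool -L eth0 combined 16  # set the number of combined channels
  # Hypothetical extension: ethtool -L eth0 htb 256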

> and if such limits can be adjusted it may be worth
> looking at providing interfaces via devlink.

Not really. At the moment, there isn't a good reason to decrease the 
maximum limits. It would make sense if doing so could free up resources 
for something else, but AFAIK that's not the case now.

Thanks,
Max

> cheers,
> jamal
