Date:   Thu, 17 Dec 2020 16:24:01 +0200
From:   Maxim Mikityanskiy <maximmi@...dia.com>
To:     Cong Wang <xiyou.wangcong@...il.com>
CC:     Maxim Mikityanskiy <maximmi@...lanox.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Jiri Pirko <jiri@...nulli.us>,
        Saeed Mahameed <saeedm@...dia.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Dan Carpenter <dan.carpenter@...cle.com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...dia.com>,
        Yossi Kuperman <yossiku@...dia.com>
Subject: Re: [PATCH net-next v2 2/4] sch_htb: Hierarchical QoS hardware
 offload

On 2020-12-16 21:01, Cong Wang wrote:
> On Mon, Dec 14, 2020 at 12:30 PM Maxim Mikityanskiy <maximmi@...dia.com> wrote:
>>
>> On 2020-12-14 21:35, Cong Wang wrote:
>>> On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy <maximmi@...dia.com> wrote:
>>>>
>>>> On 2020-12-11 21:16, Cong Wang wrote:
>>>>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy <maximmi@...lanox.com> wrote:
>>>>>>
>>>>>> HTB doesn't scale well because of contention on a single lock, and it
>>>>>> also consumes CPU. This patch adds support for offloading HTB to
>>>>>> hardware that supports hierarchical rate limiting.
>>>>>>
>>>>>> This solution addresses two main problems of scaling HTB:
>>>>>>
>>>>>> 1. Contention by flow classification. Currently the filters are attached
>>>>>> to the HTB instance as follows:
>>>>>
>>>>> I do not think this is the reason, tcf_classify() has been called with RCU
>>>>> only on the ingress side for a rather long time. What contentions are you
>>>>> talking about here?
>>>>
>>>> When one attaches filters to HTB, tcf_classify is called from
>>>> htb_classify, which is called from htb_enqueue, which is called with the
>>>> root spinlock of the qdisc taken.
>>>
>>> So it has nothing to do with tcf_classify() itself... :-/
>>>
>>> [...]
>>>
>>>>> And doesn't TBF already work with mq? I mean you can attach it as
>>>>> a leaf to each mq so that the tree lock will not be shared either, but you'd
>>>>> lose the benefits of a global rate limit too.
>>>>
>>>> Yes, I'd lose not only the global rate limit, but also multi-level
>>>> hierarchical limits, which are all provided by this HTB offload - that's
>>>> why TBF is not really a replacement for this feature.
>>>
>>> Interesting, please explain how your HTB offload still has a global rate
>>> limit and borrowing across queues?
>>
>> Sure, I will explain that.
>>
>>> I simply can't see it, all I can see
>>> is you offload HTB into each queue in ->attach(),
>>
>> In the non-offload mode, the same HTB instance would be attached to all
>> queues. In the offload mode, HTB behaves like MQ: there is a root
>> instance of HTB, but each queue gets a separate simple qdisc (pfifo).
>> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC
>> creates an object for the QoS root.
> 
> Please add this to your changelog.

Similar information is already in the commit message (the paragraph 
under point 2), but I can rephrase it and elaborate on this, focusing 
on the interaction between HTB, the driver and the hardware.
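
For context, the contention discussed at the top of the thread comes from 
classification running under the root qdisc lock. A minimal userspace 
sketch of that shape (illustrative names only, not kernel code; the real 
path is htb_enqueue() -> htb_classify() -> tcf_classify() with the root 
spinlock already held):

/*
 * Illustrative only -- not kernel code.  The root qdisc lock is taken
 * before the enqueue, so the filters (classification) run under that
 * lock and every CPU transmitting through the qdisc serializes on it.
 */
#include <pthread.h>
#include <stdio.h>

struct sk_buff;                                 /* opaque packet stand-in */

static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;

static int classify_sketch(struct sk_buff *skb) /* ~ htb_classify()       */
{
        (void)skb;
        return 1;                               /* pretend classid 1:1    */
}

static void enqueue_sketch(struct sk_buff *skb) /* ~ htb_enqueue() caller */
{
        pthread_mutex_lock(&root_lock);         /* root qdisc "spinlock"  */
        (void)classify_sketch(skb);             /* filters under the lock */
        /* ... enqueue to the chosen class, dequeue, transmit ...         */
        pthread_mutex_unlock(&root_lock);
}

int main(void)
{
        enqueue_sketch(NULL);
        puts("classified and enqueued under the root lock");
        return 0;
}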

> And why is the offloaded root qdisc not visible to software? All you add to
> root HTB are pointers of direct qdisc's and a boolean, this is what I meant
> by "not reflected". I expect the hardware parameters/stats are exposed to
> software too, but I can't find any.

The hardware parameters are rate and ceil; there are no extra parameters 
that exist only in the offload mode. In the future, we may add the 
number of queues to reserve for a class - that will be configured and 
reflected in tc.
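
To make that concrete, a rough sketch of the per-class information the 
driver would need (hypothetical names, not the actual offload structures):

#include <stdint.h>

/* Hypothetical per-class parameters for the offload -- the same rate
 * and ceil that plain HTB already takes from tc, nothing offload-only.
 */
struct htb_offload_class_sketch {
        uint32_t classid;         /* tc class handle                     */
        uint32_t parent_classid;  /* parent class (or the HTB root)      */
        uint64_t rate;            /* guaranteed rate, bytes per second   */
        uint64_t ceil;            /* borrowing ceiling, bytes per second */
};

In tc terms these are just the usual rate and ceil arguments of 
'tc class add ... htb'.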

Regarding stats, unfortunately, the hardware doesn't expose any 
low-level stats such as token counts. Basic stats like packets and 
bytes are supported and exposed in tc, but they are calculated in 
software (by the per-queue qdiscs) - similarly to how we calculate 
per-TX-queue packets and bytes in software.
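
A minimal sketch of that aggregation (illustrative only; the real kernel 
code uses its own stats structures):

#include <stddef.h>
#include <stdint.h>

/* The root's packets/bytes are just the sum over the per-queue qdiscs,
 * the same way an mq-style root aggregates its children; no hardware
 * counters are involved.
 */
struct basic_stats_sketch {
        uint64_t packets;
        uint64_t bytes;
};

struct basic_stats_sketch
sum_queue_stats(const struct basic_stats_sketch *per_queue, size_t n)
{
        struct basic_stats_sketch total = { 0, 0 };

        for (size_t i = 0; i < n; i++) {
                total.packets += per_queue[i].packets;
                total.bytes   += per_queue[i].bytes;
        }
        return total;
}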

> 
>>
>> Then all configuration changes are sent to the driver, and it issues the
>> corresponding firmware commands to replicate the whole hierarchy in the
>> NIC. Leaf classes correspond to queue groups (in this implementation
>> queue groups contain only one queue, but it can be extended), and inner
>> classes correspond to entities called TSARs.
>>
>> The information about rate limits is stored inside TSARs and queue
>> groups. Queues know what groups they belong to, and groups and TSARs
>> know what TSAR is their parent. A queue is picked in ndo_select_queue by
>> looking at the classification result of clsact. So, when a packet is put
>> onto a queue, the NIC can track the whole hierarchy and do the HTB
>> algorithm.
> 
> Glad to know hardware still keeps HTB as a hierarchy.
> 
> Please also add this either to source code as comments or in your
> changelog, it is very important to understand what is done by hardware.

OK, as I said above in this email, I can rephrase the commit message, 
focusing on the details of the HTB <-> driver <-> NIC interaction and on 
what happens in the NIC. It should read better as a dedicated paragraph, 
rather than being mentioned under the contention problem.
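
For illustration, here is roughly how the hierarchy described above could 
be pictured as NIC-side objects (hypothetical names; the real firmware 
interface is different):

#include <stdint.h>

/* Inner HTB class -> TSAR (a rate-limiting arbiter in the NIC). */
struct tsar_sketch {
        struct tsar_sketch *parent;        /* NULL for the QoS root      */
        uint64_t rate;
        uint64_t ceil;
};

/* Leaf HTB class -> queue group (one queue per group for now). */
struct queue_group_sketch {
        struct tsar_sketch *parent;        /* inner class to borrow from */
        uint64_t rate;
        uint64_t ceil;
};

/* TX queue: knows its group.  The queue is picked in ndo_select_queue
 * from the clsact classification result, so when a packet lands on it,
 * the NIC can walk the whole hierarchy and run the HTB algorithm.
 */
struct tx_queue_sketch {
        struct queue_group_sketch *group;
};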

Thanks,
Max

> Thanks.
> 
