Date:   Tue, 14 Jul 2020 14:22:52 +0300
From:   Maxim Mikityanskiy <maximmi@...lanox.com>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Yossi Kuperman <yossiku@...lanox.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        John Fastabend <john.fastabend@...il.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        Dave Taht <dave.taht@...il.com>,
        Jiri Pirko <jiri@...lanox.com>,
        Rony Efraim <ronye@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: Re: [RFC PATCH] sch_htb: Hierarchical QoS hardware offload

On 2020-07-08 09:44, Cong Wang wrote:
> On Fri, Jun 26, 2020 at 3:46 AM Maxim Mikityanskiy <maximmi@...lanox.com> wrote:
>>
>> HTB doesn't scale well because of contention on a single lock, and it
>> also consumes CPU. Mellanox hardware supports hierarchical rate limiting
>> that can be leveraged by offloading the functionality of HTB.
> 
> True, essentially because it has to enforce a global rate limit with
> link sharing.
> 
> There is a proposal to add a new lockless shaping qdisc, which
> you can find on the netdev list.

Thanks for pointing that out! It's sch_ltb (lockless token bucket), right?
I see it's very recent. I'll certainly have to dig deeper to understand
all the details, but as far as I understand, LTB still has a bottleneck
of a single queue ("drain queue") processed by a single thread. What
makes a difference is that enqueue and dequeue are cheap: all algorithm
processing is taken out of these functions, and they work on per-CPU queues.

>>
>> Our solution addresses two problems of HTB:
>>
>> 1. Contention by flow classification. Currently the filters are attached
>> to the HTB instance as follows:
>>
>>      # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
>>      classid 1:10
>>
>> It's possible to move the classification to the clsact egress hook,
>> which is thread-safe and lock-free:
>>
>>      # tc filter add dev eth0 egress protocol ip flower dst_port 80
>>      action skbedit priority 1:10
>>
>> This way classification still happens in software, but the lock
>> contention is eliminated, and since classification runs before the TX
>> queue is selected, the driver can translate the class to the
>> corresponding hardware queue.
>>
>> Note that this is already compatible with non-offloaded HTB and
>> doesn't require changes to either the kernel or iproute2.
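
Side note, to illustrate that last point: once skbedit has stored the
classid in skb->priority, translating the class to a hardware queue in
the driver is just a lookup. A rough sketch of the kernel-side idea, with
made-up names (everything prefixed foo_ is invented here, this is not the
actual driver code):

    /* Sketch only: assumes a classid -> TX queue table that the driver
     * fills when the offloaded HTB leaf classes are created.
     */
    static u16 foo_select_queue(struct net_device *dev, struct sk_buff *skb,
                                struct net_device *sb_dev)
    {
            struct foo_priv *priv = netdev_priv(dev);
            u16 classid = TC_H_MIN(skb->priority); /* set by skbedit */
            int txq = foo_htb_classid_to_txq(priv, classid);

            /* Unclassified traffic falls back to the default selection. */
            if (txq < 0)
                    return netdev_pick_tx(dev, skb, sb_dev);

            return txq;
    }
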
>>
>> 2. Contention by handling packets. HTB is not multi-queue: it attaches
>> to a whole net device, and handling of all packets takes the same lock.
>> Our solution offloads the logic of HTB to the hardware and registers HTB
>> as a multi-queue qdisc, similarly to how the mq qdisc does, i.e. HTB is
>> attached to the netdev, and each queue has its own qdisc. The control
>> flow is performed by HTB: it replicates the hierarchy of classes in
>> hardware by calling callbacks of the driver. Leaf classes are represented
>> by hardware queues. The data path works as follows: a packet is
>> classified by clsact, the driver selects the hardware queue according
>> to its class, and the packet is enqueued into this queue's qdisc.
> 
> Are you sure the HTB algorithm could still work even after you
> kinda make each HTB class separated? I think they must still share
> something when they borrow bandwidth from each other. This is why I
> doubt you can simply add a ->attach() without touching the core
> algorithm.

The core algorithm is offloaded to the hardware: the NIC does all the
shaping, so all we need to do on the kernel side is put packets into
the correct hardware queues.
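
Bandwidth borrowing is also handled there: when classes are created or
changed, HTB calls driver callbacks to mirror the hierarchy, with its
rates and ceils, in the NIC scheduler, so siblings share bandwidth in
hardware rather than in the qdisc. Very roughly, such a callback could
look like the sketch below (all names are invented for illustration;
this is not the actual interface from the patch):

    /* Illustrative sketch only -- not the interface from the patch. */
    enum foo_htb_command {
            FOO_HTB_CREATE,            /* root qdisc installed as offloaded */
            FOO_HTB_LEAF_ALLOC_QUEUE,  /* new leaf class, back it by a HW queue */
            FOO_HTB_NODE_MODIFY,       /* rate/ceil of a class changed */
            FOO_HTB_DESTROY,
    };

    struct foo_htb_offload {
            enum foo_htb_command command;
            u32 classid;               /* class being created/modified */
            u32 parent_classid;        /* its parent in the hierarchy */
            u64 rate;                  /* bytes per second */
            u64 ceil;
            u16 qid;                   /* out: HW queue backing a leaf */
    };

    /* Called from the driver's .ndo_setup_tc handler. */
    static int foo_setup_tc_htb(struct foo_priv *priv,
                                struct foo_htb_offload *htb)
    {
            switch (htb->command) {
            case FOO_HTB_CREATE:
                    return foo_hw_create_root(priv);
            case FOO_HTB_LEAF_ALLOC_QUEUE:
                    /* Create a rate-limited scheduling element in the NIC
                     * and record the classid -> queue mapping used on the
                     * data path.
                     */
                    return foo_hw_alloc_leaf(priv, htb, &htb->qid);
            case FOO_HTB_NODE_MODIFY:
                    return foo_hw_set_rate(priv, htb->classid,
                                           htb->rate, htb->ceil);
            case FOO_HTB_DESTROY:
                    return foo_hw_destroy_root(priv);
            }
            return -EOPNOTSUPP;
    }

The key point is that nothing on this control path is shared with the
per-packet path: once the hierarchy lives in the NIC, each leaf is just
a regular TX queue with its own per-queue qdisc.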

I think offloading the algorithm processing could give an extra benefit
over the purely software implementation of LTB, but that is something I
need to explore (e.g., whether it is realistic to hit the drain queue
bottleneck with LTB, and how much CPU usage HTB offload can save).

Thank you for your feedback!

> Thanks.
> 
