Message-ID: <FC053E80-74C9-4884-92F1-4DBEB5F0C81A@mellanox.com>
Date:   Thu, 30 Jan 2020 16:20:38 +0000
From:   Yossi Kuperman <yossiku@...lanox.com>
To:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC:     Jamal Hadi Salim <jhs@...atatu.com>,
        Jiri Pirko <jiri@...lanox.com>,
        Rony Efraim <ronye@...lanox.com>,
        Maxim Mikityanskiy <maximmi@...lanox.com>,
        John Fastabend <john.fastabend@...il.com>,
        Eran Ben Elisha <eranbe@...lanox.com>
Subject: [RFC] Hierarchical QoS Hardware Offload (HTB)

The following is a brief outline of our plans for offloading HTB functionality.

The HTB qdisc allows you to use one physical link to simulate several slower links. This is done by configuring a hierarchical QoS tree; each tree node corresponds to a class. Filters are used to classify flows into different classes. HTB is quite flexible and versatile, but this comes at a cost: HTB does not scale well and consumes considerable CPU and memory. Our aim is to offload HTB functionality to hardware and provide the user with the flexibility and the conventional tools offered by the TC subsystem, while scaling to thousands of traffic classes and maintaining wire-speed performance.

Mellanox hardware can support hierarchical rate-limiting; rate-limiting is done per hardware queue. In our proposed solution, flow classification takes place in software. By moving the classification to the clsact egress hook, which is thread-safe and does not require locking, we avoid the contention induced by the single qdisc lock. Furthermore, clsact filters are run before the net-device’s TX queue is selected, giving the driver a chance to translate the class to the appropriate hardware queue. Please note that the user will need to configure the filters slightly differently: apply them to clsact rather than to HTB itself, and set the priority to the desired class-id.

For example, the following two filters are equivalent:
	1. tc filter add dev eth0 parent 1:0 protocol ip flower ip_proto tcp dst_port 80 classid 1:10
	2. tc filter add dev eth0 egress protocol ip flower ip_proto tcp dst_port 80 action skbedit priority 1:10

Note: supporting the above filter requires no code changes to the upstream kernel or to the iproute2 package.
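
To make the priority trick concrete: a tc class-id is a "major:minor" pair of hexadecimal 16-bit numbers packed into a single 32-bit value, and that packed value is what skbedit stores as the packet priority in filter 2. The Python sketch below only illustrates this encoding. The function names are hypothetical, the parser is simplified (it requires the colon form), and none of it is code from the kernel or iproute2.

```python
# Illustrative helpers (hypothetical names, not part of iproute2 or the kernel).
# They mirror how a "major:minor" tc class-id is packed into 32 bits:
# major in the upper 16 bits, minor in the lower 16 bits, both hexadecimal.

def classid_to_priority(classid: str) -> int:
    """Return the 32-bit value for a 'major:minor' class-id such as '1:10'."""
    if ":" not in classid:
        # Simplification: only the explicit 'major:minor' form is accepted.
        raise ValueError(f"expected 'major:minor', got {classid!r}")
    major_str, _, minor_str = classid.partition(":")
    major = int(major_str, 16) if major_str else 0
    minor = int(minor_str, 16) if minor_str else 0
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError(f"class-id out of range: {classid!r}")
    return (major << 16) | minor


def priority_to_classid(priority: int) -> str:
    """Inverse of classid_to_priority: unpack a 32-bit value to 'major:minor'."""
    return f"{priority >> 16:x}:{priority & 0xFFFF:x}"


if __name__ == "__main__":
    value = classid_to_priority("1:10")
    print(hex(value))                  # 0x10010
    print(priority_to_classid(value))  # 1:10
```

Under this encoding, a driver that sees priority 0x10010 on an outgoing packet can recover class 1:10 without any HTB code running on the data path.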

Furthermore, the most concerning aspect of the current HTB implementation is its lack of multi-queue support. All of the net-device’s TX queues point to the same HTB instance, resulting in high spin-lock contention. This contention might negate the overall performance gains expected from introducing the offload in the first place. We should modify HTB to present itself as the mq qdisc does. By default, the mq qdisc allocates a simple fifo qdisc per TX queue exposed by the lower-layer device. This applies only when hardware offload is configured; otherwise, HTB behaves as usual. With offload, there is no HTB code along the data path; the only overhead compared to regular traffic is the classification taking place at clsact. Please note that this design implies full offload, with no fallback to software; partially offloading the hierarchical tree is not trivial anyway, considering borrowing between siblings.


To summarize: for each HTB leaf class the driver will allocate a special queue and match it with a corresponding net-device TX queue (increasing real_num_tx_queues). A unique fifo qdisc will be attached to each such TX queue. Classification will still take place in software, but at the clsact egress hook rather than in HTB. This way we can scale to thousands of classes while maintaining wire-speed performance and reducing CPU overhead.
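
As a rough illustration of the bookkeeping summarized above, the toy Python model below gives each leaf class its own TX queue index and picks the queue from the packet priority set at clsact. It is a sketch, not driver code: the class name, its methods, and the handling of unclassified traffic (hashing over the regular queues) are assumptions made for the example and are not part of the proposal.

```python
# Toy model of "one dedicated TX queue per HTB leaf class".
# All names here are hypothetical; real drivers do this in C inside the kernel.

class OffloadQueueMap:
    def __init__(self, num_regular_queues: int):
        if num_regular_queues <= 0:
            raise ValueError("need at least one regular TX queue")
        self.num_regular_queues = num_regular_queues
        # Mirrors the idea of growing real_num_tx_queues per leaf class.
        self.real_num_tx_queues = num_regular_queues
        self._queue_of_class = {}  # packed class-id -> TX queue index

    def add_leaf_class(self, classid: int) -> int:
        """Allocate a new TX queue for a leaf class and return its index."""
        if classid in self._queue_of_class:
            raise ValueError(f"class {classid:#x} already has a queue")
        queue = self.real_num_tx_queues
        self._queue_of_class[classid] = queue
        self.real_num_tx_queues += 1
        return queue

    def select_queue(self, skb_priority: int, flow_hash: int) -> int:
        """Pick a TX queue from the priority written by the clsact filter."""
        queue = self._queue_of_class.get(skb_priority)
        if queue is not None:
            return queue
        # Assumption for this sketch: unclassified packets are spread
        # over the regular queues by flow hash.
        return flow_hash % self.num_regular_queues


if __name__ == "__main__":
    qmap = OffloadQueueMap(num_regular_queues=4)
    leaf = 0x00010010  # class 1:10 in packed form
    print(qmap.add_leaf_class(leaf))        # 4
    print(qmap.select_queue(leaf, 12345))   # 4
    print(qmap.real_num_tx_queues)          # 5
```

The lookup is a plain dictionary access with no shared lock, which is the property the design relies on when it moves classification out of the single HTB instance.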

Any feedback will be much appreciated.

Cheers,
Kuperman

