Message-ID: <87y2tmckyt.fsf@toke.dk>
Date: Sat, 01 Feb 2020 17:48:26 +0100
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Yossi Kuperman <yossiku@...lanox.com>,
"netdev\@vger.kernel.org" <netdev@...r.kernel.org>
Cc: Jamal Hadi Salim <jhs@...atatu.com>,
Jiri Pirko <jiri@...lanox.com>,
Rony Efraim <ronye@...lanox.com>,
Maxim Mikityanskiy <maximmi@...lanox.com>,
John Fastabend <john.fastabend@...il.com>,
Eran Ben Elisha <eranbe@...lanox.com>
Subject: Re: [RFC] Hierarchical QoS Hardware Offload (HTB)

Yossi Kuperman <yossiku@...lanox.com> writes:
> The following is a brief outline of our plan for offloading HTB functionality.
>
> HTB qdisc allows you to use one physical link to simulate several
> slower links. This is done by configuring a hierarchical QoS tree;
> each tree node corresponds to a class. Filters are used to classify
> flows to different classes. HTB is quite flexible and versatile, but
> this flexibility comes at a cost: HTB does not scale well and consumes
> considerable CPU and memory. Our aim is to offload HTB functionality
> to hardware and provide the user with the flexibility and conventional
> tools offered by the TC subsystem, while scaling to thousands of
> traffic classes and maintaining wire-speed performance.
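>
> For reference, a small software-only HTB hierarchy splitting a 1 Gbit/s
> link between two classes might be configured like this (interface name
> and rates are purely illustrative):
>
>   # root HTB; unclassified traffic falls back to class 1:20
>   tc qdisc add dev eth0 root handle 1: htb default 20
>   # parent class capping the link, two children that can borrow up to the ceil
>   tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
>   tc class add dev eth0 parent 1:1 classid 1:10 htb rate 600mbit ceil 1gbit
>   tc class add dev eth0 parent 1:1 classid 1:20 htb rate 400mbit ceil 1gbit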
>
> Mellanox hardware can support hierarchical rate-limiting;
> rate-limiting is done per hardware queue. In our proposed solution,
> flow classification takes place in software. By moving the
> classification to the clsact egress hook, which is thread-safe and
> does not require locking, we avoid the contention induced by the
> single qdisc lock. Furthermore, clsact filters are executed before the
> net-device’s TX queue is selected, giving the driver a chance to
> translate the class to the appropriate hardware queue. Please note
> that the user will need to configure the filters slightly differently:
> apply them to the clsact qdisc rather than to HTB itself, and set the
> priority to the desired class-id.
>
> For example, the following two filters are equivalent:
> 1. tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10
> 2. tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10
>
> Note: supporting the above filter requires no code changes to the upstream kernel or to the iproute2 package.
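>
> For completeness, the egress filter in (2) requires the clsact qdisc
> to be attached first:
>
>   tc qdisc add dev eth0 clsact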
>
> Furthermore, the most concerning aspect of the current HTB
> implementation is its lack of multi-queue support. All of the
> net-device’s TX queues point to the same HTB instance, resulting in
> high spin-lock contention. This contention might negate the overall
> performance gains expected from introducing the offload in the first
> place. We propose modifying HTB to present itself the way the mq qdisc
> does; by default, mq allocates a simple fifo qdisc per TX queue
> exposed by the lower-layer device. This mq-like behavior applies only
> when hardware offload is configured; otherwise, HTB behaves as usual.
> There is no HTB code along the data-path; the only overhead compared
> to regular traffic is the classification taking place at clsact.
> Please note that this design implies full offload---no fallback to
> software; partially offloading the hierarchical tree is not trivial
> anyway, given the borrowing between sibling classes.
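>
> From the user's point of view, configuring an offloaded hierarchy
> could then look something like the sketch below (the "offload"
> parameter and the exact syntax are hypothetical at this stage and only
> illustrate the intent):
>
>   # hypothetical: create an offloaded HTB root and a small hierarchy
>   tc qdisc add dev eth0 root handle 1: htb offload default 20
>   tc class add dev eth0 parent 1: classid 1:1 htb rate 10gbit
>   tc class add dev eth0 parent 1:1 classid 1:10 htb rate 4gbit ceil 10gbit
>   tc class add dev eth0 parent 1:1 classid 1:20 htb rate 6gbit ceil 10gbit
>   # classification stays in software, at the clsact egress hook
>   tc qdisc add dev eth0 clsact
>   tc filter add dev eth0 egress protocol ip flower dst_port 80 \
>           action skbedit priority 1:10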
>
>
> To summarize: for each HTB leaf class the driver will allocate a
> dedicated queue and match it with a corresponding net-device TX queue
> (increasing real_num_tx_queues). A separate fifo qdisc will be
> attached to each such TX queue. Classification will still take place
> in software, but at the clsact egress hook rather than in HTB itself.
> This way we can scale to thousands of classes while maintaining
> wire-speed performance and reducing CPU overhead.
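>
> The resulting structure should remain visible through the usual tools;
> presumably something like the following would show the per-queue fifos
> and the offloaded classes with their statistics:
>
>   tc qdisc show dev eth0
>   tc -s class show dev eth0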
>
> Any feedback will be much appreciated.

Other than echoing Dave's concern around baking FIFO semantics into
hardware, maybe also consider whether implementing the required
functionality using EDT-based semantics instead might be better? I.e.,
something like this:
https://netdevconf.info/0x14/session.html?talk-replacing-HTB-with-EDT-and-BPF
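
The basic idea there being to set a delivery time on each packet from a
BPF program at the clsact egress hook, and let the fq qdisc enforce it.
Schematically, on a single-queue setup, something like the following
(edt_shaper.o being a stand-in for the rate-setting BPF program):

  tc qdisc replace dev eth0 root fq
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 egress bpf da obj edt_shaper.o sec cls
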
-Toke