netdev - Re: Per-CPU Queueing for QoS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 13 Nov 2017 18:18:03 -0800
From:   Michael Ma <make0818@...il.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Stephen Hemminger <stephen@...workplumber.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        jianjun.duan@...baba-inc.com, xiangning.yu@...baba-inc.com
Subject: Re: Per-CPU Queueing for QoS

2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@...il.com>:
> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@...il.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@...workplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0818@...il.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@...il.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>>> >>> > avoids out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>>> >>> > belonging to its class from all the cpus and aggregate them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there is no meta design discussions in Linux development
>>> >> Several developers have tried to do lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward,
>>> > we have a per-cpu per-class queue defined as this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > }
>>> >
>>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet so if there is task
>>> > migration or multiple threads sending packets through the same flow we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also AFAIK there is no lockless htb-like qdisc implementation
>>> > currently, however if there is already similar effort ongoing please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key point was in : " do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in TX path of :
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
> I think this is a bit different from that idea in that the busy
> polling is transferring packets from a per-cpu qdisc to a per traffic
> class queueing discipline. Basically it would be a busy_poll version
> of qdisc_run that would be transferring packets from one qdisc onto
> another instead of attempting to transmit them directly.

The idea is to have the whole part implemented as one classful qdisc
and remove the qdisc root lock since all the synchronization will be
handled internally (let's put aside that other filters/actions/qdiscs
might still require a root lock for now).
>
> What I think is tripping me up is that I don't think this is even
> meant to work with a multiqueue device. The description seems more
> like a mqprio implementation feeding into a prio qdisc which is then
> used for dequeue. To me it seems like this solution would be pulling
> complexity off of the enqueue side and just adding it to the dequeue
> side. The use of the "busy poll" is just to try to simplify things as
> the agent would then be consolidating traffic from multiple per-cpu
> queues onto one drain queue.

We're essentially trying to spread the complexity from enqueue to
different stages such as enqueue/aggregation and rate
limiting/dequeue. Each stage will have different parallelisms. It
should work with multi-queue device since txq selection can be the
same as today. However our concern is that between enqueue and
aggregation we have a small window which can allow packet oob, which
is a sacrifice to better concurrency.

>
> Structure wise this ends up looking not too different from mqprio, the
> main difference though would be that this would be a classful qdisc
> and that the virtual qdiscs we have for the traffic classes would need
> to be replaced with actual qdiscs for handling the "drain" aspect of
> this.

Structure wise it's similar to mqprio + rate limiting qdisc without
root lock, and replacing txq/flow level parallelism by cpu level
parallelism. I'm actually not sure about the similarity with busy
polling that Eric mentioned since I haven't read the slides yet.
>
> - Alex