netdev - Re: Per-CPU Queueing for QoS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Mon, 13 Nov 2017 15:08:36 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Michael Ma <make0818@...il.com>,
        Stephen Hemminger <stephen@...workplumber.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        jianjun.duan@...baba-inc.com, xiangning.yu@...baba-inc.com
Subject: Re: Per-CPU Queueing for QoS

On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@...il.com> wrote:
> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@...workplumber.org>:
> >> On Sun, 12 Nov 2017 13:43:13 -0800
> >> Michael Ma <make0818@...il.com> wrote:
> >>
> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@...il.com> wrote:
> >>> >
> >>> > Currently txq/qdisc selection is based on flow hash so packets from
> >>> > the same flow will follow the order when they enter qdisc/txq, which
> >>> > avoids out-of-order problem.
> >>> >
> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
> >>> > per-cpu queues for a single TC class and do busy polling from a
> >>> > per-class thread to drain these queues. If we can do this frequently
> >>> > enough the out-of-order situation in this polling thread should not be
> >>> > that bad.
> >>> >
> >>> > To give more details - in the send path we introduce per-cpu per-class
> >>> > queues so that packets from the same class and same core will be
> >>> > enqueued to the same place. Then a per-class thread poll the queues
> >>> > belonging to its class from all the cpus and aggregate them into
> >>> > another per-class queue. This can effectively reduce contention but
> >>> > inevitably introduces potential out-of-order issue.
> >>> >
> >>> > Any concern/suggestion for working towards this direction?
> >>
> >> In general, there is no meta design discussions in Linux development
> >> Several developers have tried to do lockless
> >> qdisc and similar things in the past.
> >>
> >> The devil is in the details, show us the code.
> >
> > Thanks for the response, Stephen. The code is fairly straightforward,
> > we have a per-cpu per-class queue defined as this:
> >
> > struct bandwidth_group
> > {
> >     struct skb_list queues[MAX_CPU_COUNT];
> >     struct skb_list drain;
> > }
> >
> > "drain" queue is used to aggregate per-cpu queues belonging to the
> > same class. In the enqueue function, we determine the cpu where the
> > packet is processed and enqueue it to the corresponding per-cpu queue:
> >
> > int cpu;
> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
> >
> > cpu = get_cpu();
> > skb_list_append(&bwg->queues[cpu], skb);
> >
> > Here we don't check the flow of the packet so if there is task
> > migration or multiple threads sending packets through the same flow we
> > theoretically can have packets enqueued to different queues and
> > aggregated to the "drain" queue out of order.
> >
> > Also AFAIK there is no lockless htb-like qdisc implementation
> > currently, however if there is already similar effort ongoing please
> > let me know.
> 
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
> 
> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.

I think one of the key point was in : " do busy polling from a per-class
thread to drain these queues." 

I mentioned this idea in TX path of :

https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf