netdev - Re: [WIP] net+mlx4: auto doorbell

Message-ID: <1480559566.18162.253.camel@edumazet-glaptop3.roam.corp.google.com>
Date:   Wed, 30 Nov 2016 18:32:46 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Tom Herbert <tom@...bertland.com>
Cc:     Jesper Dangaard Brouer <brouer@...hat.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Rick Jones <rick.jones2@....com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Tariq Toukan <tariqt@...lanox.com>,
        Achiad Shochat <achiad@...lanox.com>
Subject: Re: [WIP] net+mlx4: auto doorbell

On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> >
> > Another issue I found during my tests last days, is a problem with BQL,
> > and more generally when driver stops/starts the queue.
> >
> > When under stress and BQL stops the queue, driver TX completion does a
> > lot of work, and servicing CPU also takes over further qdisc_run().
> >
> Hmm, hard to say if this is a problem or a feature I think ;-). One way
> to "solve" this would be to use IRQ round robin, that would spread the
> load of networking across CPUs, but that gives us no additional
> parallelism and reduces locality-- it's generally considered a bad
> idea. The question might be: is it better to continuously ding one CPU
> with all the networking work or try to artificially spread it out?
> Note this is orthogonal to MQ also, so we should already have multiple
> CPUs doing netif_schedule_queue for queues they manage.
> 
> Do you have a test or application that shows this is causing pain?

Yes, just launch enough TCP senders (more than 10,000) to fully utilize
the NIC, with small messages.

super_netperf is not good for that, because you would need 10,000
processes and would spend too many cycles just dealing with an enormous
working set; you would not activate BQL.


> 
> > The work-flow is :
> >
> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> > unmap things, queue skbs for freeing.
> >
> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
> >
> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
> >      netif_schedule_queue(dev_queue);
> >
> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> > (They absolutely have no chance to grab it)
> >
> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> > and RX work.
> >
> > This problem is magnified when XPS is used, if one mono-threaded application deals with
> > thousands of TCP sockets.
> >
> Do you know of an application doing this? The typical way XPS and most
> of the other steering mechanisms are configured assume that workloads
> tend towards a normal distribution. Such mechanisms can become
> problematic under asymmetric loads, but then I would expect these are
> likely dedicated servers so that the mechanisms can be tuned
> accordingly. For instance, XPS can be configured to map more than one
> queue to a CPU. Alternatively, IIRC Windows has some functionality to
> tune networking for the load (spin up queues, reconfigure things
> similar to XPS/RPS, etc.)-- that's promising up to the point that we need
> a lot of heuristics and measurement; but then we lose determinism and
> risk edge cases where we get completely unsatisfactory results (one of my
> concerns with the recent attempt to put configuration in the kernel).

We have thousands of applications, and some of them 'kind of multicast'
events to a broad number of TCP sockets.

Very often, application writers use a single thread for doing this,
when all they need is to send small packets to 10,000 sockets, and they
do not really care about doing this very fast. They also do not want to
hurt other applications sharing the NIC.

Very often, the process scheduler will also run this single thread on a
single cpu, i.e. avoiding expensive migrations if they are not needed.

Problem is this behavior locks one TX queue for the duration of the
multicast, since XPS will force all the TX packets to go to one TX
queue.

Other flows that would share the locked CPU experience high P99
latencies.
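
To make the XPS effect concrete, here is a minimal userspace sketch in C.
It is a simplified model, not the kernel's XPS code: struct xps_cpu_map,
pick_tx_queue() and flows_on_queue() are hypothetical names, and the hash
is a toy multiplicative hash. It assumes queue selection depends only on
the sending CPU's map, plus a flow hash when that map lists more than one
queue.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES_PER_CPU 4

/* Hypothetical, simplified stand-in for the set of TX queues mapped to one CPU. */
struct xps_cpu_map {
	unsigned int len;                     /* number of TX queues mapped to this CPU */
	uint16_t queues[MAX_QUEUES_PER_CPU];  /* their queue indexes */
};

/* Pick the TX queue for a flow transmitted from the CPU that owns 'map'. */
static uint16_t pick_tx_queue(const struct xps_cpu_map *map, uint32_t flow_hash)
{
	assert(map->len >= 1 && map->len <= MAX_QUEUES_PER_CPU);

	if (map->len == 1)
		return map->queues[0];        /* every flow from this CPU shares one queue */

	/* Scale the 32-bit hash into [0, len) to spread flows over the mapped queues. */
	return map->queues[((uint64_t)flow_hash * map->len) >> 32];
}

/* Count how many of 'nflows' distinct flows sent from this CPU land on 'queue'. */
static unsigned int flows_on_queue(const struct xps_cpu_map *map,
				   uint16_t queue, unsigned int nflows)
{
	unsigned int i, count = 0;

	for (i = 0; i < nflows; i++) {
		uint32_t hash = i * 2654435761u;  /* toy per-flow hash, wraps mod 2^32 */

		if (pick_tx_queue(map, hash) == queue)
			count++;
	}
	return count;
}
```

With a single queue in the map, all 10,000 flows of the single-threaded
sender end up on that queue, whatever their hashes are. With two queues
in the map, the same flows are split between them.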


> 
> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> > to allow another cpu to service the qdisc and spare us.
> >
> Wouldn't this need to be an active operation? That is to queue the
> qdisc on another CPU's output_queue?

I simply suggest we try to queue the qdisc for further servicing as we
do today, from net_tx_action(), but we might use a different bit, so
that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
before we grab it from net_tx_action(), maybe 100 usec later (time to
flush all skbs queued in napi_consume_skb() and maybe RX processing,
since most NICs handle TX completion before doing RX processing from their
napi poll handler).

Should be doable with a few changes in __netif_schedule() and
net_tx_action(), plus some control paths that will need to take care of
the new bit at dismantle time, right ?
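
The two-bit idea can be sketched as a small userspace model with C11
atomics. This is not a kernel patch. Everything here is an assumption made
for illustration: struct model_qdisc, the helper names, and the choice to
let "owning SCHED" stand for "this cpu services the qdisc". QSTATE_SCHED
plays the role of __QDISC_STATE_SCHED and QSTATE_PLEASE_GRAB_ME is the
proposed extra bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers in the model's state word (hypothetical, not kernel values). */
enum { QSTATE_SCHED = 0, QSTATE_PLEASE_GRAB_ME = 1 };

struct model_qdisc {
	atomic_ulong state;
	int serviced_by;   /* cpu id that last took SCHED, -1 if none */
};

static void model_qdisc_init(struct model_qdisc *q)
{
	atomic_init(&q->state, 0UL);
	q->serviced_by = -1;
}

/* Atomically set 'bit' and report whether it was already set. */
static bool model_test_and_set_bit(atomic_ulong *word, int bit)
{
	unsigned long mask = 1UL << bit;

	return atomic_fetch_or(word, mask) & mask;
}

/* Atomically clear 'bit' and report whether it was set. */
static bool model_test_and_clear_bit(atomic_ulong *word, int bit)
{
	unsigned long mask = 1UL << bit;

	return atomic_fetch_and(word, ~mask) & mask;
}

/* TX completion path: ask for service later, without taking SCHED now.
 * Returns true if this call queued the request. */
static bool schedule_deferred(struct model_qdisc *q)
{
	return !model_test_and_set_bit(&q->state, QSTATE_PLEASE_GRAB_ME);
}

/* Any cpu that wants to service the qdisc: take SCHED if it is free. */
static bool try_grab(struct model_qdisc *q, int cpu)
{
	if (model_test_and_set_bit(&q->state, QSTATE_SCHED))
		return false;
	q->serviced_by = cpu;
	return true;
}

/* Later, from the model's net_tx_action(): only now compete for SCHED. */
static bool deferred_tx_action(struct model_qdisc *q, int cpu)
{
	if (!model_test_and_clear_bit(&q->state, QSTATE_PLEASE_GRAB_ME))
		return false;   /* no deferred request was pending */
	return try_grab(q, cpu);
}

/* The servicing cpu is done: release SCHED. */
static void service_done(struct model_qdisc *q)
{
	bool was_set = model_test_and_clear_bit(&q->state, QSTATE_SCHED);

	assert(was_set);    /* releasing a qdisc nobody owned is a model bug */
	(void)was_set;
}
```

If another cpu calls try_grab() between schedule_deferred() and
deferred_tx_action(), the completion cpu finds SCHED already taken and is
spared. If nobody does, it services the qdisc itself, as today. The
dismantle-time handling of the new bit is not modeled.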