Date: Wed, 30 Nov 2016 17:16:25 -0800
From: Tom Herbert <tom@...bertland.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>,
    Willem de Bruijn <willemb@...gle.com>,
    Rick Jones <rick.jones2@....com>,
    Linux Kernel Network Developers <netdev@...r.kernel.org>,
    Saeed Mahameed <saeedm@...lanox.com>,
    Tariq Toukan <tariqt@...lanox.com>,
    Achiad Shochat <achiad@...lanox.com>
Subject: Re: [WIP] net+mlx4: auto doorbell

On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>
> Another issue I found during my tests last days, is a problem with BQL,
> and more generally when driver stops/starts the queue.
>
> When under stress and BQL stops the queue, driver TX completion does a
> lot of work, and servicing CPU also takes over further qdisc_run().
>
Hmm, hard to say if this is a problem or a feature I think ;-). One way
to "solve" this would be to use IRQ round robin, that would spread the
load of networking across CPUs, but that gives us no additional
parallelism and reduces locality-- it's generally considered a bad
idea. The question might be: is it better to continuously ding one CPU
with all the networking work or try to artificially spread it out?
Note this is orthogonal to MQ also, so we should already have multiple
CPUs doing netif_schedule_queue for queues they manage.

Do you have a test or application that shows this is causing pain?

> The work-flow is :
>
> 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
> unmap things, queue skbs for freeing.
>
> 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>
>    if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>         netif_schedule_queue(dev_queue);
>
> This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
> (They absolutely have no chance to grab it)
>
> So we end up with one cpu doing the ndo_start_xmit() and TX completions,
> and RX work.
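
To make the quoted work-flow concrete, here is a small userspace model of
it. This is only a sketch and not kernel code: the struct, the model_*
helpers and the bit numbers are all made up for illustration, and it
assumes that scheduling a qdisc amounts to winning a test-and-set on
__QDISC_STATE_SCHED. It shows why the CPU doing TX completion ends up
owning the qdisc: it clears XOFF and takes SCHED back to back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit numbers; the kernel's real values may differ. */
enum { STACK_XOFF_BIT = 0, SCHED_BIT = 0 };

/* Hypothetical stand-in for one TX queue and its qdisc. */
struct model_queue {
    atomic_ulong queue_state;  /* models dev_queue->state */
    atomic_ulong qdisc_state;  /* models the qdisc state word */
    int owner_cpu;             /* CPU that won SCHED, -1 if none */
};

static void model_queue_init(struct model_queue *q, bool stopped_by_bql)
{
    atomic_init(&q->queue_state, stopped_by_bql ? 1UL << STACK_XOFF_BIT : 0UL);
    atomic_init(&q->qdisc_state, 0UL);
    q->owner_cpu = -1;
}

/* Atomically clear a bit and report whether it was set before. */
static bool model_test_and_clear_bit(int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_and(addr, ~mask) & mask) != 0;
}

/* Atomically set a bit and report whether it was set before. */
static bool model_test_and_set_bit(int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Any CPU trying to take over qdisc servicing; true if it won SCHED. */
static bool model_try_schedule(struct model_queue *q, int cpu)
{
    assert(cpu >= 0);
    if (model_test_and_set_bit(SCHED_BIT, &q->qdisc_state))
        return false;          /* another CPU already holds SCHED */
    q->owner_cpu = cpu;
    return true;
}

/* Step 2 of the work-flow: BQL restart followed at once by a schedule. */
static bool model_tx_completion(struct model_queue *q, int cpu)
{
    if (model_test_and_clear_bit(STACK_XOFF_BIT, &q->queue_state))
        return model_try_schedule(q, cpu);
    return false;              /* queue was not stopped by BQL */
}
```

With a queue initialised as stopped, model_tx_completion(&q, 0) wins
SCHED for CPU 0, and a later model_try_schedule(&q, 1) from another CPU
fails. In this model nothing separates the clear from the set, which is
the "very tiny window" described in the quoted text.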
>
> This problem is magnified when XPS is used, if one mono-threaded application deals with
> thousands of TCP sockets.
>
Do you know of an application doing this? The typical way XPS and most
of the other steering mechanisms are configured assumes that workloads
tend towards a normal distribution. Such mechanisms can become
problematic under asymmetric loads, but then I would expect these are
likely dedicated servers so that the mechanisms can be tuned
accordingly. For instance, XPS can be configured to map more than one
queue to a CPU. Alternatively, IIRC Windows has some functionality to
tune networking for the load (spin up queues, reconfigure things
similar to XPS/RPS, etc.)-- that's promising up to the point that we
need a lot of heuristics and measurement; but then we lose determinism
and risk edge cases where we get completely unsatisfactory results (one
of my concerns with the recent attempt to put configuration in the
kernel).

> We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
> to allow another cpu to service the qdisc and spare us.
>
Wouldn't this need to be an active operation? That is to queue the
qdisc on another CPU's output_queue?

Tom
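
As a rough illustration of the XPS remark above, the sketch below builds
the sysfs path and the hex CPU mask that one would write to a TX queue's
xps_cpus file. The helper names and the "eth0" device in the usage note
are hypothetical, and the mask helper assumes CPUs numbered below 32 so
that a single hex word is enough.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the xps_cpus sysfs path for TX queue `txq` of device `dev`.
 * Returns 0 on success, -1 if the buffer is too small. */
static int xps_path(char *buf, size_t len, const char *dev, unsigned int txq)
{
    int n = snprintf(buf, len, "/sys/class/net/%s/queues/tx-%u/xps_cpus",
                     dev, txq);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format the hex bitmap selecting the given CPUs.
 * Assumption of this sketch: every CPU number is below 32.
 * Returns 0 on success, -1 on a bad CPU number or a short buffer. */
static int xps_mask(char *buf, size_t len,
                    const unsigned int *cpus, size_t ncpus)
{
    unsigned int mask = 0;
    size_t i;
    int n;

    for (i = 0; i < ncpus; i++) {
        if (cpus[i] >= 32)
            return -1;
        mask |= 1u << cpus[i];
    }
    n = snprintf(buf, len, "%x", mask);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Mapping more than one queue to a CPU then means writing a mask that
contains the same CPU bit into several queues' files, for example the
mask for CPU 2 into both tx-3 and tx-4 of a hypothetical eth0.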