netdev - Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

Message-ID: <CALx6S353F0HiqF8JNZwkcAR_dSt1=EauDj1U57MqOtpi4_SbdA@mail.gmail.com>
Date:   Thu, 8 Sep 2016 11:16:34 -0700
From:   Tom Herbert <tom@...bertland.com>
To:     Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     John Fastabend <john.fastabend@...il.com>,
        Saeed Mahameed <saeedm@....mellanox.co.il>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        iovisor-dev <iovisor-dev@...ts.iovisor.org>,
        Linux Netdev List <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Brenden Blanco <bblanco@...mgrid.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Martin KaFai Lau <kafai@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Achiad Shochat <achiad@...lanox.com>
Subject: Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
> On Thu, 8 Sep 2016 09:26:03 -0700
> Tom Herbert <tom@...bertland.com> wrote:
>
>> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
>> <brouer@...hat.com> wrote:
>> >
>> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...bertland.com> wrote:
>> >
>> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...il.com> wrote:
>> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >> >>
>> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@....mellanox.co.il> wrote:
>> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> >> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> >> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> >> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> >> [...]
>> >> >>>>
>> >> >>>> Only if a qdisc is present and pressure is high enough.
>> >> >>>>
>> >> >>>> But in a forwarding setup, we likely receive at a lower rate than the
>> >> >>>> NIC can transmit.
>> >> >>
>> >> >> Yes, I can confirm this happens in my experiments.
>> >> >>
>> >> >>>>
>> >> >>>
>> >> >>> Jesper has a similar idea to make the qdisc think it is under
>> >> >>> pressure when the device TX ring is idle most of the time. I think
>> >> >>> his idea can come in handy here. I am not fully involved in the
>> >> >>> details, maybe he can elaborate more.
>> >> >>>
>> >> >>> But if it works, it will be transparent to napi, and xmit more will
>> >> >>> happen by design.
>> >> >>
>> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> >> Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
>> >> >> challenge people to come up with a good solution ;-)
>> >> >>
>> >> >
>> >> > One thing I've noticed but haven't yet actually analyzed much is if
>> >> > I shrink the nic descriptor ring size to only be slightly larger than
>> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> >> > At least on microbenchmarks. The reason being the nic pushes back more
>> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> >> > factor of the expected number of queues feeding the descriptor ring.
>> >> >
>> >
>> > I've also played with shrinking the NIC descriptor ring size. It works,
>> > but it is an ugly hack to get NIC push-back, and I foresee it will
>> > hurt normal use-cases. (There are other reasons for shrinking the ring
>> > size, like cache usage, but that is unrelated to this.)
>> >
>> >
>> >> BQL is not helping with that?
>> >
>> > Exactly. But the BQL _byte_ limit is not what is needed; what we need
>> > to know is the _packets_ currently "in-flight".  Which Tom already has
>> > a patch for :-)  Once we have that, the algorithm is simple.
>> >
>> > Qdisc dequeue looks at BQL pkts-in-flight. If the driver has "enough"
>> > packets in-flight, the qdisc starts its bulk dequeue building phase
>> > before calling the driver. The allowed max qdisc bulk size should
>> > likely be related to pkts-in-flight.
>> >
>> Sorry, I'm still missing it. The point of BQL is that we minimize the
>> amount of data (and hence number of packets) that needs to be queued
>> in the device in order to prevent the link from going idle while there
>> are outstanding packets to be sent. The algorithm is based on counting
>> bytes not packets because bytes are roughly an equal cost unit of
>> work. So if we've queued 100K bytes on the queue, we know how long
>> that takes to transmit: around 80 usecs @10G. But if we count packets
>> then we really don't know much about that. 100 packets enqueued could
>> represent 6400 bytes or 6400K worth of data, so time to transmit is
>> anywhere from 5 usecs to 5 msecs....
>>
>> Shouldn't qdisc bulk size be based on the BQL limit? What is the
>> simple algorithm to apply to in-flight packets?
>
> Maybe the algorithm is not so simple, and we likely also have to take
> BQL bytes into account.
>
> The reason for wanting packets-in-flight is that we are attacking a
> transaction cost.  The tailptr/doorbell costs around 70ns.  (Based on
> data in this patch desc, 4.9Mpps -> 7.5Mpps: (1/4.90-1/7.5)*1000 =
> 70.74).  The 10G wirespeed small-packet budget is 67.2ns, so with a
> fixed overhead per packet of 70ns we can never reach 10G wirespeed.
>
But you should be able to do this with BQL and it is more accurate.
BQL tells how many bytes need to be sent and that can be used to
create a bulk of packets to send with one doorbell.
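
The figures quoted in this thread can be checked with a few lines of
arithmetic. The sketch below is only an illustration, not kernel code:
the function names are made up, "K" is taken as 1000, and the 84-byte
on-wire size assumes a minimum 64-byte Ethernet frame plus 20 bytes of
preamble, start-of-frame delimiter and inter-frame gap.

```python
# Back-of-the-envelope check of the numbers discussed in this thread.
# All names are illustrative; nothing here is a kernel API.

LINK_BPS = 10e9  # 10 Gbit/s link


def tx_time_ns(nbytes, link_bps=LINK_BPS):
    """Time to serialize nbytes onto the wire, in nanoseconds."""
    return nbytes * 8 / link_bps * 1e9


def per_packet_ns(mpps):
    """Per-packet time at a given rate in million packets per second."""
    return 1e3 / mpps


# Doorbell cost inferred from the patch description: 4.9 Mpps -> 7.5 Mpps.
doorbell_ns = per_packet_ns(4.9) - per_packet_ns(7.5)

# Assumption: 64-byte frame + 20 bytes of preamble/SFD/inter-frame gap.
wire_budget_ns = tx_time_ns(84)


def amortized_doorbell_ns(bulk):
    """Doorbell cost per packet when one doorbell covers `bulk` packets."""
    return doorbell_ns / bulk


if __name__ == "__main__":
    print(f"doorbell cost  ~{doorbell_ns:.2f} ns")
    print(f"wire budget     {wire_budget_ns:.1f} ns per minimum-size frame")
    print(f"100K bytes      {tx_time_ns(100_000) / 1e3:.0f} us")
    print(f"100 x 64 bytes  {tx_time_ns(6_400) / 1e3:.2f} us")
    print(f"100 x 64K bytes {tx_time_ns(6_400_000) / 1e6:.2f} ms")
    for bulk in (1, 2, 4, 8, 16):
        print(f"bulk={bulk:2d}  doorbell per packet "
              f"{amortized_doorbell_ns(bulk):.2f} ns")
```

With one doorbell per packet, the doorbell cost alone exceeds the
per-packet wire budget. Both proposals in this thread aim to bring it
down by sharing one doorbell across a bulk of packets.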

> The idea/algo is trying to predict the future.  If we see a given/high
> packet rate, which equals a high transaction cost, then let's try not
> calling the driver, and instead backlog the packet in the qdisc,
> speculatively hoping the current rate continues.  This will in effect
> allow bulking and amortize the 70ns transaction cost over N packets.
>
> Instead of tracking a rate of packets or doorbells per sec, I will let
> BQL's packets-in-flight tell me when the rate is high enough that the
> driver (at DMA-TX completion) sees several packets in flight.
> When that happens, I will bet that I can stop sending packets to the
> device, and instead queue them in the qdisc layer.  If I'm unlucky and
> the flow stops, then I'm hoping that the last packet stuck in the qdisc
> will be picked up by the next napi-schedule, before the device driver
> runs "dry".
>
This is exactly what BQL already does (except the queue limit is on
bytes). Once the byte limit is reached, the queue is stopped. At TX
completion time some number of bytes are freed up, so that a bulk of
packets can be sent, up to the queue limit.
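
The stop/complete/bulk cycle described above can be sketched as a toy
model. This is not the kernel's BQL/dql implementation: the class and
method names are hypothetical, the limit is fixed (real BQL adapts it),
and one "doorbell" is simply counted per bulk.

```python
from collections import deque


class ToyByteLimitedQueue:
    """Toy model of a byte-limited TX queue (hypothetical, not kernel code)."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes  # fixed here; real BQL adjusts it
        self.inflight_bytes = 0         # handed to the device, not completed
        self.backlog = deque()          # packet sizes waiting in the "qdisc"
        self.doorbells = 0              # one per bulk handed to the device
        self.sent = 0                   # packets handed to the device

    def enqueue(self, size):
        """Queue a packet of `size` bytes in the backlog."""
        self.backlog.append(size)

    def stopped(self):
        """The queue is stopped once the byte limit is reached."""
        return self.inflight_bytes >= self.limit_bytes

    def xmit(self):
        """Send backlog packets until the limit is hit; one doorbell per bulk."""
        bulk = []
        while self.backlog and not self.stopped():
            size = self.backlog.popleft()
            self.inflight_bytes += size
            bulk.append(size)
        if bulk:
            self.doorbells += 1
            self.sent += len(bulk)
        return bulk

    def complete(self, nbytes):
        """TX completion frees bytes, which lets the next bulk go out."""
        self.inflight_bytes = max(0, self.inflight_bytes - nbytes)


if __name__ == "__main__":
    q = ToyByteLimitedQueue(limit_bytes=3000)
    for _ in range(6):
        q.enqueue(1500)
    while q.backlog:
        bulk = q.xmit()
        print(f"bulk of {len(bulk)} packets, doorbells so far: {q.doorbells}")
        q.complete(sum(bulk))
```

In this model, packets that arrive while the queue is stopped pile up
in the backlog and go out together after the next completion, so the
bulk size follows from the byte limit rather than from a packet count.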

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer