Date:	Fri, 6 Feb 2015 08:50:03 +1300
From:	Dave Taht <dave.taht@...il.com>
To:	Michal Kazior <michal.kazior@...to.com>
Cc:	Eric Dumazet <eric.dumazet@...il.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	Eyal Perry <eyalpe@....mellanox.co.il>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	linux-wireless <linux-wireless@...r.kernel.org>
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On Fri, Feb 6, 2015 at 2:44 AM, Michal Kazior <michal.kazior@...to.com> wrote:
>
> On 5 February 2015 at 14:19, Eric Dumazet <eric.dumazet@...il.com> wrote:
> > On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
> >
> >> The intention is to control the queues to the following:
> >>
> >> 1 ms of buffering, but limited to a configurable value.
> >>
> >> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
> >>
> >> We do not want to queue 5 MB of traffic; this would destroy latencies
> >> for all concurrent flows. (Or it would require having fq_codel or fq as
> >> the packet scheduler, instead of the default pfifo_fast.)
> >>
> >> This is why having a 1.5 ms delay between transmit and TX completion
> >> is a problem in your case.
>
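(For the arithmetic behind the 5 MB figure above: 40 Gbit/s * 1 ms = 40 Mbit = 5 MB.)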
> I do get your point. But 1.5ms is really tough on Wi-Fi.
>
> Just look at this:
>
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
>
> ; ping 192.168.1.2 -c 3 -Q 224
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms
>
> This was run with no load, so the batching code in the driver itself should
> have no measurable effect. The channel was near-ideal: low noise
> floor, cabled RF, no other traffic.
>
> The lower-latency ping is when the 802.11 QoS Voice access category is
> used. I also get 400mbps instead of 250mbps in this case with 5 flows
> (net/master).
>
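For reference (not from the thread): "-Q 224" sets the IPv4 TOS byte to 0xE0, which the usual 802.1d mapping (TOS >> 5 = priority 7) classifies into the 802.11 Voice access category. A minimal userspace sketch of marking a socket's traffic the same way, assuming a plain IPv4 socket:

#include <netinet/in.h>
#include <sys/socket.h>

/* Mark all traffic on 'sock' with TOS 0xE0 (decimal 224), the value used
 * by "ping -Q 224" above; 802.11 QoS maps this to the VO access category. */
static int set_voice_tos(int sock)
{
	int tos = 0xE0;

	return setsockopt(sock, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}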

The VO queue is now nearly useless in a real-world environment. While
it does grab the media slightly faster in some cases, on a good day with
no other competing APs, it cannot aggregate packets and so wastes TXOPs.
It is far saner to aim for better aggregation (use the VI queue if you
must try to get better media acquisition).

It is disabled in multiple products I know of.

And I really, really, really wish that, just once during this thread,
someone had bothered to run a test at a real-world MCS rate - say
MCS1 or MCS4 - and measured the latency under load there...

or tried talking to two or more stations at the same time,

instead of trying for 1.5 Gbit/s in a Faraday cage.


>
> Dealing with black box firmware blobs is a pain.
>

+10

>
>
> > Note that the TCP stack could detect when this happens, *if* ACKs were
> > delivered before the TX completions; or, when TX completion happens,
> > we could detect that the clone of the freed packet was already freed.
> >
> > In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64" and
> > disabled GSO, the TCP stack sends a bunch of packets (a bit fewer than 64),
> > then blocks on tcp_limit_output_bytes.
> >
> > Then we receive 2 stretch ACKs after ~50 usec.
> >
> > The TCP stack tries to push some more packets but blocks on
> > tcp_limit_output_bytes again.
> >
> > 1 ms later, TX completion happens, tcp_wfree() is called, and the TCP
> > stack pushes the following ~60 packets.
> >
> >
> > TCP could eventually adjust tcp_limit_output_bytes dynamically, using
> > a per-flow value, but I would rather not add a kludge to the TCP stack
> > only to deal with a possible bug in the ath10k driver.
> >
> > niu has a similar issue and simply had to call skb_orphan():
> >
> > drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);
>
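For context on the niu approach: skb_orphan() runs the skb's destructor (tcp_wfree() for locally generated TCP segments) and detaches the skb from its socket, so the socket's in-flight byte accounting is released at submit time instead of at TX-completion time. A minimal sketch of that pattern in a driver xmit path (hypothetical names, not the actual niu code):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	/* ... DMA-map the skb and fill the hardware descriptor ... */

	/*
	 * Drop the socket reference now.  The destructor (tcp_wfree for
	 * TCP) runs here rather than when the hardware completes the
	 * frame, so TSQ stops counting these bytes against
	 * tcp_limit_output_bytes.  The cost is that the stack loses
	 * backpressure from the device's TX queue for this skb.
	 */
	skb_orphan(skb);

	/* ... kick the hardware and return ... */
	return NETDEV_TX_OK;
}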
> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
>
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
>         if (res)
>                 goto err_unmap_msdu;
>
> +       skb_orphan(msdu);
> +
>         return 0;
>
>  err_unmap_msdu:
>
>
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better than before). Wow.
>
> Does this look ok/safe as a solution to you?
>
>
> Michał
> --




-- 
Dave Täht

http://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
