Date:   Thu, 12 Dec 2019 20:15:14 -0800
From:   Justin Capella <justincapella@...il.com>
To:     Johannes Berg <johannes@...solutions.net>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        linux-wireless@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: debugging TCP stalls on high-speed wifi

Could the TCP window size (the sender waiting for ACKs) be a contributing factor?
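
If it's easy to check, one way to watch the send-side window state while the test runs is TCP_INFO. A minimal userspace sketch - the dump_tcp_state() helper name and the choice of fields are mine, just for illustration; call it on the connected test socket:

/* tcp_info_probe.c - dump a connected TCP socket's send-side state.
 * Minimal sketch, error handling trimmed.
 */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>         /* IPPROTO_TCP */
#include <netinet/tcp.h>        /* TCP_INFO, struct tcp_info */
#include <sys/socket.h>

static void dump_tcp_state(int fd)
{
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        memset(&ti, 0, sizeof(ti));
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
                perror("getsockopt(TCP_INFO)");
                return;
        }
        /* snd_cwnd is in MSS units, rtt is in microseconds */
        printf("snd_cwnd=%u mss=%u rtt=%uus unacked=%u retrans=%u\n",
               ti.tcpi_snd_cwnd, ti.tcpi_snd_mss, ti.tcpi_rtt,
               ti.tcpi_unacked, ti.tcpi_total_retrans);
}

If snd_cwnd * mss / rtt comes out well above the ~1400 Mbps observed below, the window itself probably isn't the limiter.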

On Thu, Dec 12, 2019 at 6:52 AM Johannes Berg <johannes@...solutions.net> wrote:
>
> Hi Eric, all,
>
> I've been debugging (much thanks to bpftrace) TCP stalls on wifi, in
> particular on iwlwifi.
>
> What happens, essentially, is that we transmit large aggregates (63
> packets of 7.5k A-MSDU size each, for something on the order of 500kB
> per PPDU). Theoretically we can have ~240 A-MSDUs on our hardware
> queues, and the hardware aggregates them into up to 63 to send as a
> single PPDU.
>
> At HE rates (160 MHz, high rates) such a large PPDU takes less than 2ms
> to transmit.
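
(Sanity-checking those figures: 63 A-MSDUs * 7.5 kB ≈ 472 kB per PPDU, and 472 kB / 2 ms ≈ 1.9 Gbit/s of on-air burst rate, so the ~500 kB and <2 ms numbers are consistent with each other.)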
>
> I'm seeing around 1400 Mbps TCP throughput (a bit more than 1800 UDP),
> but I'm expecting more. A bit more than 1800 for UDP is about the max I
> can expect on this AP (it only does 8k A-MSDU size), but I'd think TCP
> then shouldn't be so much less (and our Windows driver gets >1600).
>
>
> What I see is that occasionally - and this doesn't happen all the time
> but probably enough to matter - we reclaim a few of those large
> aggregates and free the transmit SKBs, and then we try to pull from
> mac80211's TXQs but they're empty.
>
> At this point - we've just freed 400+k of data, I'd expect TCP to
> immediately push more, but it doesn't happen. I sometimes see another
> set of reclaims emptying the queue entirely (literally down to 0 packets
> on the queue) and it then takes another millisecond or two for TCP to
> start pushing packets again.
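
One mechanism worth ruling out here - an assumption on my side, not something the description above confirms: TCP Small Queues caps the bytes a flow may have queued below the stack at roughly pacing_rate >> sk_pacing_shift, and that budget is only returned when the driver frees the skbs. If one completed PPDU is bigger than the whole budget, an empty TXQ right after reclaim is exactly what you'd get. A standalone model of the budget arithmetic (the tsq_limit_bytes() helper and the example numbers are mine; the floor of two TSO packets follows my reading of tcp_small_queue_check() in kernels of this era):

/* tsq_budget.c - standalone model of the TCP Small Queues byte budget. */
#include <stdio.h>
#include <stdint.h>

static uint64_t tsq_limit_bytes(uint64_t pacing_rate_Bps, int pacing_shift,
                                uint32_t tso_pkt_bytes)
{
        uint64_t limit = pacing_rate_Bps >> pacing_shift;
        uint64_t floor = 2ULL * tso_pkt_bytes;  /* always allow two TSO packets */

        return limit > floor ? limit : floor;
}

int main(void)
{
        uint64_t rate = 1800ULL * 1000 * 1000 / 8;  /* ~1800 Mbit/s in bytes/s */
        uint32_t tso  = 64 * 1024;                  /* one 64k TSO packet */

        /* shift 10 is ~1 ms worth of data (the default); shift 8 is ~4 ms */
        printf("shift 10: ~%llu kB allowed below the stack\n",
               (unsigned long long)(tsq_limit_bytes(rate, 10, tso) / 1024));
        printf("shift  8: ~%llu kB allowed below the stack\n",
               (unsigned long long)(tsq_limit_bytes(rate, 8, tso) / 1024));
        return 0;
}

At the default shift of 10 that's only ~215 kB - less than half of one ~472 kB PPDU - so a single reclaim can hand back more data than TCP was allowed to have below the stack in the first place. If I remember Toke's work correctly, mac80211's sk_pacing_shift_update() to 8 exists for exactly this kind of case.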
>
> Once that happens, I also observe that TCP stops pushing large TSO
> packets and sometimes goes down to less than a single A-MSDU (i.e.
> ~7.5k) per TSO, perhaps even a single MTU-sized frame - I didn't check
> the exact sizes, only the number of frames we make out of each TSO.
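
The shrinking TSO sizes would also fit TCP's autosizing - again an assumption: as I remember tcp_tso_autosize() in 5.x, the stack scales each TSO chunk to about pacing_rate >> sk_pacing_shift, so the chunks collapse together with the pacing rate estimate. A toy model (the tso_chunk_bytes() helper and the example rates are mine):

/* tso_autosize_model.c - toy model of TCP sizing TSO chunks from the
 * pacing rate, loosely after tcp_tso_autosize(); the real code also
 * caps at gso_max_size/gso_max_segs, which this model ignores.
 */
#include <stdio.h>
#include <stdint.h>

static uint32_t tso_chunk_bytes(uint64_t pacing_rate_Bps, int shift,
                                uint32_t mss)
{
        uint64_t bytes = pacing_rate_Bps >> shift;
        uint32_t segs  = bytes / mss;

        if (segs < 2)
                segs = 2;               /* never below two segments per chunk */
        return segs * mss;
}

int main(void)
{
        const uint32_t mss = 1448;
        const int shift = 10;           /* ~1 ms worth of data */
        const uint64_t rates[] = { 225000000, 50000000, 6000000 };

        for (int i = 0; i < 3; i++)     /* ~1800, ~400, ~48 Mbit/s */
                printf("%4llu Mbit/s -> %6u-byte TSO chunks\n",
                       (unsigned long long)(rates[i] * 8 / 1000000),
                       tso_chunk_bytes(rates[i], shift, mss));
        return 0;
}

Once the rate estimate dips after a stall, sub-A-MSDU chunks (under the ~7.5k) fall out of that arithmetic naturally.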
>
>
> If you have any thoughts on this, I'd appreciate it.
>
>
> Something I've been wondering is whether our TSO implementation causes
> issues, but apart from higher CPU usage I saw no real difference when I
> turned it off. I suspected it because it splits up the SKBs into those
> A-MSDU sized chunks using skb_gso_segment() and then splits those down
> into MTU-sized frames again, which the hardware engine packs back
> together into an A-MSDU. But that means we release a bunch of
> A-MSDU-sized SKBs back to the TCP stack once they've been transmitted.
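
For readers without the driver in front of them, the software path being described is roughly the generic skb_gso_segment() pattern. The sketch below is that generic pattern, not the actual iwlwifi code, and the function name is made up:

/* Sketch of software GSO segmentation in a driver TX path; not the
 * iwlwifi code. Assumes gso_size was set to the A-MSDU size, so each
 * resulting skb is one A-MSDU-sized chunk.
 */
#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int xmit_tso_sketch(struct sk_buff *skb, netdev_features_t features)
{
        struct sk_buff *segs, *next;

        /* returns a ->next-linked list of smaller skbs */
        segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
        if (IS_ERR_OR_NULL(segs))
                return -EINVAL;         /* sketch: real code handles both cases */

        consume_skb(skb);               /* the original skb is no longer needed */

        while (segs) {
                next = segs->next;
                segs->next = NULL;
                /* hand each A-MSDU-sized chunk to the hardware queue here */
                segs = next;
        }
        return 0;
}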
>
> Another thought I had was our broken NAPI, but this is TX traffic so the
> only RX thing is sync, and I'm currently still using kernel 5.4 anyway.
>
> Thanks,
> johannes
>
