lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <35A4DDAA-7E8D-43CB-A1F5-D1E46A4ED42E@gmail.com>
Date:   Tue, 23 Feb 2021 11:09:19 +0100
From:   Gil Pedersen <kanongil@...il.com>
To:     davem@...emloft.net, yoshfuji@...ux-ipv6.org, dsahern@...nel.org
Cc:     netdev@...r.kernel.org
Subject: TCP stall issue

Hi,

I am investigating a TCP stall that can occur when sending to an Android device (kernel 4.9.148) from an Ubuntu server running kernel 5.11.0.

The issue seems to be that RACK is not applied when a D-SACK (with SACK) is received on the server after an RTO re-transmission (CA_Loss state). Here the re-transmitted segment is considered to be already delivered and loss undo logic is applied. Then nothing is re-transmitted until the next RTO, where the next segment is sent and the same thing happens again. The causes the retransmitted segments to be delivered at a rate of ~1 per second, so a burst loss of eg. 20 segments cause a 20+ second stall. I would expect RACK to kick in long before this happens.

Note the D-SACK should not be considered spurious, as the TSecr value matches the re-transmission TSval.

Also, the Android receiver is definitely sending strange D-SACKs that does not properly advance the ACK number to include received segments. However, I can't control it and need to fix it on the server by quickly re-transmitting the segments. The connection itself is functional. If the client makes a request to the server in this state, it can respond and the client will receive any segments sent in reply.

I can see from counters that TcpExtTCPLossUndo & TcpExtTCPSackFailures are incremented on the server when this happens.
The issue appears both with F-RTO enabled and disabled. Also appears both with BBR and RENO.

Any idea of why this happens, or suggestions on how to debug the issue further?

/Gil

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ