netdev - Re: nonagle flags for TSQ

Message-ID: <1391788738.10160.53.camel@edumazet-glaptop2.roam.corp.google.com>
Date:	Fri, 07 Feb 2014 07:58:58 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	John Ogness <john.ogness@...utronix.de>
Cc:	netdev@...r.kernel.org
Subject: Re: nonagle flags for TSQ

On Fri, 2014-02-07 at 07:34 -0800, Eric Dumazet wrote:
> On Fri, 2014-02-07 at 16:08 +0100, John Ogness wrote:
> > Hi,
> > 
> > This email is referring to your Linux patch
> > 46d3ceabd8d98ed0ad10f20c595ca784e34786c5.
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> > 
> > I have a question about the use of tcp_write_xmit() in
> > net/ipv4/tcp_output.c
> > 
> > When tcp_write_xmit() is called, the nonagle flag of the TCP socket is
> > ignored and 0 is passed instead. This causes the Nagle algorithm to be
> > used even if it should not be, which in some cases causes a large delay.
> > 
> > Was there a reason that 0 was hard-coded?
> > 
> > Although current mainline code has been refactored, 0 is still
> > hard-coded for TSQ cases.
> 
> Hi John
> 
> Do you have any data, such as the exact kernel version you use, a
> tcpdump capture, or things like that?
> 
> When TCP writes are throttled, it's only until the next packet is
> TX completed, and only if you have at least 128KB worth of bytes
> queued in the QDISC/NIC layers for this socket.
> 
> We had some issues at very high speeds, not related to Nagle at all.
> 
> 98e09386c0ef tcp: tsq: restore minimal amount of queueing
> c9eeec26e32e tcp: TSQ can use a dynamic limit
> d6a4a1041176 tcp: GSO should be TSQ friendly
> d01cb20711e3 tcp: add LAST_ACK as a valid state for TSQ
> 
> I am not aware of TSQ being a problem for Nagle.
> 
> Also take a look at the recent TCP autocork patches, as they are more
> closely related to Nagle:
> 
> a181ceb501b3 tcp: autocork should not hold first packet in write queue
> f54b311142a9 tcp: auto corking
> 
> Thanks

I think I mentioned this once, but the "a181ceb501b3" fix
included this bit:

 Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
 test after setting TSQ_THROTTLED.

So it's possible you hit the same race, but it's only a guess...

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 03d26b85eab8..c99a63c6e91a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1904,7 +1904,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-			break;
+			/* It is possible TX completion already happened
+			 * before we set TSQ_THROTTLED, so we must
+			 * test again the condition.
+			 */
+			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				break;
 		}
 
 		limit = mss_now;
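
To make the race easier to see, here is a small userspace sketch. It is
only a simplified model, not the kernel code: struct model_sock,
model_tx_complete() and model_sender_stops() are hypothetical names,
C11 atomics stand in for sk_wmem_alloc and the TSQ_THROTTLED bit, and
the "racing" TX completion is simulated inline instead of running on
another CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for sk->sk_wmem_alloc and the TSQ_THROTTLED bit. */
struct model_sock {
	atomic_int wmem_alloc;	/* bytes still queued below TCP */
	atomic_bool throttled;	/* models TSQ_THROTTLED */
};

/* Models a TX completion: releases bytes and reports whether it saw the
 * throttled flag, i.e. whether it would reschedule the sender.
 */
static bool model_tx_complete(struct model_sock *sk, int bytes)
{
	assert(bytes >= 0);
	atomic_fetch_sub(&sk->wmem_alloc, bytes);
	return atomic_load(&sk->throttled);
}

/* Models the sender-side check. Returns true if the sender stops
 * ("break"). racing_bytes is completed in the window between the first
 * test and setting the flag, which is the race described above.
 * recheck selects the patched behaviour.
 */
static bool model_sender_stops(struct model_sock *sk, int limit,
			       bool recheck, int racing_bytes)
{
	if (atomic_load(&sk->wmem_alloc) > limit) {
		/* Completion runs here and does not see the flag yet,
		 * so nobody will wake the sender later.
		 */
		bool saw_flag = model_tx_complete(sk, racing_bytes);
		(void)saw_flag;

		atomic_store(&sk->throttled, true);

		if (!recheck)
			return true;	/* old code: unconditional break */

		/* Patched code: test the condition again. */
		if (atomic_load(&sk->wmem_alloc) > limit)
			return true;
	}
	return false;
}
```

In this model, without the second test the sender stops even though
everything it was waiting for has already completed, and no later
completion remains to restart it. With the second test it keeps
sending. The flag stays set in that path, which in this model would at
worst lead to an extra wakeup.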

