[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADVnQy=ayiUVjXPscUkONDOGhfQ7xDyqn=KFzrndjw2mrPufrw@mail.gmail.com>
Date: Fri, 13 Dec 2013 13:17:32 -0500
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Yuchung Cheng <ycheng@...gle.com>,
Nandita Dukkipati <nanditad@...gle.com>,
Van Jacobson <vanj@...gle.com>
Subject: Re: [PATCH v2 net-next] tcp: remove a bogus TSO split
On Fri, Dec 13, 2013 at 1:13 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> While investigating performance problems on small RPC workloads,
> I noticed linux TCP stack was always splitting the last TSO skb
> into two parts (skbs). One being a multiple of MSS, and a small one
> with the Push flag. This split is done even if TCP_NODELAY is set.
>
...
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Yuchung Cheng <ycheng@...gle.com>
> Cc: Neal Cardwell <ncardwell@...gle.com>
> Cc: Nandita Dukkipati <nanditad@...gle.com>
> Cc: Van Jacobson <vanj@...gle.com>
> ---
> v2: changed tcp_minshall_update() as Neal pointed out.
It occurred to me after staring at this code a little longer that
although the Nagle test is already done before the call to
tcp_mss_split_point(), the tcp_nagle_check() is only run if
tso_segs==1 and is only checking whether skb->len < mss_now, so the
Nagle code currently is implicitly assuming that if there is an skb
that is not an exact multiple of MSS, then the tcp_mss_split_point()
will chop off the odd bytes at the end and we'll loop back to the top
of the tcp_write_xmit() loop to make a Nagle decision on an skb that
has tso_segs==1.
So if we just make this improvement to tcp_mss_split_point() and the
adjustment to tcp_minshall_update() (as in patch v2), we will still
have broken Nagle behavior in a scenario like:
- suppose MSS=1460 and Nagle is enabled (TCP_NODELAY is not set)
- suppose there is an outstanding un-ACKed small packet, so that
tcp_minshall_check() returns true
- user writes 2000 bytes
- tso_segs is 2, so we do not call tcp_nagle_test()
- new version of tcp_mss_split_point() sends out the full 2000 bytes,
leading to having 2 packets smaller than an MSS un-ACKed in the network,
violating the invariant the minshall/nagle code is trying to maintain
or having only one such packet un-ACKed in the network.
Previously, tcp_mss_split_point() would have split off the 2000-1460
bytes into a new skb, we would have looped back and executed the
tcp_nagle_test() code and decided not to send that small sub-MSS
2000-1460 packet.
neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists