Message-ID: <CAK6E8=e_cmLHzn=RJ=Bj3aPXj0Vv9OXY=Fe5Y=pRiDALoh9ptQ@mail.gmail.com>
Date: Fri, 2 Mar 2012 08:28:48 -0800
From: Yuchung Cheng <ycheng@...gle.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
ilpo.jarvinen@...sinki.fi, Nandita Dukkipati <nanditad@...gle.com>,
Tom Herbert <therbert@...gle.com>
Subject: Re: [PATCH] tcp: fix tcp_retransmit_skb() to maintain MSS invariant
On Fri, Mar 2, 2012 at 6:27 AM, Neal Cardwell <ncardwell@...gle.com> wrote:
> This commit fixes tcp_retransmit_skb() to respect the invariant that
> an skb in the write queue that might be SACKed (that is, that precedes
> tcp_send_head()) is either less than tcp_skb_mss(skb) or an integral
> multiple of tcp_skb_mss(skb).
>
> Various parts of the TCP code maintain or assume this invariant,
> including at least tcp_write_xmit(), tcp_mss_split_point(),
> tcp_match_skb_to_sack(), and tcp_shifted_skb().
>
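As a rough illustration of the invariant described above, here is a minimal userspace sketch. The helper name `len_respects_mss_invariant()` is hypothetical (it is not a kernel function), and it works on plain integers, not on a real `struct sk_buff`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of the kernel: returns true if a payload
 * of `len` bytes satisfies the invariant for the given MSS, i.e. it is
 * either smaller than one MSS or an exact multiple of the MSS.
 */
static bool len_respects_mss_invariant(unsigned int len, unsigned int mss)
{
	assert(mss > 0);	/* a zero MSS makes no sense in this model */
	return len < mss || len % mss == 0;
}
```

With an MSS of 1000, lengths such as 500, 1000 and 3000 pass the check, while 1500 does not: it is neither sub-MSS nor a whole number of segments.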
> tcp_retransmit_skb() did not maintain this invariant. It checked the
> current MSS and called tcp_fragment() to make sure that the skb we're
> retransmitting is at most cur_mss, but in the process it took the
> excess bytes and created an arbitrary-length skb (one that is not
> necessarily an integral multiple of its MSS) and inserted it in the
> write queue after the skb we're retransmitting.
>
> One potential indirect effect of this problem is tcp_shifted_skb()
> creating a coalesced SACKed skb that has a pcount that is 1 too large
> for its length. This happened because tcp_shifted_skb() assumed that
> skbs are integral multiples of MSS, so you can just add pcounts of
> input skbs to find the pcount of the output skb. Suspected specific
> symptoms of this problem include the WARN_ON(len > skb->len) in
> tcp_fragment() firing, as the 1-too-large pcount ripples through to
> tcp_mark_head_lost() trying to chop off too many bytes to mark as
> lost.
>
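The off-by-one described above can be reproduced with simple arithmetic. The sketch below assumes that pcount is the length divided by the MSS, rounded up. Both function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: segment count for `len` bytes, assuming
 * pcount = ceil(len / mss).
 */
static unsigned int pcount_for_len(unsigned int len, unsigned int mss)
{
	assert(mss > 0);
	return (len + mss - 1) / mss;
}

/* Hypothetical helper: how far the sum of two per-skb pcounts exceeds the
 * pcount that the coalesced length actually needs. The result is zero when
 * at least one input is an integral multiple of the MSS.
 */
static unsigned int pcount_overcount(unsigned int len_a, unsigned int len_b,
				     unsigned int mss)
{
	unsigned int summed = pcount_for_len(len_a, mss) +
			      pcount_for_len(len_b, mss);
	unsigned int actual = pcount_for_len(len_a + len_b, mss);

	return summed - actual;
}
```

For example, with an MSS of 1000, two 1500-byte skbs each have a pcount of 2. Adding the pcounts gives 4, but the coalesced 3000 bytes need only 3 segments, so the summed pcount is 1 too large. If both lengths are multiples of the MSS, the two figures agree.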
> It's also possible this bug is related to recent reports of sacked_out
> becoming negative.
>
> Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
Acked-by: Yuchung Cheng <ycheng@...gle.com>

I especially like the comment spelling out the invariant, which is less
explicit in other parts of the GSO code.

> ---
> net/ipv4/tcp_output.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> 1 files changed, 43 insertions(+), 1 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4ff3b6d..13034ad 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2070,6 +2070,48 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
> }
> }
>
> +/* So we can retransmit skb, fragment it to be cur_mss bytes. In
> + * addition, we must maintain the invariant that whatever skbs we
> + * leave in the write queue are integral multiples of the MSS or a
> + * remaining small sub-MSS portion. This means we fragment the skb
> + * into potentially three skbs in the write queue:
> + *
> + * (1) The first skb of exactly 1*cur_mss, which we will retransmit now.
> + * (2) A "bulk" skb that is an integral multiple of the cur_mss
> + * (3) A "left-over" skb that has any remaining portion smaller than cur_mss
> + *
> + * Since either of the two required fragmentation operations can fail
> + * (e.g. due to ENOMEM), and we want this invariant to be maintained
> + * if either fails, we chop off (3) first and then chop off (1).
> + *
> + * Returns non-zero if an error occurred which prevented the full splitting.
> + */
> +static int tcp_retrans_mss_split(struct sock *sk, struct sk_buff *skb,
> +				 unsigned int cur_mss)
> +{
> +	int err;
> +	unsigned int len;
> +
> +	/* Chop off any "left-over" at end that is not aligned to cur_mss. */
> +	if (cur_mss != tcp_skb_mss(skb)) {
> +		len = skb->len - skb->len % cur_mss;
> +		if (len < skb->len) {
> +			err = tcp_fragment(sk, skb, len, cur_mss);
> +			if (err < 0)
> +				return err;
> +		}
> +	}
> +
> +	/* Chop off a single MSS at the beginning to retransmit now. */
> +	if (skb->len > cur_mss) {
> +		err = tcp_fragment(sk, skb, cur_mss, cur_mss);
> +		if (err < 0)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> /* This retransmits one SKB. Policy decisions and retransmit queue
> * state updates are done by the caller. Returns non-zero if an
> * error occurred which prevented the send.
> @@ -2115,7 +2157,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
>  		return -EAGAIN;
>
>  	if (skb->len > cur_mss) {
> -		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
> +		if (tcp_retrans_mss_split(sk, skb, cur_mss))
>  			return -ENOMEM; /* We'll try again later. */
>  	} else {
>  		int oldpcount = tcp_skb_pcount(skb);
> --
> 1.7.7.3
>
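The split order used by tcp_retrans_mss_split() in the patch above can be modeled in userspace to see which lengths end up in the write queue. This is only a sketch under stated assumptions: `struct split_result` and `model_retrans_mss_split()` are hypothetical names, lengths are plain integers, and the model ignores the case where a fragmentation step fails.

```c
#include <assert.h>

/* Hypothetical model of the up-to-three pieces left in the write queue. */
struct split_result {
	unsigned int head;	/* piece (1): retransmitted now */
	unsigned int bulk;	/* piece (2): multiple of cur_mss, may be 0 */
	unsigned int leftover;	/* piece (3): sub-MSS tail, may be 0 */
};

/* Mirrors the order of operations in the patch: chop the unaligned tail
 * first (only when the MSS has changed), then chop one MSS off the front.
 * Like the patched call site, it assumes len > cur_mss on entry.
 */
static struct split_result model_retrans_mss_split(unsigned int len,
						   unsigned int skb_mss,
						   unsigned int cur_mss)
{
	struct split_result r = { len, 0, 0 };

	assert(cur_mss > 0);
	assert(len > cur_mss);

	if (cur_mss != skb_mss) {
		unsigned int aligned = r.head - r.head % cur_mss;

		if (aligned < r.head) {
			r.leftover = r.head - aligned;
			r.head = aligned;
		}
	}

	if (r.head > cur_mss) {
		r.bulk = r.head - cur_mss;
		r.head = cur_mss;
	}

	return r;
}
```

For instance, a 3500-byte skb built with an MSS of 1460 and retransmitted with a cur_mss of 1000 ends up as 1000 + 2000 + 500 bytes in this model, and each piece is either a multiple of cur_mss or smaller than it. Because the tail is chopped first, stopping after that step would still leave a 3000-byte skb and a 500-byte skb, both of which respect the invariant.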