[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4AC357D3.7080606@gmail.com>
Date: Wed, 30 Sep 2009 15:06:27 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
CC: Gilad Ben-Yossef <gilad@...efidence.com>,
Netdev <netdev@...r.kernel.org>, Ori Finkalman <ori@...sleep.com>
Subject: Re: [PATCH] [RFC] IPv4 TCP fails to send window scale option when
window scale is zero
Ilpo Järvinen a écrit :
> On Wed, 30 Sep 2009, Eric Dumazet wrote:
>
>> Gilad Ben-Yossef a écrit :
>>> Eric Dumazet wrote:
>>>
>>>> Gilad Ben-Yossef a écrit :
>>>>
>>>>> From: Ori Finkalman <ori@...sleep.com>
>>>>>
>>>>>
>>>>> Acknowledge TCP window scale support by inserting the proper option in
>>>>> SYN/ACK header
>>>>> even if our window scale is zero.
>>>>>
>>>>>
>>>>> This fixes the following observed behavior:
>>>>>
>>>>>
>>>>> 1. Client sends a SYN with TCP window scaling option and non zero window
>>>>> scale value to a Linux box.
>>>>>
>>>>> 2. Linux box notes large receive window from client.
>>>>>
>>>>> 3. Linux decides on a zero value of window scale for its part.
>>>>>
>>>>> 4. Due to compare against requested window scale size option, Linux does
>>>>> not to send windows scale
>>>>>
>>>>> TCP option header on SYN/ACK at all.
>>>>>
>>>>>
>>>>> Result:
>>>>>
>>>>>
>>>>> Client box thinks TCP window scaling is not supported, since SYN/ACK had
>>>>> no TCP window scale option,
>>>>> while Linux thinks that TCP window scaling is supported (and scale might
>>>>> be non zero), since SYN had
>>>>>
>>>>> TCP window scale option and we have a mismatched idea between the client
>>>>> and server regarding window sizes.
>>>>>
>>>>>
>>>>> Please comment and/or apply.
>>>>> ...
>>>>>
>>>>>
>>>>> Signed-off-by: Gilad Ben-Yossef <gilad@...efidence.com>
>>>>> Signed-off-by: Ori Finkelman <ori@...sleep.com>
>>>>>
>>>>>
>>>>> Index: net/ipv4/tcp_output.c
>>>>> ===================================================================
>>>>> --- net/ipv4/tcp_output.c (revision 46)
>>>>> +++ net/ipv4/tcp_output.c (revision 210)
>>>>> @@ -353,6 +353,7 @@ static void tcp_init_nondata_skb(struct
>>>>> #define OPTION_SACK_ADVERTISE (1 << 0)
>>>>> #define OPTION_TS (1 << 1)
>>>>> #define OPTION_MD5 (1 << 2)
>>>>> +#define OPTION_WSCALE (1 << 3)
>>>>>
>>>>> struct tcp_out_options {
>>>>> u8 options; /* bit field of OPTION_* */
>>>>> @@ -417,7 +418,7 @@ static void tcp_options_write(__be32 *pt
>>>>> TCPOLEN_SACK_PERM);
>>>>> }
>>>>>
>>>>> - if (unlikely(opts->ws)) {
>>>>> + if (unlikely(OPTION_WSCALE & opts->options)) {
>>>>> *ptr++ = htonl((TCPOPT_NOP << 24) |
>>>>> (TCPOPT_WINDOW << 16) |
>>>>> (TCPOLEN_WINDOW << 8) |
>>>>> @@ -530,8 +531,8 @@ static unsigned tcp_synack_options(struc
>>>>>
>>>>> if (likely(ireq->wscale_ok)) {
>>>>> opts->ws = ireq->rcv_wscale;
>>>>> - if(likely(opts->ws))
>>>>> - size += TCPOLEN_WSCALE_ALIGNED;
>>>>> + opts->options |= OPTION_WSCALE;
>>>>> + size += TCPOLEN_WSCALE_ALIGNED;
>>>>> }
>>>>> if (likely(doing_ts)) {
>>>>> opts->options |= OPTION_TS;
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Seems not the more logical places to put this logic...
>>>>
>>>> How about this instead ?
>>>>
>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>>> index 5200aab..b78c084 100644
>>>> --- a/net/ipv4/tcp_output.c
>>>> +++ b/net/ipv4/tcp_output.c
>>>> @@ -216,6 +216,11 @@ void tcp_select_initial_window(int __space, __u32
>>>> mss,
>>>> space >>= 1;
>>>> (*rcv_wscale)++;
>>>> }
>>>> + /*
>>>> + * Set a minimum wscale of 1
>>>> + */
>>>> + if (*rcv_wscale == 0)
>>>> + *rcv_wscale = 1;
>>>> }
>>>>
>>>> /* Set initial window to value enough for senders,
>>>>
>>>>
>>> Thank you for the patch review. The suggested replacement patch
>>> certainly is shorter, code wise, which is an advantage.
>>>
>>> I cant help but feel though, that it is less readable - a window scale
>>> of zero is a perfectly legit value. Adding special logic to rule it out
>>> just because we chose to overload this setting for something else
>>> (whether window scaling is supported or not) seems like an invitation
>>> for someone to get it wrong again down the line, in my opinion.
>> As a matter of fact I didnot test your patch.
>>
>> My reaction was driven by :
>>
>> Your version slows down the tcp_options_write() function, once per tx packet.
>
> Are you serious that anding would cost that much? :-/
Not really :)
>
>> tcp_options_write() should not change socket state,
>
> I fail to see how his patch was changing socket state in anyway in
> anywhere?
Me too, now you say it :)
>
>> while
>> tcp_select_initial_window() is the exact place where we are supposed to
>> compute wscale.
>
> And it calculated yielding to result of 0, which is perfectly valid. The
> problem is that tcp_write_options thinks that 0 is indication of no window
> scaling, instead of the correct interpretation of zero window scaling
> which makes the huge difference for the opposite direction traffic as
> these guys have noted. Not that I find your approach that bad either as
> we only lose 1-bit accuracy for the window which is rather insignificant
> as 1-byte window increments do not really make that much sense anyway
> (and we have to specifically code against doing them anyway so the
> effective granularity is much higher).
Yes, wscale 0 is RFC valid, but are we sure some equipment wont play funny games
with such value ? At least sending "wscale 1-14" must be working...
My quick&dirty patch was only for discussion, I have no strong opinion on it,
only that was on one place to patch instead of two/three/four I dont know yet.
So please Gilad & Ori send us a new patch :)
>
>> Also how is managed tcp_syn_options() case (for outgoing connections ?)
>>
>> if (likely(sysctl_tcp_window_scaling)) {
>> opts->ws = tp->rx_opt.rcv_wscale;
>> if (likely(opts->ws))
>> size += TCPOLEN_WSCALE_ALIGNED;
>> }
>>
>> Dont you need to patch it as well ?
>
> One certainly should change that too if that patch is the way to go
> forward.
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists