Date: Tue, 6 Jun 2023 09:54:03 -0500
From: Mike Freemon <mfreemon@...udflare.com>
To: Stephen Hemminger <stephen@...workplumber.org>
Cc: netdev@...r.kernel.org, kernel-team@...udflare.com
Subject: Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to
 honor memory limits


On 6/5/23 17:42, Stephen Hemminger wrote:
> On Mon,  5 Jun 2023 15:38:57 -0500
> Mike Freemon <mfreemon@...udflare.com> wrote:
> 
>> From: "mfreemon@...udflare.com" <mfreemon@...udflare.com>
>>
>> Under certain circumstances, the tcp receive buffer memory limit
>> set by autotuning is ignored, and the receive buffer can grow
>> unrestrained until it reaches tcp_rmem[2].
>>
>> To reproduce:  Connect a TCP session with the receiver doing
>> nothing and the sender sending small packets (an infinite loop
>> of socket send() with 4 bytes of payload with a sleep of 1 ms
>> in between each send()).  This will fill the tcp receive buffer
>> all the way to tcp_rmem[2], ignoring the autotuning limit
>> (sk_rcvbuf).
>>
>> As a result, a host can have individual tcp sessions with receive
>> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
>> limits, causing the host to go into tcp memory pressure mode.
>>
>> The fundamental issue is the relationship between the granularity
>> of the window scaling factor and the number of bytes ACKed back
>> to the sender.  This problem has previously been identified in
>> RFC 7323, appendix F [1].
>>
>> The Linux kernel currently adheres to never shrinking the window.
>>
>> In addition to the overallocation of memory mentioned above, this
>> is also functionally incorrect, because once tcp_rmem[2] is
>> reached, the receiver will drop in-window packets resulting in
>> retransmissions and an eventual timeout of the tcp session.  A
>> receive buffer full condition should instead result in a zero
>> window and an indefinite wait.
>>
>> In practice, this problem is largely hidden for most flows.  It
>> is not applicable to mice flows.  Elephant flows can send data
>> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
>> triggering a zero window.
>>
>> But this problem does show up for other types of flows.  Good
>> examples are websockets and other types of flows that send small
>> amounts of data spaced apart slightly in time.  In these cases,
>> we directly encounter the problem described in [1].
>>
>> RFC 7323, section 2.4 [2], says there are instances when a retracted
>> window can be offered, and that TCP implementations MUST ensure
>> that they handle a shrinking window, as specified in RFC 1122,
>> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
>> management have made clear that the sender must accept a shrunk
>> window from the receiver, including RFC 793 [4] and RFC 1323 [5].
>>
>> This patch implements the functionality to shrink the tcp window
>> when necessary to keep the right edge within the memory limit set
>> by autotuning (sk_rcvbuf).  This new functionality is enabled with
>> the following sysctl:
>>
>> sysctl: net.ipv4.tcp_shrink_window
>>
>> This sysctl changes how the TCP window is calculated.
>>
>> If sysctl tcp_shrink_window is zero (the default value), then the
>> window is never shrunk.
>>
>> If sysctl tcp_shrink_window is non-zero, then the memory limit
>> set by autotuning is honored.  This requires that the TCP window
>> be shrunk ("retracted") as described in RFC 1122.
>>
>> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
>> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
>> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
>> [4] https://www.rfc-editor.org/rfc/rfc793
>> [5] https://www.rfc-editor.org/rfc/rfc1323
>>
>> Signed-off-by: Mike Freemon <mfreemon@...udflare.com>
> 
> Does Linux TCP really need another tuning parameter?

It's useful to make testing faster, i.e. comparing enabled vs disabled.
It could also be useful as a quick diagnostic test, i.e. someone is
having a problem and they want to quickly eliminate this patch as a
cause.
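
As a rough illustration of that quick diagnostic check, here is a minimal
Python sketch that reads the current value of the sysctl from /proc/sys.
The helper names (sysctl_path, read_sysctl) are hypothetical, and the
sketch assumes the usual mapping of a dotted sysctl name to a path under
/proc/sys.  It returns None when the kernel does not have the patch.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under root (hypothetical helper)."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the sysctl value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except (FileNotFoundError, NotADirectoryError, PermissionError):
        # The knob is absent (e.g. the patch is not applied) or unreadable.
        return None


# Example: value = read_sysctl("net.ipv4.tcp_shrink_window")
```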

But I left it in mainly as a risk response.  This patch requires that
the peer TCP implementation (the one receiving the shrunk window
advertisement) handle the shrinking window correctly.  This patch has
been deployed at Cloudflare and we have not discovered any cases where
the peer TCP fails to be RFC compliant.  But we cannot rule out the
possibility completely.  The concern is someone running old software
on a non-public network that does not handle a shrinking window.
Simply disabling this feature via a sysctl parameter seems like a good
solution for that situation.
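
To make the window-granularity issue from the quoted description
concrete, here is a simplified Python model.  It is a sketch, not the
kernel code: the function names are hypothetical, and it assumes the
receiver has already hit its memory limit and would like to advertise a
zero window on every ACK.  With shrinking disallowed, the advertised
window can only be kept or rounded up to a multiple of 2**shift, so the
right edge creeps forward with every small segment.  With shrinking
allowed, the window closes on the first ACK.

```python
def align_up(value, granularity):
    """Round value up to the next multiple of granularity."""
    return -(-value // granularity) * granularity


def next_window(rcv_nxt, right_edge, wanted, shift, allow_shrink):
    """Return the window to advertise, always a multiple of 2**shift.

    wanted is the window the receiver would like to offer given its memory
    limit; right_edge is the highest sequence number offered so far.
    """
    granularity = 1 << shift
    offered = right_edge - rcv_nxt         # still-open part of the old window
    wanted = (wanted >> shift) << shift    # only multiples are expressible
    if allow_shrink or wanted >= offered:
        return wanted
    # Never-shrink rule: keep what was already offered, rounded up so the
    # scaled value does not fall short of the old right edge.
    return align_up(offered, granularity)


def simulate(segments, payload, shift, initial_window, allow_shrink):
    """Receive small segments while the receiver wants a zero window."""
    rcv_nxt, right_edge, window = 0, initial_window, initial_window
    for _ in range(segments):
        rcv_nxt += payload
        window = next_window(rcv_nxt, right_edge, 0, shift, allow_shrink)
        right_edge = rcv_nxt + window
        if window == 0:
            break  # a zero window stops the sender
    return window, right_edge
```

In this model, simulate(1000, 4, 7, 65536, False) leaves the window at
65536 while the right edge has advanced by 4000 bytes, whereas the same
call with allow_shrink=True returns a zero window after the first
segment.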

If the consensus is to not have a sysctl parameter, I am happy to
remove it.

A related question:  If we leave it in, what do we think the default
value should be?  It's disabled by default right now, but that is just
me being conservative.  If we are comfortable enabling this by default,
I'm happy to do that too.

> Will tests get run with both feature on and off?

More background and details about the patch are here, including the test
results you're looking for:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
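
For anyone who wants to try the reproduction from the quoted patch
description locally, a minimal Python sketch follows.  It is an
assumption-laden illustration, not the test harness used for the results
above: the function names are hypothetical, the loop is bounded by a
count rather than infinite, and it uses loopback by default.  The
receiver side only listens and never reads.

```python
import socket
import time


def open_idle_receiver(host="127.0.0.1", port=0):
    """Listen on host:port; the owner of this socket never calls recv()."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def send_small_packets(addr, count, payload=b"abcd", interval=0.001):
    """Send count small payloads with a sleep in between; return bytes sent."""
    sent = 0
    with socket.create_connection(addr) as s:
        # Disable Nagle so each 4-byte send() is not coalesced by the sender.
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            s.sendall(payload)  # blocks once the peer's window and our buffer fill
            sent += len(payload)
            time.sleep(interval)
    return sent
```

Calling open_idle_receiver() and then send_small_packets() against the
address from getsockname() with a large count approximates the scenario.
The receive-side memory can be watched with a tool such as "ss -tmi"
while the sender is still connected.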

> What default will distributions ship with?

I'm not sure how to answer this.  Isn't that up to the distributions to decide?
