Message-ID: <7B78B532-4EA5-4E5F-BB57-52A887302255@amazon.com>
Date: Mon, 7 Dec 2020 17:49:33 +0000
From: "Mohamed Abuelfotoh, Hazem" <abuehaze@...zon.com>
To: Eric Dumazet <edumazet@...gle.com>
CC: netdev <netdev@...r.kernel.org>,
"stable@...r.kernel.org" <stable@...r.kernel.org>,
Yuchung Cheng <ycheng@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Wei Wang <weiwan@...gle.com>,
"Strohman, Andy" <astroh@...zon.com>,
"Herrenschmidt, Benjamin" <benh@...zon.com>
Subject: Re: [PATCH net] tcp: fix receive buffer autotuning to trigger for any valid
advertised MSS
> I find using icsk->icsk_ack.rcv_mss misleading.
> I would either use TCP_MSS_DEFAULT , or maybe simply 0, since we had
> no samples yet, there is little point to use a magic value.
My point is that, by definition, rcvq_space is used by TCP's internal auto-tuning to grow the socket buffers based on how much data the kernel estimates the sender can send, so we are talking about an estimate anyway, and it won't be 100% accurate, especially during connection initialization. I proposed TCP_INIT_CWND * icsk->icsk_ack.rcv_mss as the initial receive space because that is the minimum the sender can send, assuming it fills its initial congestion window. In my opinion this shouldn't have a security impact, because the receive buffer and receive window won't scale up unless the sender delivers more than 5360 bytes in one round trip, so anything lower than that won't trigger TCP autotuning.
Another point is that the proposed patch only affects the initial rcvq_space.space; it has no impact on how the receive window/receive buffer scales while the connection is ongoing.
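For reference, the 5360 figure falls out of how icsk_ack.rcv_mss is seeded before any data segment has been seen. A rough sketch of tcp_initialize_rcv_mss() in net/ipv4/tcp_input.c (from memory of current mainline, so treat the exact body as an approximation):

void tcp_initialize_rcv_mss(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	unsigned int hint = min_t(unsigned int, tp->advmss, tp->mss_cache);

	hint = min(hint, tp->rcv_wnd / 2);
	hint = min(hint, TCP_MSS_DEFAULT);	/* never above 536 initially */
	hint = max(hint, TCP_MIN_MSS);		/* and never below 88 */

	inet_csk(sk)->icsk_ack.rcv_mss = hint;
}

Because rcv_mss starts out clamped to at most TCP_MSS_DEFAULT (536) and TCP_INIT_CWND is 10, the proposed initial rcvq_space.space is at most 10 * 536 = 5360 bytes; tcp_measure_rcv_mss() then refines rcv_mss as real segments arrive.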
I was actually considering the options below before settling on my final patch, but I was afraid they would either have a higher memory footprint or let the connection get stuck later under packet loss.
I am listing them here as well.
A)The patch below would be enough, since the amount of data copied to user space would usually be more than 50KB, so the window will still scale.
# diff -u a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c 2020-11-18 19:54:23.624306129 +0000
+++ b/net/ipv4/tcp_input.c 2020-11-18 19:55:05.032259419 +0000
@@ -605,7 +605,7 @@
 	/* Number of bytes copied to user in last RTT */
 	copied = tp->copied_seq - tp->rcvq_space.seq;
-	if (copied <= tp->rcvq_space.space)
+	if (copied <= (tp->rcvq_space.space >> 1))
 		goto new_measure;
Pros:
-It halves the threshold at which we scale up the receive buffer and advertised window, so the connection should no longer get stuck with the current default kernel configuration on high-RTT paths (rough numbers after the cons below).
-The likelihood of getting stuck later is low, because the check is dynamic and affects DRS throughout the whole connection lifetime, not just during the initialization phase.
Cons:
-This may have a higher memory footprint, because we grow the receive buffer at a lower threshold than before.
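To make the first pro concrete, here are rough numbers for the 9001-MTU, high-RTT case described in the patch; these are illustrative assumptions, not measured values:

/* Illustrative only, assuming MTU 9001 and the defaults after
 * commit a337531b942b:
 *   advmss                  ~ 8961 bytes (9001 - 40 bytes of IP/TCP headers)
 *   TCP_INIT_CWND * advmss  = 10 * 8961 = 89610 bytes
 *   initial rcv_wnd         ~ 64 KB
 *   rcvq_space.space        = min(rcv_wnd, 89610) = rcv_wnd
 * The sender can never deliver more than rcv_wnd in one round trip, so
 * "copied <= tp->rcvq_space.space" always holds and DRS never grows the
 * buffer.  With the ">> 1" above, the threshold drops to roughly 32 KB,
 * which the sender can exceed within one round trip.
 */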
################################################################################################################################
B)Otherwise we could look into decreasing the initial rcv_wnd, and accordingly rcvq_space.space, by lowering the default ipv4.sysctl_tcp_rmem[1] from 131072 back to 87380, as it was before https://lore.kernel.org/patchwork/patch/1157936/ (a sketch follows below).
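A sketch of what B could look like; I am assuming the default is still assigned in tcp_init() in net/ipv4/tcp.c, as it was when the value was raised, so the exact file and context lines should be double-checked:

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ (inside tcp_init(); line numbers omitted)
-	init_net.ipv4.sysctl_tcp_rmem[1] = 131072;
+	init_net.ipv4.sysctl_tcp_rmem[1] = 87380;

The same effect can be had at runtime on a given box through the net.ipv4.tcp_rmem sysctl, but changing the compiled-in default is what would revert the behaviour for everyone.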
#################################################################################################################################
C)Another option would be to bound the current rcvq_space.space by the amount copied to user space in the previous RTT; something like the change below would also work.
--- a/net/ipv4/tcp_input.c 2020-11-19 15:43:10.441524021 +0000
+++ b/net/ipv4/tcp_input.c 2020-11-19 15:45:42.772614521 +0000
@@ -649,6 +649,7 @@
 	tp->rcvq_space.space = copied;
 
 new_measure:
+	tp->rcvq_space.space = copied;
 	tp->rcvq_space.seq = tp->copied_seq;
 	tp->rcvq_space.time = tp->tcp_mstamp;
 }
Cons:
-When I emulated packet loss after connection establishment, I could see the connection stuck at low speed for a long time.
On 07/12/2020, 16:09, "Eric Dumazet" <edumazet@...gle.com> wrote:
On Mon, Dec 7, 2020 at 4:37 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Dec 7, 2020 at 12:41 PM Hazem Mohamed Abuelfotoh
> <abuehaze@...zon.com> wrote:
> >
> > Previously, receiver buffer auto-tuning started after receiving
> > one advertised window amount of data. After the initial
> > receiver buffer was raised by
> > commit a337531b942b ("tcp: up initial rmem to 128KB
> > and SYN rwin to around 64KB"), it may take too long for
> > TCP autotuning to start raising
> > the receiver buffer size.
> > commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
> > tried to decrease the threshold at which TCP auto-tuning starts
> > but it doesn't work well in some environments
> > where the receiver has a large MTU (9001), especially on high-RTT
> > connections, since in these environments rcvq_space.space will be the
> > same as rcv_wnd, so TCP autotuning will never start because the
> > sender can't send more than rcv_wnd in one round trip.
> > To address this issue, this patch decreases the initial
> > rcvq_space.space so that TCP autotuning kicks in whenever the sender is
> > able to send more than 5360 bytes in one round trip, regardless of the
> > receiver's configured MTU.
> >
> > Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
> > Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
> >
> > Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@...zon.com>
> > ---
> > net/ipv4/tcp_input.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 389d1b340248..f0ffac9e937b 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
> > static void tcp_init_buffer_space(struct sock *sk)
> > {
> > int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
> > + struct inet_connection_sock *icsk = inet_csk(sk);
> > struct tcp_sock *tp = tcp_sk(sk);
> > int maxwin;
> >
> > if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
> > tcp_sndbuf_expand(sk);
> >
> > - tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
> > + tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
>
0 will not work, since we use a do_div(grow, tp->rcvq_space.space)
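The division in question is in tcp_rcv_space_adjust(); roughly, from memory, so treat the surrounding lines as approximate:

	u64 rcvwin, grow;

	/* minimal window to cope with packet losses, assuming
	 * steady state.  Add some cushion because of small variations.
	 */
	rcvwin = ((u64)copied << 1) + 16 * tp->advmss;

	/* Accommodate for sender rate increase (eg. slow start) */
	grow = rcvwin * (copied - tp->rcvq_space.space);
	do_div(grow, tp->rcvq_space.space);	/* divides by zero if space is 0 */
	rcvwin += (grow << 1);

so an initial rcvq_space.space of 0 would hit a divide-by-zero the first time this path runs.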
>
> Note that if a driver uses 16KB of memory to hold a 1500 bytes packet,
> then a 10 MSS GRO packet is consuming 160 KB of memory,
> which is bigger than tcp_rmem[1]. TCP could decide to drop these fat packets.
>
> I wonder if your patch does not work around a more fundamental issue,
> I am still unable to reproduce the issue.