Message-ID: <CANn89iJ0SYX_oxjZE_2BOLzWXemA2mZeMeOdPoEFiu-AxE2GMQ@mail.gmail.com>
Date: Thu, 13 Oct 2022 15:02:26 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Wei Wang <weiwan@...gle.com>, netdev@...r.kernel.org,
"David S . Miller" <davem@...emloft.net>, cgroups@...r.kernel.org,
linux-mm@...ck.org, Shakeel Butt <shakeelb@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Neil Spring <ntspring@...a.com>, ycheng@...gle.com
Subject: Re: [PATCH net-next] net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()
On Thu, Oct 13, 2022 at 2:49 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> > This patch is causing us a bit of pain for workloads running with just
> > memory.max set. After this change the TCP rx path no longer forces the
> > charging.
> >
> > Any recommendation for the fix? Setting memory.high a few MB under
> > memory.max seems to remove the failures.
>
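For context, the charge attempt in the rx path now has roughly the shape
sketched below. This is a paraphrase from memory of __sk_mem_raise_allocated()
and gfp_memcg_charge() after the patch, not an exact quote of any tree: in
softirq context the memcg charge runs with GFP_NOWAIT and without
__GFP_NOFAIL, so it can simply fail where it used to be forced through.

        /* Hedged sketch of the post-patch charge path; exact code and
         * guards differ between trees.
         */
        static inline gfp_t gfp_memcg_charge(void)
        {
                /* softirq (rx) path: no reclaim allowed */
                return in_softirq() ? GFP_NOWAIT : GFP_KERNEL;
        }

        /* ... inside __sk_mem_raise_allocated(): */
        bool charged;

        if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
            !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
                                                gfp_memcg_charge())))
                /* Before the patch a failed charge was still forced
                 * (__GFP_NOFAIL semantics); now the failure propagates
                 * to the caller and the rx path sees it immediately.
                 */
                goto suppress_allocation;
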
> Eric, is there anything that would make TCP perform particularly
> poorly under memory pressure?
>
> Dropping and pruning happen a lot here:
>
> # nstat -a | grep -i -E 'Prune|Drop'
> TcpExtPruneCalled 1202577 0.0
> TcpExtOfoPruned 734606 0.0
> TcpExtTCPOFODrop 64191 0.0
> TcpExtTCPRcvQDrop 384305 0.0
>
> Same workload on 5.6 kernel:
>
> TcpExtPruneCalled 1223043 0.0
> TcpExtOfoPruned 3377 0.0
> TcpExtListenDrops 10596 0.0
> TcpExtTCPOFODrop 22 0.0
> TcpExtTCPRcvQDrop 734 0.0
>
> From a quick look at the code, and with what Shakeel explained in mind -
> previously we would have "loaded up the cache" after the first failed
> try, so we never got into the loop inside tcp_try_rmem_schedule(), which
> most likely nukes the entire OFO queue:
>
> static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
>                                  unsigned int size)
> {
>         if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
>             !sk_rmem_schedule(sk, skb, size)) {
>                 /* ^ would fail but "load up the cache" ^ */
>
>                 if (tcp_prune_queue(sk) < 0)
>                         return -1;
>
>                 /* v this one would not fail due to the cache v */
>                 while (!sk_rmem_schedule(sk, skb, size)) {
>                         if (!tcp_prune_ofo_queue(sk))
>                                 return -1;
>                 }
>         }
>         return 0;
> }
>
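For reference, the check being retried in that loop is sk_rmem_schedule().
The sketch below paraphrases the helper as it appears in include/net/sock.h
(from memory, so treat details as approximate): it passes outright if the
socket's forward-allocated credit covers the size, and only otherwise goes
back into __sk_mem_schedule() and hence the memcg charge.

        /* Roughly as in include/net/sock.h (paraphrased sketch): */
        static inline bool
        sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
        {
                if (!sk_has_account(sk))
                        return true;
                /* Succeeds immediately if the forward-allocated credit
                 * already covers the size; only otherwise does the retry
                 * reach __sk_mem_schedule() and the memcg charge again.
                 */
                return size <= sk->sk_forward_alloc ||
                        __sk_mem_schedule(sk, size, SK_MEM_RECV) ||
                        skb_pfmemalloc(skb);
        }
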
> Neil mentioned that he's seen multi-second stalls when SACKed segments
> get dropped from the OFO queue. The sender waits a very long time before
> retransmitting something that was already SACKed if the receiver keeps
> SACKing new, later segments, even when an ACK covering the previously-SACKed
> block should prove to the sender that something is very wrong.
>
> I tried to repro this with packetdrill and it's not exactly what I see;
> I need to keep shortening the RTT, otherwise the retransmit comes out
> before the next SACK arrives.
>
> I'll try to read the code, and maybe I'll get lucky and manage to capture
> the exact impacted flows :S But does anything of this nature ring a
> bell?
>
> `../common/defaults.sh`
>
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> +.1 < . 1:1(0) ack 1 win 2048
> +0 accept(3, ..., ...) = 4
>
> +0 write(4, ..., 60000) = 60000
> +0 > P. 1:10001(10000) ack 1
>
> // Do some SACK-ing
> +.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
> +.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
> // ..and we pretend we lost 1001:2001
> +.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>
>
> // re-xmit holes and send more
> +0 > . 10001:11001(1000) ack 1
> +0 > . 1:1001(1000) ack 1
> +0 > . 2001:3001(1000) ack 1 win 256
> +0 > P. 11001:13001(2000) ack 1 win 256
> +0 > P. 13001:15001(2000) ack 1 win 256
>
> +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>
>
> +0 > P. 15001:18001(3000) ack 1 win 256
> +0 > P. 18001:20001(2000) ack 1 win 256
> +0 > P. 20001:22001(2000) ack 1 win 256
>
> +.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>
>
> +0 > P. 22001:24001(2000) ack 1 win 256
> +0 > P. 24001:26001(2000) ack 1 win 256
> +0 > P. 26001:28001(2000) ack 1 win 256
> +0 > . 28001:29001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>
> +0 > P. 29001:31001(2000) ack 1 win 256
> +0 > P. 31001:33001(2000) ack 1 win 256
> +0 > P. 33001:35001(2000) ack 1 win 256
> +0 > . 35001:36001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>
>
> +0 > P. 36001:38001(2000) ack 1 win 256
> +0 > P. 38001:40001(2000) ack 1 win 256
> +0 > P. 40001:42001(2000) ack 1 win 256
> +0 > . 42001:43001(1000) ack 1 win 256
>
> +0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>
>
> +0 > P. 43001:45001(2000) ack 1 win 256
> +0 > P. 45001:47001(2000) ack 1 win 256
> +0 > P. 47001:49001(2000) ack 1 win 256
> +0 > . 49001:50001(1000) ack 1 win 256
>
> +0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>
>
> +0 > P. 50001:52001(2000) ack 1 win 256
> +0 > P. 52001:54001(2000) ack 1 win 256
> +0 > P. 54001:56001(2000) ack 1 win 256
> +0 > . 56001:57001(1000) ack 1 win 256
>
> +0.04 > . 1001:2001(1000) ack 1 win 256
>
>
This is SACK reneging; I would have to double-check Linux behavior, but
reverting to a timeout could very well happen.
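
If the stack does treat this as reneging, the sender-side reaction looks
roughly like the sketch below, paraphrased from memory of
tcp_check_sack_reneging() in net/ipv4/tcp_input.c (the exact code may
differ): rather than retransmitting right away, it re-arms the retransmit
timer with a short delay derived from srtt, which would line up with the
stalls Neil described.

        /* Hedged paraphrase of tcp_check_sack_reneging(): on detected
         * reneging, arm a delayed retransmit timer (a fraction of the
         * smoothed RTT, floored at 10ms) instead of retransmitting now.
         */
        static bool tcp_check_sack_reneging(struct sock *sk, int flag)
        {
                if (flag & FLAG_SACK_RENEGING) {
                        struct tcp_sock *tp = tcp_sk(sk);
                        unsigned long delay = max(usecs_to_jiffies(tp->srtt_us >> 4),
                                                  msecs_to_jiffies(10));

                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                                  delay, TCP_RTO_MAX);
                        return true;
                }
                return false;
        }
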
> +.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
>