Message-ID: <20221013144950.44b52f90@kernel.org>
Date: Thu, 13 Oct 2022 14:49:50 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Wei Wang <weiwan@...gle.com>, netdev@...r.kernel.org,
"David S . Miller" <davem@...emloft.net>, cgroups@...r.kernel.org,
linux-mm@...ck.org, Shakeel Butt <shakeelb@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Neil Spring <ntspring@...a.com>, ycheng@...gle.com
Subject: Re: [PATCH net-next] net-memcg: pass in gfp_t mask to
mem_cgroup_charge_skmem()
On Wed, 12 Oct 2022 16:33:00 -0700 Jakub Kicinski wrote:
> This patch is causing a little bit of pain to us, to workloads running
> with just memory.max set. After this change the TCP rx path no longer
> forces the charging.
>
> Any recommendation for the fix? Setting memory.high a few MB under
> memory.max seems to remove the failures.
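(For concreteness - a sketch of that workaround; the cgroup path and
sizes below are made up, the point is just to keep memory.high a few MB
under memory.max so reclaim kicks in before hard-limit charge failures:)

#include <stdio.h>

#define MiB (1024ULL * 1024)

int main(void)
{
        unsigned long long max = 8192 * MiB;    /* assume memory.max = 8G */
        FILE *f = fopen("/sys/fs/cgroup/workload/memory.high", "w");

        if (!f)
                return 1;
        fprintf(f, "%llu\n", max - 16 * MiB);   /* a few MB of headroom */
        fclose(f);
        return 0;
}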
Eric, is there anything that would make TCP perform particularly
poorly under memory pressure?

Dropping and pruning happen a lot here:
# nstat -a | grep -i -E 'Prune|Drop'
TcpExtPruneCalled 1202577 0.0
TcpExtOfoPruned 734606 0.0
TcpExtTCPOFODrop 64191 0.0
TcpExtTCPRcvQDrop 384305 0.0
Same workload on a 5.6 kernel:
TcpExtPruneCalled 1223043 0.0
TcpExtOfoPruned 3377 0.0
TcpExtListenDrops 10596 0.0
TcpExtTCPOFODrop 22 0.0
TcpExtTCPRcvQDrop 734 0.0
From a quick look at the code, and with what Shakeel explained in mind -
previously we would have "loaded up the cache" after the first failed
charge, so we never entered the loop inside tcp_try_rmem_schedule()
which can nuke the entire OFO queue:
static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
                                 unsigned int size)
{
        if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
            !sk_rmem_schedule(sk, skb, size)) {
                /* ^ would fail but "load up the cache" ^ */
                if (tcp_prune_queue(sk) < 0)
                        return -1;

                /* v this one would not fail due to the cache v */
                while (!sk_rmem_schedule(sk, skb, size)) {
                        if (!tcp_prune_ofo_queue(sk))
                                return -1;
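For reference, my rough understanding of what the old
mem_cgroup_charge_skmem() did (paraphrased from memory, not the exact
mm/memcontrol.c code; GFP flags and helper names are approximate):

/* Old behavior, roughly: if the normal charge fails, force the charge
 * anyway.  Forcing overcharges the memcg (and leaves the per-cpu stock
 * refilled), so the immediately following sk_rmem_schedule() succeeds
 * "from the cache" - hence no pruning loop before this patch.
 */
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
        if (try_charge(memcg, GFP_KERNEL, nr_pages) == 0)
                return true;    /* charged within the limit */

        try_charge(memcg, GFP_KERNEL | __GFP_NOFAIL, nr_pages);
        return false;           /* caller raises the pressure flags */
}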
Neil mentioned that he's seen multi-second stalls when SACKed segments
get dropped from the OFO queue. The sender waits a very long time before
retransmitting something that was already SACKed if the receiver keeps
SACKing new, later segments - even when the cumulative ACK reaches the
previously-SACKed block, which should prove to the sender that something
is very wrong.
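For context, the only reneging handling I can find is
tcp_check_sack_reneging(), which (paraphrasing tcp_input.c, possibly
inexactly) arms a short retransmit timer instead of retransmitting
right away - so each ACK that still shows reneging can keep pushing
the actual retransmission out:

/* Paraphrased from net/ipv4/tcp_input.c: if the ACK points into data we
 * remember as SACKed, the receiver must have dropped it; do RTO-like
 * processing by (re)arming the retransmit timer.
 */
static bool tcp_check_sack_reneging(struct sock *sk, int flag)
{
        if (flag & FLAG_SACK_RENEGING) {
                struct tcp_sock *tp = tcp_sk(sk);
                unsigned long delay = max(usecs_to_jiffies(tp->srtt_us >> 4),
                                          msecs_to_jiffies(10));

                inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                          delay, TCP_RTO_MAX);
                return true;
        }
        return false;
}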
I tried to repro this with packetdrill and it's not exactly what I see:
I need to keep shortening the RTT, otherwise the retx comes out before
the next SACK arrives.
I'll try to read the code, and maybe I'll get lucky and manage to
capture the exact impacted flows :S But does anything of this nature
ring a bell? Here's the packetdrill I've been using:
`../common/defaults.sh`
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 8>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.1 < . 1:1(0) ack 1 win 2048
+0 accept(3, ..., ...) = 4
+0 write(4, ..., 60000) = 60000
+0 > P. 1:10001(10000) ack 1
// Do some SACK-ing
+.1 < . 1:1(0) ack 1 win 513 <sack 1001:2001,nop,nop>
+.001 < . 1:1(0) ack 1 win 513 <sack 1001:2001 3001:4001 5001:6001,nop,nop>
// ..and we pretend we lost 1001:2001
+.001 < . 1:1(0) ack 1 win 513 <sack 2001:10001,nop,nop>
// re-xmit holes and send more
+0 > . 10001:11001(1000) ack 1
+0 > . 1:1001(1000) ack 1
+0 > . 2001:3001(1000) ack 1 win 256
+0 > P. 11001:13001(2000) ack 1 win 256
+0 > P. 13001:15001(2000) ack 1 win 256
+.1 < . 1:1(0) ack 1001 win 513 <sack 2001:15001,nop,nop>
+0 > P. 15001:18001(3000) ack 1 win 256
+0 > P. 18001:20001(2000) ack 1 win 256
+0 > P. 20001:22001(2000) ack 1 win 256
+.1 < . 1:1(0) ack 1001 win 513 <sack 2001:22001,nop,nop>
+0 > P. 22001:24001(2000) ack 1 win 256
+0 > P. 24001:26001(2000) ack 1 win 256
+0 > P. 26001:28001(2000) ack 1 win 256
+0 > . 28001:29001(1000) ack 1 win 256
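// keep shortening the ACK gaps so the next SACK arrives
// before the retx timer fires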
+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>
+0 > P. 29001:31001(2000) ack 1 win 256
+0 > P. 31001:33001(2000) ack 1 win 256
+0 > P. 33001:35001(2000) ack 1 win 256
+0 > . 35001:36001(1000) ack 1 win 256
+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:36001,nop,nop>
+0 > P. 36001:38001(2000) ack 1 win 256
+0 > P. 38001:40001(2000) ack 1 win 256
+0 > P. 40001:42001(2000) ack 1 win 256
+0 > . 42001:43001(1000) ack 1 win 256
+0.05 < . 1:1(0) ack 1001 win 257 <sack 2001:43001,nop,nop>
+0 > P. 43001:45001(2000) ack 1 win 256
+0 > P. 45001:47001(2000) ack 1 win 256
+0 > P. 47001:49001(2000) ack 1 win 256
+0 > . 49001:50001(1000) ack 1 win 256
+0.04 < . 1:1(0) ack 1001 win 257 <sack 2001:50001,nop,nop>
+0 > P. 50001:52001(2000) ack 1 win 256
+0 > P. 52001:54001(2000) ack 1 win 256
+0 > P. 54001:56001(2000) ack 1 win 256
+0 > . 56001:57001(1000) ack 1 win 256
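// the reneged 1001:2001 finally gets retransmitted: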
+0.04 > . 1001:2001(1000) ack 1 win 256
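// ..and pretend the receiver pruned its OFO queue again:
// SACK shrinks back to 2001:29001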
+.1 < . 1:1(0) ack 1001 win 257 <sack 2001:29001,nop,nop>