Date:   Fri, 10 Jun 2022 17:16:20 +0200
From:   Ronny Meeus <ronny.meeus@...il.com>
To:     Eric Dumazet <erdnetdev@...il.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: TCP socket send return EAGAIN unexpectedly when sending small fragments

On Fri, 10 Jun 2022 at 16:21, Eric Dumazet <erdnetdev@...il.com> wrote:
>
>
> On 6/10/22 05:48, Ronny Meeus wrote:
> > Hello
> >
> > I have a small test application written in C that creates a local
> > (in-process) TCP channel via loopback.
> > (The kernel version is 3.10.0, but the same issue is also seen on,
> > for example, a 4.9 kernel.)
> >
> > On the client side, an SO_SNDBUF of 5000 bytes is configured (which
> > the kernel doubles), while on the server side the default size is
> > used.
> > Both the client and the server are running in non-blocking mode.
> >
> > The server side is not reading its socket, so all data is simply
> > queued in the socket's receive buffer.
> > The client writes data into the socket in a loop until the write
> > returns an error (EAGAIN in this case).
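
(An illustrative sketch of the client loop described above -- not the
actual tcp_client_server source; the port number and the
already-listening, never-reading server are assumptions made for the
example:)

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    char buf[106] = {0};                 /* message size under test */
    int sndbuf = 5000, count = 0;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(6000) }; /* hypothetical port */

    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    connect(fd, (struct sockaddr *)&addr, sizeof(addr)); /* server never read()s */
    fcntl(fd, F_SETFL, O_NONBLOCK);      /* non-blocking writes */

    while (write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf))
        count++;                         /* loop until write() fails */

    printf("ERRNO = %d count=%d\n", errno, count);
    close(fd);
    return 0;
}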
> >
> > Depending on the size of the data I send on this socket, I get the
> > EAGAIN on the client side after only a small number of messages.
> > For example, when sending 106-byte messages, I see the first EAGAIN
> > after 21 write calls:
> > # ./tcp_client_server 106
> > using data size 106
> > client send buffer size 5000
> > client socket snd_buf: 10000
> > ERRNO = 11 count=21
> >
> > The same is observed for all sizes smaller than or equal to 107 bytes.
> >
> > When getting the socket stats using ss, I see all the data (21 * 106
> > = 2226 bytes) pending in the receive queue on the server side:
> > # ss -tmi  | grep -i X11 -A 1
> > ESTAB      0      0      127.0.0.1:59792                127.0.0.1:x11
> > skmem:(r0,rb1061296,t0,tb10000,f0,w0,o0,bl0,d0) cubic wscale:7,7
> > rto:202 rtt:1.276/2.504 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> > bytes_acked:2227 segs_out:24 segs_in:18 send 1372.3Mbps lastsnd:3883
> > lastrcv:771546999 lastack:3883 pacing_rate 2744.3Mbps retrans:0/1
> > rcv_space:43690
> > --
> > ESTAB      2226   0      127.0.0.1:x11                  127.0.0.1:59792
> > skmem:(r4608,rb1061488,t0,tb2626560,f3584,w0,o0,bl0,d1) cubic
> > wscale:7,7 rto:200 ato:40 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> > bytes_received:2226 segs_out:17 segs_in:24 lastsnd:3893 lastrcv:3893
> > lastack:3883 rcv_space:43690
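
(Reading the skmem fields above: r/rb are receive-queue usage and
limit, t/tb are transmit usage and limit -- note tb10000 matches the
doubled SO_SNDBUF -- f is forward-allocated memory, w is queued
transmit memory, and d counts drops.)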
> >
> >
> > When sending larger messages, the EAGAIN is only seen after 2116 writes:
> > # ./tcp_client_server 108
> > using data size 108
> > client send buffer size 5000
> > client socket snd_buf: 10000
> > ERRNO = 11 count=2116
> >
> > Again, ss shows all the data present on the server side (108 *
> > 2116 = 228528 bytes):
> > ESTAB      228528 0      127.0.0.1:x11                  127.0.0.1:59830
> > skmem:(r976896,rb1061488,t0,tb2626560,f2048,w0,o0,bl0,d1) cubic
> > wscale:7,7 rto:200 ato:80 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> > bytes_received:228528 segs_out:436 segs_in:2119 lastsnd:3615
> > lastrcv:3606 lastack:3596 rcv_rtt:1 rcv_space:43690
> > --
> > ESTAB      0      0      127.0.0.1:59830                127.0.0.1:x11
> > skmem:(r0,rb1061296,t0,tb10000,f0,w0,o0,bl0,d0) cubic wscale:7,7
> > rto:206 rtt:5.016/9.996 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> > bytes_acked:228529 segs_out:2119 segs_in:437 send 349.1Mbps
> > lastsnd:3596 lastrcv:771704718 lastack:3566 pacing_rate 698.1Mbps
> > retrans:0/1 rcv_space:43690
> >
> >
> >
> > When I enlarge the SO_SNDBUF on the client side to, for example,
> > 10000, I see that more messages can be sent:
> > # ./tcp_client_server 106 10000
> > using data size 106
> > client send buffer size 10000
> > client socket snd_buf: 20000
> > ERRNO = 11 count=1291
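
(The kernel's doubling of the requested SO_SNDBUF, mentioned above,
can be reproduced in isolation -- a minimal sketch; on Linux,
getsockopt() returns twice the value passed to setsockopt():)

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int val = 5000;
    socklen_t len = sizeof(val);

    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
    printf("snd_buf: %d\n", val);   /* prints 10000: the kernel doubled it */
    return 0;
}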
> >
> > I also captured the packets on the interface using wireshark, and it
> > looks like the error is returned right after the TCP layer has
> > retransmitted (after ~10ms) the last packet sent on the connection:
> > in the trace below, segment 1965:2065 goes out at .186632, is
> > retransmitted at .196064, and the receiver's final ACK carries a
> > SACK block for 1965:2065.
> >
> > 10:12:38.186451 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1165:1265, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186461 IP localhost.etlservicemgr > localhost.48470: Flags
> > [.], ack 1265, win 342, options [nop,nop,TS val 593860562 ecr
> > 593860562], length 0
> > 10:12:38.186478 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1265:1365, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186488 IP localhost.etlservicemgr > localhost.48470: Flags
> > [.], ack 1365, win 342, options [nop,nop,TS val 593860562 ecr
> > 593860562], length 0
> > 10:12:38.186505 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1365:1465, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186516 IP localhost.etlservicemgr > localhost.48470: Flags
> > [.], ack 1465, win 342, options [nop,nop,TS val 593860562 ecr
> > 593860562], length 0
> > 10:12:38.186533 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1465:1565, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186543 IP localhost.etlservicemgr > localhost.48470: Flags
> > [.], ack 1565, win 342, options [nop,nop,TS val 593860562 ecr
> > 593860562], length 0
> > 10:12:38.186560 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1565:1665, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186578 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1665:1765, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186595 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1765:1865, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186615 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1865:1965, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.186632 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1965:2065, ack 165, win 342, options [nop,nop,TS val
> > 593860562 ecr 593860562], length 100
> > 10:12:38.196064 IP localhost.48470 > localhost.etlservicemgr: Flags
> > [P.], seq 1965:2065, ack 165, win 342, options [nop,nop,TS val
> > 593860572 ecr 593860562], length 100
> > 10:12:38.196128 IP localhost.etlservicemgr > localhost.48470: Flags
> > [.], ack 2065, win 342, options [nop,nop,TS val 593860572 ecr
> > 593860562,nop,nop,sack 1 {1965:2065}], length 0
> >
> > Now my question is: how is it possible that I can only send 21
> > messages of 106 bytes before I see the EAGAIN, while with larger
> > messages I can send more than 2000?
>
>
> This is because the kernel tracks kernel memory usage.
>
> Small packets incur more overhead.
>
> The doubling of the SO_SNDBUF user value is a heuristic based on the
> fact that the kernel uses a 2x overhead estimation, while the
> effective overhead can vary from ~1.001x to ~768x depending on the
> number of bytes per skb.
>

Hi Eric,

Thanks for the feedback, but it is not completely clear to me.
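
If I understand the accounting correctly, it behaves roughly like the
toy model below, where each queued skb is charged its full "truesize"
against sk_sndbuf (the per-skb overhead figure is a guess, and the
model ignores that ACKed skbs are freed):

#include <stdio.h>

int main(void)
{
    int sk_sndbuf = 10000;   /* SO_SNDBUF 5000, doubled by the kernel */
    int payload = 106;       /* bytes per write() */
    int overhead = 320;      /* hypothetical per-skb overhead */
    int wmem_queued = 0, writes = 0;

    /* Charge payload plus sk_buff/allocation overhead per write until
     * the send-buffer budget is exhausted. */
    while (wmem_queued + payload + overhead <= sk_sndbuf) {
        wmem_queued += payload + overhead;
        writes++;
    }
    printf("EAGAIN after ~%d writes (wmem_queued=%d)\n", writes, wmem_queued);
    return 0;
}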

The small SNDBUF is on the client side, and a large part of the data
has already been ACKed by the receiving side.
Only the last 5 or 6 messages are waiting to be ACKed.
So I would expect the biggest part of the overhead to be located on
the receiving side, where there is sufficient memory space (the
default RCVBUF size is used).
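
(One way to verify that at the moment of the EAGAIN would be SIOCOUTQ,
which for a TCP socket reports the bytes not yet acknowledged -- a
minimal helper, assuming tcp_ioctl() semantics:)

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sockios.h>

/* Call right after write() returns EAGAIN: for TCP, SIOCOUTQ reports
 * write_seq - snd_una, i.e. bytes queued but not yet ACKed. */
static void dump_unacked(int fd)
{
    int unacked = 0;

    if (ioctl(fd, SIOCOUTQ, &unacked) == 0)
        printf("unacked bytes: %d\n", unacked);
}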

If the 5 queued packets on the sending side are what cause the EAGAIN,
then the real question may be why the receiving side does not send the
ACK within the 10ms, while for earlier messages the ACK was sent much
sooner.

Do you have any clue which kernel function generates this EAGAIN?
And what would be the correct solution to this problem?

Best regards,
Ronny
