Date:   Fri, 10 Jun 2022 10:46:58 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Ronny Meeus <ronny.meeus@...il.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: TCP socket send return EAGAIN unexpectedly when sending small
 fragments


On 6/10/22 08:16, Ronny Meeus wrote:
> On Fri, Jun 10, 2022 at 16:21, Eric Dumazet <erdnetdev@...il.com> wrote:
>>
>> On 6/10/22 05:48, Ronny Meeus wrote:
>>> Hello
>>>
>>> I have a small test application written in C that creates a local
>>> (in-process) TCP channel via loopback.
>>> (The kernel version is 3.10.0, but the same issue is also seen on,
>>> for example, a 4.9 kernel.)
>>>
>>> On the client side, an SO_SNDBUF of 5000 bytes is configured (which
>>> the kernel doubles), while on the server side the default size is
>>> used.
>>> Both client and server side are running in non-blocking mode.
>>>
>>> The server side is not reading its socket so all data is simply queued
>>> in the socket's receive buffer.
>>> On the client side, the client is writing data into the socket in a
>>> loop until the write returns an error (EAGAIN in this case).
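>>>
>>> In outline, the client side does roughly this (a simplified sketch
>>> reconstructed from the description above; error handling trimmed,
>>> and the server side, which just listens and never reads, is not
>>> shown; the x11 label in the ss output below corresponds to port
>>> 6000):
>>>
>>> #include <arpa/inet.h>
>>> #include <errno.h>
>>> #include <fcntl.h>
>>> #include <netinet/in.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <sys/socket.h>
>>> #include <sys/types.h>
>>> #include <unistd.h>
>>>
>>> int main(void)
>>> {
>>>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>>>     int sndbuf = 5000;          /* the kernel doubles this to 10000 */
>>>     setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
>>>
>>>     struct sockaddr_in addr = { .sin_family = AF_INET,
>>>                                 .sin_port   = htons(6000) };
>>>     inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
>>>     connect(fd, (struct sockaddr *)&addr, sizeof(addr));
>>>
>>>     /* non-blocking: write() returns EAGAIN instead of blocking */
>>>     fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
>>>
>>>     char buf[106];
>>>     memset(buf, 'x', sizeof(buf));
>>>     int count = 0;
>>>     while (write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf))
>>>         count++;
>>>     printf("ERRNO = %d count=%d\n", errno, count);
>>>     return 0;
>>> }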
>>>
>>> Depending on the size of the data I send on this socket, I get the
>>> EAGAIN on the client side after only a small number of messages.
>>> For example when sending 106 byte messages, I see the first EAGAIN
>>> after 21 write calls:
>>> # ./tcp_client_server 106
>>> using data size 106
>>> client send buffer size 5000
>>> client socket snd_buf: 10000
>>> ERRNO = 11 count=21
>>>
>>> The same is observed for all sizes smaller than or equal to 107 bytes.
>>>
>>> When getting the socket stats using ss, I see all data (2226 bytes)
>>> pending in the socket's receive queue on the server side:
>>> # ss -tmi  | grep -i X11 -A 1
>>> ESTAB      0      0      127.0.0.1:59792                127.0.0.1:x11
>>> skmem:(r0,rb1061296,t0,tb10000,f0,w0,o0,bl0,d0) cubic wscale:7,7
>>> rto:202 rtt:1.276/2.504 mss:21888 rcvmss:536 advmss:65483 cwnd:10
>>> bytes_acked:2227 segs_out:24 segs_in:18 send 1372.3Mbps lastsnd:3883
>>> lastrcv:771546999 lastack:3883 pacing_rate 2744.3Mbps retrans:0/1
>>> rcv_space:43690
>>> --
>>> ESTAB      2226   0      127.0.0.1:x11                  127.0.0.1:59792
>>> skmem:(r4608,rb1061488,t0,tb2626560,f3584,w0,o0,bl0,d1) cubic
>>> wscale:7,7 rto:200 ato:40 mss:21888 rcvmss:536 advmss:65483 cwnd:10
>>> bytes_received:2226 segs_out:17 segs_in:24 lastsnd:3893 lastrcv:3893
>>> lastack:3883 rcv_space:43690
>>>
>>>
>>> When sending larger messages, the EAGAIN is only seen after 2116 writes.
>>> # ./tcp_client_server 108
>>> using data size 108
>>> client send buffer size 5000
>>> client socket snd_buf: 10000
>>> ERRNO = 11 count=2116
>>>
>>> Again, ss shows all data being present on the server side (108 *
>>> 2116 = 228528):
>>> ESTAB      228528 0      127.0.0.1:x11                  127.0.0.1:59830
>>> skmem:(r976896,rb1061488,t0,tb2626560,f2048,w0,o0,bl0,d1) cubic
>>> wscale:7,7 rto:200 ato:80 mss:21888 rcvmss:536 advmss:65483 cwnd:10
>>> bytes_received:228528 segs_out:436 segs_in:2119 lastsnd:3615
>>> lastrcv:3606 lastack:3596 rcv_rtt:1 rcv_space:43690
>>> --
>>> ESTAB      0      0      127.0.0.1:59830                127.0.0.1:x11
>>> skmem:(r0,rb1061296,t0,tb10000,f0,w0,o0,bl0,d0) cubic wscale:7,7
>>> rto:206 rtt:5.016/9.996 mss:21888 rcvmss:536 advmss:65483 cwnd:10
>>> bytes_acked:228529 segs_out:2119 segs_in:437 send 349.1Mbps
>>> lastsnd:3596 lastrcv:771704718 lastack:3566 pacing_rate 698.1Mbps
>>> retrans:0/1 rcv_space:43690
>>>
>>>
>>>
>>> When I enlarge the SNDBUF on the client side to, for example, 10K, I
>>> see that more messages can be sent:
>>> # ./tcp_client_server 106 10000
>>> using data size 106
>>> client send buffer size 10000
>>> client socket snd_buf: 20000
>>> ERRNO = 11 count=1291
>>>
>>> I also captured the packets on the interface using Wireshark, and it
>>> looks like the error is returned after the TCP layer has done a
>>> retransmit (after 10 ms) of the last packet sent on the connection:
>>>
>>> 10:12:38.186451 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1165:1265, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186461 IP localhost.etlservicemgr > localhost.48470: Flags
>>> [.], ack 1265, win 342, options [nop,nop,TS val 593860562 ecr
>>> 593860562], length 0
>>> 10:12:38.186478 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1265:1365, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186488 IP localhost.etlservicemgr > localhost.48470: Flags
>>> [.], ack 1365, win 342, options [nop,nop,TS val 593860562 ecr
>>> 593860562], length 0
>>> 10:12:38.186505 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1365:1465, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186516 IP localhost.etlservicemgr > localhost.48470: Flags
>>> [.], ack 1465, win 342, options [nop,nop,TS val 593860562 ecr
>>> 593860562], length 0
>>> 10:12:38.186533 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1465:1565, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186543 IP localhost.etlservicemgr > localhost.48470: Flags
>>> [.], ack 1565, win 342, options [nop,nop,TS val 593860562 ecr
>>> 593860562], length 0
>>> 10:12:38.186560 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1565:1665, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186578 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1665:1765, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186595 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1765:1865, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186615 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1865:1965, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.186632 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1965:2065, ack 165, win 342, options [nop,nop,TS val
>>> 593860562 ecr 593860562], length 100
>>> 10:12:38.196064 IP localhost.48470 > localhost.etlservicemgr: Flags
>>> [P.], seq 1965:2065, ack 165, win 342, options [nop,nop,TS val
>>> 593860572 ecr 593860562], length 100
>>> 10:12:38.196128 IP localhost.etlservicemgr > localhost.48470: Flags
>>> [.], ack 2065, win 342, options [nop,nop,TS val 593860572 ecr
>>> 593860562,nop,nop,sack 1 {1965:2065}], length 0
>>>
>>> Now my question is: how is it possible that I can send only 21
>>> messages of 106 bytes before I see the EAGAIN, while with larger
>>> messages I can send 2K+ messages?
>>
>> This is because the kernel tracks kernel memory usage, and small
>> packets incur more overhead.
>>
>> The doubling of the SO_SNDBUF user value is a heuristic based on the
>> kernel using a 2x overhead estimate, while the effective overhead can
>> vary from ~1.001x to ~768x depending on the number of bytes per skb.
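>>
>> Back-of-envelope (approximate x86-64 numbers; the kernel computes
>> this with SKB_TRUESIZE(), and the exact sizes vary by version):
>>
>> #include <stdio.h>
>>
>> #define SKB_STRUCT   256  /* struct sk_buff, rounded up     */
>> #define SHARED_INFO  320  /* struct skb_shared_info         */
>>
>> /* kmalloc() rounds the linear data area up to a power-of-two bucket */
>> static size_t bucket(size_t n)
>> {
>>     size_t b = 32;
>>     while (b < n)
>>         b <<= 1;
>>     return b;
>> }
>>
>> int main(void)
>> {
>>     /* large sends keep payload in page frags instead, so their
>>      * overhead tends toward ~1x; this models small linear skbs */
>>     for (size_t payload = 1; payload <= 4096; payload <<= 2) {
>>         size_t truesize = SKB_STRUCT + bucket(payload + SHARED_INFO);
>>         printf("payload %4zu -> truesize ~%4zu (~%.0fx)\n",
>>                payload, truesize, (double)truesize / payload);
>>     }
>>     return 0;
>> }
>>
>> For 106-byte writes this gives roughly 768 bytes of truesize per skb,
>> so a 10000-byte send budget (tb10000 above) holds only about 13
>> unacked skbs carrying ~1.4 KB of payload.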
>>
> Hi Eric,
>
> Thanks for the feedback, but it is not completely clear to me.
>
> The small SNDBUF is on the client side, and a large part of the data
> has already been ACKed by the receiving side.
> Only the last 5 or 6 messages are waiting to be ACKed.
> So I would expect the biggest part of the overhead to be located on
> the receiving side, where there is sufficient memory space (the
> default RCVBUF size is used).
>
> If the 5 queued packets on the sending side cause the EAGAIN issue,
> the real question may be why the receiving side does not send the ACK
> within 10 ms, while for earlier messages the ACK was sent much
> sooner.


Delayed ACKs on the receiver, when receiving small packet(s).
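
For what it's worth, to test this, the receiver can request immediate
ACKs with TCP_QUICKACK; the flag is not sticky (the kernel may revert
to delayed ACKs), so it has to be re-armed around reads. A sketch:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Request immediate ACKs on the (server-side) connected socket.
 * TCP_QUICKACK is not permanent: the kernel can fall back to
 * delayed ACKs, so real code re-arms this around reads. */
static void quickack(int fd)
{
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}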


>
> Do you have any clue which kernel function generates this EAGAIN?
> And what would be the correct solution to this problem?


I do not really see what problem you have.

Perhaps you should not set SO_SNDBUF at all, and let the kernel decide
(autotuning and autocorking).
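
That is, drop the setsockopt(SO_SNDBUF) call entirely. You can still
read back what autotuning picked, e.g.:

#include <stdio.h>
#include <sys/socket.h>

/* With no SO_SNDBUF set, the send buffer autotunes within the
 * net.ipv4.tcp_wmem limits, and autocorking can merge small writes
 * into fewer, cheaper skbs. */
static void show_sndbuf(int fd)
{
        int val;
        socklen_t len = sizeof(val);
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
        printf("snd_buf: %d\n", val);
}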


>
> Best regards,
> Ronny
>
