netdev - Re: TCP socket send return EAGAIN unexpectedly when sending small fragments

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAMJ=MEdP0eYVEUT+6LS7O8Mj5NMXzdOP_gfRtsc8_JubV+7oOA@mail.gmail.com>
Date:   Fri, 10 Jun 2022 20:21:03 +0200
From:   Ronny Meeus <ronny.meeus@...il.com>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: TCP socket send return EAGAIN unexpectedly when sending small fragments

Op vr 10 jun. 2022 om 19:47 schreef Eric Dumazet <eric.dumazet@...il.com>:
>
>
> On 6/10/22 08:16, Ronny Meeus wrote:
> > Op vr 10 jun. 2022 om 16:21 schreef Eric Dumazet <erdnetdev@...il.com>:
> >>
> >> On 6/10/22 05:48, Ronny Meeus wrote:
> >>> Hello
> >>>
> >>> I have a small test application written in C that creates a local (in
> >>> process) TCP channel via loopback.
> >>> (kernel version is 3.10.0 but the same issue is also seen for example
> >>> on a 4.9 kernel).
> >>>
> >>> On the client side, an SO_SNDBUF of 5Kb is configured (which is
> >>> doubled by the kernel) while on the server side the default size is
> >>> used.
> >>> Both client and server side are running in non-blocking mode.
> >>>
> >>> The server side is not reading its socket so all data is simply queued
> >>> in the socket's receive buffer.
> >>> On the client side, the client is writing data into the socket in a
> >>> loop until the write returns an error (EAGAIN in this case).
> >>>
> >>> Depending on the size of data I send on this socket, I get the EAGAIN
> >>> on the client side already after a small number of messages.
> >>> For example when sending 106 byte messages, I see the first EAGAIN
> >>> after 21 write calls:
> >>> # ./tcp_client_server 106
> >>> using data size 106
> >>> client send buffer size 5000
> >>> client socket snd_buf: 10000
> >>> ERRNO = 11 count=21
> >>>
> >>> The same is observed for all sizes smaller than or equal to 107 bytes.
> >>>
> >>> When getting the socket stats using ss I see all data (2226b) pending
> >>> in the socket on the server side:
> >>> # ss -tmi  | grep -i X11 -A 1
> >>> ESTAB      0      0      127.0.0.1:59792                127.0.0.1:x11
> >>> skmem:(r0,rb1061296,t0,tb10000,f0,w0,o0,bl0,d0) cubic wscale:7,7
> >>> rto:202 rtt:1.276/2.504 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> >>> bytes_acked:2227 segs_out:24 segs_in:18 send 1372.3Mbps lastsnd:3883
> >>> lastrcv:771546999 lastack:3883 pacing_rate 2744.3Mbps retrans:0/1
> >>> rcv_space:43690
> >>> --
> >>> ESTAB      2226   0      127.0.0.1:x11
> >>> 127.0.0.1:59792
> >>> skmem:(r4608,rb1061488,t0,tb2626560,f3584,w0,o0,bl0,d1) cubic
> >>> wscale:7,7 rto:200 ato:40 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> >>> bytes_received:2226 segs_out:17 segs_in:24 lastsnd:3893 lastrcv:3893
> >>> lastack:3883 rcv_space:43690
> >>>
> >>>
> >>> When sending larger messages, the EAGAIN only is seen after 2116 writes.
> >>> # ./tcp_client_server 108
> >>> using data size 108
> >>> client send buffer size 5000
> >>> client socket snd_buf: 10000
> >>> ERRNO = 11 count=2116
> >>>
> >>> Again, the ss shows all data being present on the server side (108 *
> >>> 2116 = 228528)
> >>> ESTAB      228528 0      127.0.0.1:x11
> >>> 127.0.0.1:59830
> >>> skmem:(r976896,rb1061488,t0,tb2626560,f2048,w0,o0,bl0,d1) cubic
> >>> wscale:7,7 rto:200 ato:80 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> >>> bytes_received:228528 segs_out:436 segs_in:2119 lastsnd:3615
> >>> lastrcv:3606 lastack:3596 rcv_rtt:1 rcv_space:43690
> >>> --
> >>> ESTAB      0      0      127.0.0.1:59830                127.0.0.1:x11
> >>> skmem:(r0,rb1061296,t0,tb10000,f0,w0,o0,bl0,d0) cubic wscale:7,7
> >>> rto:206 rtt:5.016/9.996 mss:21888 rcvmss:536 advmss:65483 cwnd:10
> >>> bytes_acked:228529 segs_out:2119 segs_in:437 send 349.1Mbps
> >>> lastsnd:3596 lastrcv:771704718 lastack:3566 pacing_rate 698.1Mbps
> >>> retrans:0/1 rcv_space:43690
> >>>
> >>>
> >>>
> >>> When I enlarge the SNDBUF on the client side to for example 10K, I see
> >>> that more messages can be sent:
> >>> # ./tcp_client_server 106 10000
> >>> using data size 106
> >>> client send buffer size 10000
> >>> client socket snd_buf: 20000
> >>> ERRNO = 11 count=1291
> >>>
> >>> I also captured the packets on the interface using wireshark and it
> >>> looks like the error is returned after the TCP layer has done a
> >>> retransmit (after 10ms) of the last packet sent on the connection.
> >>>
> >>> 10:12:38.186451 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1165:1265, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186461 IP localhost.etlservicemgr > localhost.48470: Flags
> >>> [.], ack 1265, win 342, options [nop,nop,TS val 593860562 ecr
> >>> 593860562], length 0
> >>> 10:12:38.186478 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1265:1365, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186488 IP localhost.etlservicemgr > localhost.48470: Flags
> >>> [.], ack 1365, win 342, options [nop,nop,TS val 593860562 ecr
> >>> 593860562], length 0
> >>> 10:12:38.186505 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1365:1465, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186516 IP localhost.etlservicemgr > localhost.48470: Flags
> >>> [.], ack 1465, win 342, options [nop,nop,TS val 593860562 ecr
> >>> 593860562], length 0
> >>> 10:12:38.186533 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1465:1565, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186543 IP localhost.etlservicemgr > localhost.48470: Flags
> >>> [.], ack 1565, win 342, options [nop,nop,TS val 593860562 ecr
> >>> 593860562], length 0
> >>> 10:12:38.186560 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1565:1665, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186578 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1665:1765, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186595 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1765:1865, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186615 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1865:1965, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.186632 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1965:2065, ack 165, win 342, options [nop,nop,TS val
> >>> 593860562 ecr 593860562], length 100
> >>> 10:12:38.196064 IP localhost.48470 > localhost.etlservicemgr: Flags
> >>> [P.], seq 1965:2065, ack 165, win 342, options [nop,nop,TS val
> >>> 593860572 ecr 593860562], length 100
> >>> 10:12:38.196128 IP localhost.etlservicemgr > localhost.48470: Flags
> >>> [.], ack 2065, win 342, options [nop,nop,TS val 593860572 ecr
> >>> 593860562,nop,nop,sack 1 {1965:2065}], length 0
> >>>
> >>> Now my question is: how is it possible that I can only send 21
> >>> messages of 106 bytes before I see the EAGAIN while when sending
> >>> larger messages I can send 2K+ messages.
> >>
> >> This is because kernel tracks kernel memory usage.
> >>
> >> Small packets incur more overhead.
> >>
> >> The doubling of SO_SNDBUF user value, is an heuristic based on the fact
> >> the kernel
> >>
> >> uses a 2x overhead estimation,
> >>
> >> while effective overhead can vary from ~1.001x to ~768x depending on
> >> number of bytes per skb.
> >>
> > Hi Eric,
> >
> > thanks for the feedback but it is not completely clear to me.
> >
> > The small SNDBUF is on the client side and a large part of the data
> > has already ACKed by the receiving side.
> > Only the last 5 or 6 messages are waiting to be ACKed.
> > So I would expect that the biggest part of the overhead is located on
> > the receiving side where there is sufficient memory space (default
> > RCVBUF size is used).
> >
> > If the 5 queued packets on the sending side would cause the EAGAIN
> > issue, the real question maybe is why the receiving side is not
> > sending the ACK within the 10ms while for earlier messages the ACK is
> > sent much sooner.
>
>
> delayed acks on receiver, when receiving small packet(s)
>

Great feedback.

>
> >
> > Do you have any clue which kernel function does generate this EAGAIN?
> > And what would be the correct solution to this problem?
>
>
> I do not really see what is the problem you have.

The actual problem I have is that the application behavior is
completely different depending on the message size.
I created the test application to be able to investigate something we
see in the real-application and that is that libevent either queues a
lot or does not queue anything.
Libevent starts to queue data as soon as it sees the EAGAIN error.

>
> Perhaps you should not set SO_SNDBUF, and let the kernel decide (auto
> tune, and auto cork)
>

I think this might indeed be a solution. Just remove the SO_SNDBUF, or
at least use a much larger value.

> >
> > Best regards,
> > Ronny
> >