Message-ID: <20190425161643.0eef0009@pluto.restena.lu>
Date:   Thu, 25 Apr 2019 16:16:43 +0200
From:   Bruno Prémont <bonbons@...ophe.eu>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Richard Purdie <richard.purdie@...uxfoundation.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Alexander Kanavin <alex.kanavin@...il.com>,
        Bruce Ashfield <bruce.ashfield@...il.com>
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

On Thu, 25 Apr 2019 06:30:24 -0700 Eric Dumazet wrote:
> On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont wrote:
> > On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:  
> > > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:  
> > > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:  
> > > > > I'm seeing issues with this patch as well, not as regularly as
> > > > > Richard does, but still fairly often (roughly one in 30-50 TCP sessions).
> > > > >
> > > > > In my case I have a virtual machine (on VMware) with this patch where
> > > > > NGINX as a reverse proxy misses part (the end) of the payload from its
> > > > > upstream and times out on the upstream connection (while, according to
> > > > > tcpdump, all packets including the upstream's FIN were sent and the
> > > > > upstream did get ACKs from the VM).
> > > > >
> > > > > From what browsers get from NGINX it feels as if at some point reading
> > > > > from the socket or waiting for data using select() never returned data
> > > > > that had already arrived, since more than just the EOF is missing.
> > > > >
> > > > > The upstream is a hardware machine in the same subnet.
> > > > >
> > > > > My VM is using a VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > > as its network adapter, which lists the following features:
> > > > >  
> > > >
> > > > Hi Bruno.
> > > >
> > > > I suspect an EPOLLIN notification being lost by the application.
> > > >
> > > > The fact that the TCP backlog contains 1 instead of 2+ packets should not
> > > > change stack behavior; this packet should land in the socket receive
> > > > queue eventually.
> > > >
> > > > Are you using epoll() in edge-triggered mode? You mention select(), but
> > > > select() is a rather old and inefficient API.
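
As an aside, here is a minimal sketch (for illustration only, not taken
from nginx) of the edge-triggered contract in question: with EPOLLET the
reader has to drain the socket until read() returns -1 with EAGAIN (or 0
at EOF) before going back to epoll_wait(), otherwise it relies on the
kernel raising a fresh edge for data that is already queued.

  /* Sketch of edge-triggered epoll usage; a fragment, not nginx code. */
  #include <errno.h>
  #include <string.h>
  #include <sys/epoll.h>
  #include <unistd.h>

  /* Register a non-blocking fd for edge-triggered readiness. */
  static int watch_fd(int epfd, int fd)
  {
      struct epoll_event ev;

      memset(&ev, 0, sizeof(ev));
      ev.events = EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
      ev.data.fd = fd;
      return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
  }

  /* Drain everything currently readable; returns 1 once EOF (the peer's
   * FIN) has been consumed, 0 when the buffer is empty again, -1 on error. */
  static int drain(int fd)
  {
      char buf[4096];

      for (;;) {
          ssize_t n = read(fd, buf, sizeof(buf));

          if (n > 0)
              continue;            /* consume payload; real code would parse it */
          if (n == 0)
              return 1;            /* EOF: the FIN has reached the application */
          if (errno == EAGAIN || errno == EWOULDBLOCK)
              return 0;            /* drained; wait for the next edge */
          if (errno == EINTR)
              continue;
          return -1;
      }
  }

  /* Event loop skeleton: one wakeup per edge, then drain completely. */
  static void loop(int epfd)
  {
      struct epoll_event events[64];

      for (;;) {
          int i, n = epoll_wait(epfd, events, 64, -1);

          for (i = 0; i < n; i++) {
              if (events[i].events & (EPOLLIN | EPOLLRDHUP))
                  if (drain(events[i].data.fd) == 1)
                      close(events[i].data.fd);
          }
      }
  }
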
> > >
> > > nginx is using epoll (cf. http://nginx.org/en/docs/events.html)
> > >
> > > For source, see here
> > > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70  
> >
> > I've been trying to reproduce while running nginx under strace (with
> > verbose logging), but have not been successful. It looks like tracing
> > modifies the timing in a way that prevents the issue.
> > Without verbose logging I could hit it (though it needed way more perseverance).
> >
> > What I do have for a working session under strace is:
> >   socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
> >   ioctl(12, FIONBIO, [1]) = 0
> >   epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
> >   connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
> >   epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
> >   getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >   writev(12, [{""..., 830}, {""..., 115}], 2) = 945
> >   epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
> >   recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 4096
> >   readv(12, [{""..., 4096}], 1) = 3855
> >   epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
> >   readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
> >   close(12)               = 0
> >
> > For a failing session I got:
> >   socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
> >   ioctl(6, FIONBIO, [1])  = 0
> >   epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
> >   connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
> >   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> >   getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> >   writev(6, [{""..., 830}, {""..., 120}], 2) = 950
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
> >   recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   readv(6, [{""..., 2356}], 1) = 2356
> >   readv(6, [{""..., 4096}], 1) = 3362
> >   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 4096
> >   readv(6, [{""..., 4096}], 1) = 3619
> >   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
> >   ... some other socket with activity and epoll_wait calls for 50 seconds
> >   epoll_wait(86, [], 512, 3331) = 0
> >   close(6)                = 0
> >
> > So the EPOLLRDHUP notification never seems to arrive here.
> >
> >
> > As a side note, in the strace output (over lots of sessions) I see almost no
> > readv() calls returning -1 EAGAIN, except for the first readv() following recvfrom().
> >
> >
> > I hope this helps in looking for the cause, along with the fact that not
> > collapsing FIN packets also prevents the issue.
> 
> Very interesting Bruno, thanks a lot for all these details.
> 
> I wonder if you have some existing epoll notification bug that would
> trigger if the last packet received from the peer contains both data
> and FIN.
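
For completeness, a hypothetical reproducer sketch (not something used in
this thread): a throwaway upstream that writes its whole reply and closes
immediately, so the final data and the FIN leave back to back and often
end up in the same segment. Whether they actually do depends on timing and
segmentation, so it is worth confirming with tcpdump; the port and the
HTTP reply below are assumptions for illustration.

  /* Hypothetical upstream: answer and close straight away. */
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      const char reply[] =
          "HTTP/1.1 200 OK\r\nContent-Length: 7\r\n"
          "Connection: close\r\n\r\nabcdefg";
      struct sockaddr_in addr;
      int one = 1;
      int ls = socket(AF_INET, SOCK_STREAM, 0);

      setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(8080);          /* same port as in the traces */
      bind(ls, (struct sockaddr *)&addr, sizeof(addr));
      listen(ls, 16);

      for (;;) {
          char buf[4096];
          int c = accept(ls, NULL, NULL);

          if (c < 0)
              continue;
          read(c, buf, sizeof(buf));        /* swallow the request */
          write(c, reply, sizeof(reply) - 1);
          close(c);                         /* FIN follows right behind the data */
      }
  }
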

I just took a tcpdump on the nginx host; the dump contains a TCP session
for which the issue occurred, as well as a few working ones (I can't
share it in public though, unless I could anonymise the payload).
For almost all connections in the dump the last packet from the upstream
carries payload as well as the FIN flag, and the same holds for the
failing one. So a final packet carrying both data and FIN does not by
itself trigger the issue (and the amount of data probably doesn't matter
either, as a connection ending with 6 bytes worked while the failing one
had 7), if it plays a role at all.

The dump doesn't reveal anything that catches my attention though; at
best a difference in timing as to when the nginx host sends its ACKs.

> Were the "ss" outputs taken from the nginx side?

Yes, they were.

Cheers,
Bruno

> > > > Could you watch/report the output of "ss -temoi" for the frozen TCP flow?
> > >
> > > Here it is, three distinct reproducing attempts:
> > > State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> > > ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> > >
> > >
> > > ESTAB      0      0       2001:a18:1:6::228:33572     2001:a18:1:6::217:webcache    uid:83 ino:16699 sk:e1 <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> > >
> > >
> > > ESTAB      0      0       158.64.2.228:44578          158.64.2.217:webcache         uid:83 ino:17628 sk:12c <->
> > >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> > >
> > >
> > > From nginx debug logging I don't get a real clue, though it seems that for
> > > working connections the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND |
> > > EPOLLWRNORM | EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) -
> > > the previous ones are 5 - while for failing connections the last event seen
> > > is 5 (EPOLLIN | EPOLLOUT).
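
Side note: the expansion above reads the value as decimal; if the nginx
debug log prints the event mask in hex (its epoll module appears to log
it with a hex format), 0x2005 would instead decode to EPOLLIN | EPOLLOUT |
EPOLLRDHUP and 0x5 to EPOLLIN | EPOLLOUT, which would fit EPOLLRDHUP never
showing up for the failing connections. A tiny decoder for such masks,
assuming a hex input:

  /* Decode an epoll event mask given as a hex argument, e.g. "2005". */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/epoll.h>

  static const struct { uint32_t bit; const char *name; } bits[] = {
      { EPOLLIN,     "EPOLLIN"     },
      { EPOLLPRI,    "EPOLLPRI"    },
      { EPOLLOUT,    "EPOLLOUT"    },
      { EPOLLERR,    "EPOLLERR"    },
      { EPOLLHUP,    "EPOLLHUP"    },
      { EPOLLRDNORM, "EPOLLRDNORM" },
      { EPOLLRDBAND, "EPOLLRDBAND" },
      { EPOLLWRNORM, "EPOLLWRNORM" },
      { EPOLLWRBAND, "EPOLLWRBAND" },
      { EPOLLMSG,    "EPOLLMSG"    },
      { EPOLLRDHUP,  "EPOLLRDHUP"  },
  };

  int main(int argc, char **argv)
  {
      uint32_t mask = argc > 1 ? (uint32_t)strtoul(argv[1], NULL, 16) : 0;
      unsigned int i;

      for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
          if (mask & bits[i].bit)
              printf("%s ", bits[i].name);
      printf("\n");
      return 0;
  }
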
> > >  
> > > > This might give us a clue about packets being dropped, say if the
> > > > accumulated packet became too big.
> > >
> > >
> > > The following minor patch (which might be white-space mangled) prevents the
> > > issue from happening for me:
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 4904250a9aac..c102cd367c79 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> > >         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> > >             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> > >             ((TCP_SKB_CB(tail)->tcp_flags |
> > > -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > > +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> > >             ((TCP_SKB_CB(tail)->tcp_flags ^
> > >               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> > >  #ifdef CONFIG_TLS_DEVICE
> > >
> > > Cheers,
> > > Bruno
