netdev - Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190425151326.305b5c70@pluto.restena.lu>
Date:   Thu, 25 Apr 2019 15:13:26 +0200
From:   Bruno Prémont <bonbons@...ophe.eu>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     richard.purdie@...uxfoundation.org,
        Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Alexander Kanavin <alex.kanavin@...il.com>,
        Bruce Ashfield <bruce.ashfield@...il.com>
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

Hi Eric,

On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> Hi Eric,
> 
> On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:  
> > >
> > > Hi Eric,
> > >
> > > I'm seeing issues with this patch as well, not as regular as for
> > > Richard but still (about up to one in 30-50 TCP sessions).
> > >
> > > In my case I have a virtual machine (on VMWare) with this patch where
> > > NGINX as reverse proxy misses part (end) of payload from its upstream
> > > and times out on the upstream connection (while according to tcpdump all
> > > packets including upstream's FIN were sent and the upstream did get
> > > ACKs from the VM).
> > >
> > > From when browsers get from NGINX it feels as if at some point reading
> > > from the socket or waiting for data using select() never returned data
> > > that arrived as more than just EOF is missing.
> > >
> > > The upstream is a hardware machine in the same subnet.
> > >
> > > My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > as network adapter which lists the following features:
> > >    
> > 
> > Hi Bruno.
> > 
> > I suspect a EPOLLIN notification being lost by the application.
> > 
> > Fact that TCP backlog contains 1 instead of 2+ packets should not
> > change stack behavior,
> > this packet should land into socket receive queue eventually.
> > 
> > Are you using epoll() in Edge Trigger mode. You mention select() but
> > select() is a rather old and inefficient API.  
> 
> nginx is using epoll (c.f. http://nginx.org/en/docs/events.html)
> 
> For source, see here
> https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70

I've been trying to reproduce while running nginx under strace (with verbose logging), but
have not been successful. Looks like tracing modifies the timing in a
way to prevent the issue.
Without verbose logging I could hit it (though needed way more perseverence).

What I do have for a working session under strace is:
  socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
  ioctl(12, FIONBIO, [1]) = 0
  epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
  connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
  epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
  getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  writev(12, [{""..., 830}, {""..., 115}], 2) = 945
  epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
  recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 4096
  readv(12, [{""..., 4096}], 1) = 3855
  epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
  readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
  close(12)               = 0

Under failing session I got:
  socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
  ioctl(6, FIONBIO, [1])  = 0
  epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
  connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
  epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
  getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  writev(6, [{""..., 830}, {""..., 120}], 2) = 950
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
  recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  readv(6, [{""..., 2356}], 1) = 2356
  readv(6, [{""..., 4096}], 1) = 3362
  epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 4096
  readv(6, [{""..., 4096}], 1) = 3619
  epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
  ... some other socket with activity and epoll_wait calls for 50 seconds
  epoll_wait(86, [], 512, 3331) = 0
  close(6)                = 0

 So the EPOLLRDHUP seems to never happen here.


As a side-note, in strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
except for the first readv() following revcfrom().


I hope this helps at looking for the cause, as well as that not collapsing FIN packets
also prevents the issue.

Cheers,
Bruno


> > Could you watch/report the output of " ss -temoi "  for the frozen TCP flow ?  
> 
> Here it is, three distinct reproducing attempts:
> State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> 
> 
> ESTAB      0      0                                                                   2001:a18:1:6::228:33572                                                                             2001:a18:1:6::217:webcache              uid:83 ino:16699 sk:e1 <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> 
> 
> ESTAB      0      0                                                                        158.64.2.228:44578                                                                                  158.64.2.217:webcache              uid:83 ino:17628 sk:12c <->
>          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> 
> 
> From nginx debug logging I don't get a real clue though it seems for working connections
> the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND | EPOLLWRNORM |
> EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT) - previous ones are 5
> while for failing connections it looks like last event seen is 5 (EPOLLIN | EPOLLOUT).
> 
> > This migtht give us a clue about packets being dropped, say the the
> > accumulated packet became too big.  
> 
> 
> The following minor patch (might be white-space mangled) does prevent the issue
> happening for me:
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 4904250a9aac..c102cd367c79 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
>             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
>             ((TCP_SKB_CB(tail)->tcp_flags |
> -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
>             ((TCP_SKB_CB(tail)->tcp_flags ^
>               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
>  #ifdef CONFIG_TLS_DEVICE
> 
> Cheers,
> Bruno
> 
> 
> > > rx-checksumming: on
> > > tx-checksumming: on
> > >         tx-checksum-ipv4: off [fixed]
> > >         tx-checksum-ip-generic: on
> > >         tx-checksum-ipv6: off [fixed]
> > >         tx-checksum-fcoe-crc: off [fixed]
> > >         tx-checksum-sctp: off [fixed]
> > > scatter-gather: on
> > >         tx-scatter-gather: on
> > >         tx-scatter-gather-fraglist: off [fixed]
> > > tcp-segmentation-offload: on
> > >         tx-tcp-segmentation: on
> > >         tx-tcp-ecn-segmentation: off [fixed]
> > >         tx-tcp-mangleid-segmentation: off
> > >         tx-tcp6-segmentation: on
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: on
> > > rx-vlan-offload: on
> > > tx-vlan-offload: on
> > > ntuple-filters: off [fixed]
> > > receive-hashing: off [fixed]
> > > highdma: on
> > > rx-vlan-filter: on [fixed]
> > > vlan-challenged: off [fixed]
> > > tx-lockless: off [fixed]
> > > netns-local: off [fixed]
> > > tx-gso-robust: off [fixed]
> > > tx-fcoe-segmentation: off [fixed]
> > > tx-gre-segmentation: off [fixed]
> > > tx-gre-csum-segmentation: off [fixed]
> > > tx-ipxip4-segmentation: off [fixed]
> > > tx-ipxip6-segmentation: off [fixed]
> > > tx-udp_tnl-segmentation: off [fixed]
> > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > tx-gso-partial: off [fixed]
> > > tx-sctp-segmentation: off [fixed]
> > > tx-esp-segmentation: off [fixed]
> > > tx-udp-segmentation: off [fixed]
> > > fcoe-mtu: off [fixed]
> > > tx-nocache-copy: off
> > > loopback: off [fixed]
> > > rx-fcs: off [fixed]
> > > rx-all: off [fixed]
> > > tx-vlan-stag-hw-insert: off [fixed]
> > > rx-vlan-stag-hw-parse: off [fixed]
> > > rx-vlan-stag-filter: off [fixed]
> > > l2-fwd-offload: off [fixed]
> > > hw-tc-offload: off [fixed]
> > > esp-hw-offload: off [fixed]
> > > esp-tx-csum-hw-offload: off [fixed]
> > > rx-udp_tunnel-port-offload: off [fixed]
> > > tls-hw-tx-offload: off [fixed]
> > > tls-hw-rx-offload: off [fixed]
> > > rx-gro-hw: off [fixed]
> > > tls-hw-record: off [fixed]
> > >
> > >
> > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > >
> > > Cheers,
> > > Bruno
> > >
> > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@...uxfoundation.org wrote:    
> > > > Hi,
> > > >
> > > > I've been chasing down why a python test from the python3 testsuite
> > > > started failing and it seems to point to this kernel change in the
> > > > networking stack.
> > > >
> > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > > >
> > > > $ python3 -m test test_httplib -v
> > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > == cwd: /var/volatile/tmp/test_python_288
> > > > == CPU count: 1
> > > > == encodings: locale=UTF-8, FS=utf-8
> > > > [...]
> > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > >
> > > > and it hangs in test_response_fileno.
> > > >
> > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > source tree and the code is:
> > > >
> > > >     def test_response_fileno(self):
> > > >         # Make sure fd returned by fileno is valid.
> > > >         serv = socket.socket(
> > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > >         self.addCleanup(serv.close)
> > > >         serv.bind((HOST, 0))
> > > >         serv.listen()
> > > >
> > > >         result = None
> > > >         def run_server():
> > > >             [conn, address] = serv.accept()
> > > >             with conn, conn.makefile("rb") as reader:
> > > >                 # Read the request header until a blank line
> > > >                 while True:
> > > >                     line = reader.readline()
> > > >                     if not line.rstrip(b"\r\n"):
> > > >                         break
> > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > >                 nonlocal result
> > > >                 result = reader.read()
> > > >
> > > >         thread = threading.Thread(target=run_server)
> > > >         thread.start()
> > > >         self.addCleanup(thread.join, float(1))
> > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > >         conn.request("CONNECT", "dummy:1234")
> > > >         response = conn.getresponse()
> > > >         try:
> > > >             self.assertEqual(response.status, client.OK)
> > > >             s = socket.socket(fileno=response.fileno())
> > > >             try:
> > > >                 s.sendall(b"proxied data\n")
> > > >             finally:
> > > >                 s.detach()
> > > >         finally:
> > > >             response.close()
> > > >             conn.close()
> > > >         thread.join()
> > > >         self.assertEqual(result, b"proxied data\n")
> > > >
> > > > I was hoping someone with more understanding of the networking stack
> > > > could look at this and tell whether its a bug in the python test, the
> > > > kernel change or otherwise give a pointer to where the problem might
> > > > be? I'll freely admit this is not an area I know much about.
> > > >
> > > > Cheers,
> > > >
> > > > Richard
> > > >
> > > >
> > > >