Message-ID: <CANn89iL3nwHHXM_VNXZbB_143wW+1nnmMqMLDQTXM=-6vLY3bA@mail.gmail.com>
Date:   Thu, 25 Apr 2019 06:30:24 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Bruno Prémont <bonbons@...ophe.eu>
Cc:     Richard Purdie <richard.purdie@...uxfoundation.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Alexander Kanavin <alex.kanavin@...il.com>,
        Bruce Ashfield <bruce.ashfield@...il.com>
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue

On Thu, Apr 25, 2019 at 6:13 AM Bruno Prémont <bonbons@...ophe.eu> wrote:
>
> Hi Eric,
>
> On Thu, 25 Apr 2019 09:55:31 +0200 Bruno Prémont wrote:
> > Hi Eric,
> >
> > On Wed, 24 Apr 2019 08:47:27 -0700 Eric Dumazet wrote:
> > > On Wed, Apr 24, 2019 at 7:51 AM Bruno Prémont wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm seeing issues with this patch as well, not as regularly as
> > > > Richard does, but still fairly often (up to about one in 30-50 TCP sessions).
> > > >
> > > > In my case I have a virtual machine (on VMware) with this patch where
> > > > NGINX, acting as a reverse proxy, misses part (the end) of the payload from
> > > > its upstream and times out on the upstream connection (while, according to
> > > > tcpdump, all packets including the upstream's FIN were sent and the
> > > > upstream did get ACKs from the VM).
> > > >
> > > > From what browsers get from NGINX, it feels as if at some point reading
> > > > from the socket or waiting for data using select() never returned data
> > > > that had arrived, since more than just the EOF is missing.
> > > >
> > > > The upstream is a hardware machine in the same subnet.
> > > >
> > > > My VM is using a VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
> > > > as its network adapter, which lists the following features:
> > > >
> > >
> > > Hi Bruno.
> > >
> > > I suspect an EPOLLIN notification is being lost by the application.
> > >
> > > The fact that the TCP backlog contains 1 instead of 2+ packets should not
> > > change stack behavior; the packet should land in the socket receive queue
> > > eventually.
> > >
> > > Are you using epoll() in edge-triggered mode? You mention select(), but
> > > select() is a rather old and inefficient API.
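
To spell out why the edge-trigger question matters: with EPOLLET the consumer
has to keep reading until it gets EAGAIN (or 0 for EOF), because the kernel
will not re-signal data that was already queued when the edge fired. Roughly
this pattern (an untested sketch just to illustrate, not nginx code):

#include <errno.h>
#include <unistd.h>

/* Drain an edge-triggered, non-blocking socket.
 * Returns 1 on EOF (peer sent FIN), 0 once EAGAIN is reached, -1 on error. */
static int drain_socket(int fd)
{
        char buf[4096];

        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));

                if (n > 0)
                        continue;       /* consume and keep reading */
                if (n == 0)
                        return 1;       /* peer closed; do not wait for another event */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                        return 0;       /* fully drained; wait for the next edge */
                if (errno == EINTR)
                        continue;
                return -1;              /* real error */
        }
}

If the application stops reading before seeing EAGAIN or EOF and then sits in
epoll_wait(), it can wait forever even though the FIN is already in the
receive queue.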
> >
> > nginx is using epoll (cf. http://nginx.org/en/docs/events.html)
> >
> > For source, see here
> > https://trac.nginx.org/nginx/browser/nginx/src/event/modules/ngx_epoll_module.c?rev=ebf8c9686b8ce7428f975d8a567935ea3722da70
>
> I've been trying to reproduce while running nginx under strace (with verbose logging), but
> have not been successful. It looks like tracing modifies the timing in a
> way that prevents the issue.
> Without verbose logging I could hit it (though it needed far more perseverance).
>
> What I do have for a working session under strace is:
>   socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 12
>   ioctl(12, FIONBIO, [1]) = 0
>   epoll_ctl(9, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221158320, u64=139682147558832}}) = 0
>   connect(12, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "2001:a18:1:6::217", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
>   epoll_wait(9, [{EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57230) = 1
>   getsockopt(12, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>   writev(12, [{""..., 830}, {""..., 115}], 2) = 945
>   epoll_wait(9, [{EPOLLIN|EPOLLOUT, {u32=1221158320, u64=139682147558832}}], 512, 57220) = 1
>   recvfrom(12, ""..., 4096, 0, NULL, NULL) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 4096
>   readv(12, [{""..., 4096}], 1) = 3855
>   epoll_wait(9, [{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1221158320, u64=139682147558832}}], 512, 57190) = 1
>   readv(12, [{"", 241}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}, {"", 4096}], 7) = 0
>   close(12)               = 0
>
> Under failing session I got:
>   socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 6
>   ioctl(6, FIONBIO, [1])  = 0
>   epoll_ctl(86, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1221159521, u64=139682147560033}}) = 0
>   connect(6, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("158.64.2.217")}, 16) = -1 EINPROGRESS (Operation now in progress)
>   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
>   getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>   writev(6, [{""..., 830}, {""..., 120}], 2) = 950
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 60000) = 1
>   recvfrom(6, ""..., 4096, 0, NULL, NULL) = 1740
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   readv(6, [{""..., 2356}], 1) = 2356
>   readv(6, [{""..., 4096}], 1) = 3362
>   epoll_wait(86, [{EPOLLIN|EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   readv(6, [{""..., 734}, {""..., 4096}], 2) = 4830
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 4096
>   readv(6, [{""..., 4096}], 1) = 3619
>   epoll_wait(86, [{EPOLLOUT, {u32=1221159521, u64=139682147560033}}], 512, 59980) = 1
>   ... some other socket with activity and epoll_wait calls for 50 seconds
>   epoll_wait(86, [], 512, 3331) = 0
>   close(6)                = 0
>
> So the EPOLLRDHUP event never seems to happen here.
>
>
> As a side note, in strace output (over lots of sessions) I see almost no readv() returning -1 EAGAIN
> except for the first readv() following recvfrom().
>
>
> I hope this helps in looking for the cause, along with the observation that
> not coalescing FIN packets prevents the issue.

Very interesting, Bruno, thanks a lot for all these details.

I wonder if you have some pre-existing epoll notification bug that would
trigger when the last packet received from the peer contains both data and
FIN.
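
For what it's worth, that combination is easy to produce from the sender
side: a final write() immediately followed by close() can put the last bytes
and the FIN into a single segment, and even when they go out as two packets
the backlog coalescing on the receiver (the behavior your workaround disables
for FIN) may merge them. Purely as an illustration, nothing nginx-specific:

#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical upstream side: send the tail of a response and close right
 * away. Depending on timing, the last data bytes and the FIN can share one
 * segment; a correct edge-triggered reader then sees the data followed by
 * read() == 0 within the same wakeup. */
static void finish_response(int fd)
{
        static const char tail[] = "0\r\n\r\n";   /* e.g. a final HTTP chunk */

        (void)send(fd, tail, sizeof(tail) - 1, 0);
        close(fd);                                /* FIN follows right away */
}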

Were the "ss" outputs taken from nginx side ?

>
> Cheers,
> Bruno
>
>
> > > Could you watch/report the output of "ss -temoi" for the frozen TCP flow?
> >
> > Here it is, three distinct reproducing attempts:
> > State      Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
> > ESTAB      0       0       158.64.2.228:44248  158.64.2.217:webcache              uid:83 ino:13245 sk:87 <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.24/0.118 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28381 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 482.7Mbps lastsnd:46810 lastrcv:46790 lastack:46790 pacing_rate 965.3Mbps delivery_rate 74.3Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.156
> >
> >
> > ESTAB      0       0       2001:a18:1:6::228:33572  2001:a18:1:6::217:webcache  uid:83 ino:16699 sk:e1 <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.231/0.11 ato:40 mss:1428 rcvmss:1428 advmss:1428 cwnd:10 bytes_acked:948 bytes_received:28474 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 494.5Mbps lastsnd:8380 lastrcv:8360 lastack:8360 pacing_rate 989.1Mbps delivery_rate 71.0Mbps app_limited rcv_rtt:1.109 rcv_space:14280 minrtt:0.161
> >
> >
> > ESTAB      0       0       158.64.2.228:44578  158.64.2.217:webcache  uid:83 ino:17628 sk:12c <->
> >          skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) ts sack cubic wscale:7,7 rto:210 rtt:0.279/0.136 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:949 bytes_received:28481 segs_out:12 segs_in:12 data_segs_out:1 data_segs_in:10 send 415.2Mbps lastsnd:11360 lastrcv:11330 lastack:11340 pacing_rate 828.2Mbps delivery_rate 61.9Mbps app_limited rcv_rtt:1 rcv_space:14480 minrtt:0.187
> >
> >
> > From nginx debug logging I don't get a real clue, though it seems that for
> > working connections the last event obtained is 2005 (EPOLLMSG | EPOLLWRBAND |
> > EPOLLWRNORM | EPOLLRDBAND | EPOLLRDNORM | EPOLLHUP | EPOLLIN | EPOLLOUT),
> > the previous ones being 5, while for failing connections the last event seen
> > is 5 (EPOLLIN | EPOLLOUT).
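
Side note: those event numbers are easier to compare with a throwaway decoder
like the one below (purely illustrative, not nginx code). I am showing both
the hexadecimal and the decimal reading of 2005, since the interpretation
depends on which base the nginx debug log uses:

#include <stddef.h>
#include <stdio.h>
#include <sys/epoll.h>

/* Print the epoll flag names present in "ev". */
static void decode(unsigned int ev)
{
        static const struct { unsigned int bit; const char *name; } flags[] = {
                { EPOLLIN, "EPOLLIN" },         { EPOLLOUT, "EPOLLOUT" },
                { EPOLLRDHUP, "EPOLLRDHUP" },   { EPOLLPRI, "EPOLLPRI" },
                { EPOLLERR, "EPOLLERR" },       { EPOLLHUP, "EPOLLHUP" },
                { EPOLLRDNORM, "EPOLLRDNORM" }, { EPOLLRDBAND, "EPOLLRDBAND" },
                { EPOLLWRNORM, "EPOLLWRNORM" }, { EPOLLWRBAND, "EPOLLWRBAND" },
                { EPOLLMSG, "EPOLLMSG" },
        };

        printf("0x%x:", ev);
        for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++)
                if (ev & flags[i].bit)
                        printf(" %s", flags[i].name);
        printf("\n");
}

int main(void)
{
        decode(0x2005); /* if the logged value is hexadecimal */
        decode(2005);   /* if the logged value is decimal (0x7d5) */
        decode(5);      /* EPOLLIN | EPOLLOUT */
        return 0;
}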
> >
> > > This might give us a clue about packets being dropped, say if the
> > > accumulated packet became too big.
> >
> >
> > The following minor patch (it might be whitespace-mangled) does prevent the
> > issue from happening for me:
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 4904250a9aac..c102cd367c79 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1667,7 +1667,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
> >         if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> >             TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> >             ((TCP_SKB_CB(tail)->tcp_flags |
> > -             TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
> > +             TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_URG | TCPHDR_FIN)) ||
> >             ((TCP_SKB_CB(tail)->tcp_flags ^
> >               TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
> >  #ifdef CONFIG_TLS_DEVICE
> >
> > Cheers,
> > Bruno
> >
> >
> > > > rx-checksumming: on
> > > > tx-checksumming: on
> > > >         tx-checksum-ipv4: off [fixed]
> > > >         tx-checksum-ip-generic: on
> > > >         tx-checksum-ipv6: off [fixed]
> > > >         tx-checksum-fcoe-crc: off [fixed]
> > > >         tx-checksum-sctp: off [fixed]
> > > > scatter-gather: on
> > > >         tx-scatter-gather: on
> > > >         tx-scatter-gather-fraglist: off [fixed]
> > > > tcp-segmentation-offload: on
> > > >         tx-tcp-segmentation: on
> > > >         tx-tcp-ecn-segmentation: off [fixed]
> > > >         tx-tcp-mangleid-segmentation: off
> > > >         tx-tcp6-segmentation: on
> > > > udp-fragmentation-offload: off
> > > > generic-segmentation-offload: on
> > > > generic-receive-offload: on
> > > > large-receive-offload: on
> > > > rx-vlan-offload: on
> > > > tx-vlan-offload: on
> > > > ntuple-filters: off [fixed]
> > > > receive-hashing: off [fixed]
> > > > highdma: on
> > > > rx-vlan-filter: on [fixed]
> > > > vlan-challenged: off [fixed]
> > > > tx-lockless: off [fixed]
> > > > netns-local: off [fixed]
> > > > tx-gso-robust: off [fixed]
> > > > tx-fcoe-segmentation: off [fixed]
> > > > tx-gre-segmentation: off [fixed]
> > > > tx-gre-csum-segmentation: off [fixed]
> > > > tx-ipxip4-segmentation: off [fixed]
> > > > tx-ipxip6-segmentation: off [fixed]
> > > > tx-udp_tnl-segmentation: off [fixed]
> > > > tx-udp_tnl-csum-segmentation: off [fixed]
> > > > tx-gso-partial: off [fixed]
> > > > tx-sctp-segmentation: off [fixed]
> > > > tx-esp-segmentation: off [fixed]
> > > > tx-udp-segmentation: off [fixed]
> > > > fcoe-mtu: off [fixed]
> > > > tx-nocache-copy: off
> > > > loopback: off [fixed]
> > > > rx-fcs: off [fixed]
> > > > rx-all: off [fixed]
> > > > tx-vlan-stag-hw-insert: off [fixed]
> > > > rx-vlan-stag-hw-parse: off [fixed]
> > > > rx-vlan-stag-filter: off [fixed]
> > > > l2-fwd-offload: off [fixed]
> > > > hw-tc-offload: off [fixed]
> > > > esp-hw-offload: off [fixed]
> > > > esp-tx-csum-hw-offload: off [fixed]
> > > > rx-udp_tunnel-port-offload: off [fixed]
> > > > tls-hw-tx-offload: off [fixed]
> > > > tls-hw-rx-offload: off [fixed]
> > > > rx-gro-hw: off [fixed]
> > > > tls-hw-record: off [fixed]
> > > >
> > > >
> > > > I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
> > > >
> > > > Cheers,
> > > > Bruno
> > > >
> > > > On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@...uxfoundation.org wrote:
> > > > > Hi,
> > > > >
> > > > > I've been chasing down why a python test from the python3 testsuite
> > > > > started failing and it seems to point to this kernel change in the
> > > > > networking stack.
> > > > >
> > > > > In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> > > > > test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> > > > > 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
> > > > >
> > > > > $ python3 -m test test_httplib -v
> > > > > == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> > > > > == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> > > > > == cwd: /var/volatile/tmp/test_python_288
> > > > > == CPU count: 1
> > > > > == encodings: locale=UTF-8, FS=utf-8
> > > > > [...]
> > > > > test_response_fileno (test.test_httplib.BasicTest) ...
> > > > >
> > > > > and it hangs in test_response_fileno.
> > > > >
> > > > > The test in question comes from Lib/test/test_httplib.py in the python
> > > > > source tree and the code is:
> > > > >
> > > > >     def test_response_fileno(self):
> > > > >         # Make sure fd returned by fileno is valid.
> > > > >         serv = socket.socket(
> > > > >             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
> > > > >         self.addCleanup(serv.close)
> > > > >         serv.bind((HOST, 0))
> > > > >         serv.listen()
> > > > >
> > > > >         result = None
> > > > >         def run_server():
> > > > >             [conn, address] = serv.accept()
> > > > >             with conn, conn.makefile("rb") as reader:
> > > > >                 # Read the request header until a blank line
> > > > >                 while True:
> > > > >                     line = reader.readline()
> > > > >                     if not line.rstrip(b"\r\n"):
> > > > >                         break
> > > > >                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
> > > > >                 nonlocal result
> > > > >                 result = reader.read()
> > > > >
> > > > >         thread = threading.Thread(target=run_server)
> > > > >         thread.start()
> > > > >         self.addCleanup(thread.join, float(1))
> > > > >         conn = client.HTTPConnection(*serv.getsockname())
> > > > >         conn.request("CONNECT", "dummy:1234")
> > > > >         response = conn.getresponse()
> > > > >         try:
> > > > >             self.assertEqual(response.status, client.OK)
> > > > >             s = socket.socket(fileno=response.fileno())
> > > > >             try:
> > > > >                 s.sendall(b"proxied data\n")
> > > > >             finally:
> > > > >                 s.detach()
> > > > >         finally:
> > > > >             response.close()
> > > > >             conn.close()
> > > > >         thread.join()
> > > > >         self.assertEqual(result, b"proxied data\n")
> > > > >
> > > > > I was hoping someone with more understanding of the networking stack
> > > > > could look at this and tell whether it's a bug in the python test or in
> > > > > the kernel change, or otherwise give a pointer to where the problem
> > > > > might be. I'll freely admit this is not an area I know much about.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Richard
> > > > >
> > > > >
> > > > >
