Message-ID: <20190424165150.1420b046@pluto.restena.lu>
Date: Wed, 24 Apr 2019 16:51:50 +0200
From: Bruno Prémont <bonbons@...ophe.eu>
To: Eric Dumazet <edumazet@...gle.com>
Cc: richard.purdie@...uxfoundation.org,
Neal Cardwell <ncardwell@...gle.com>,
Yuchung Cheng <ycheng@...gle.com>,
"David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
Alexander Kanavin <alex.kanavin@...il.com>,
Bruce Ashfield <bruce.ashfield@...il.com>
Subject: Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
Hi Eric,

I'm seeing issues with this patch as well, though not as regularly as
Richard does (at most about one in 30-50 TCP sessions is affected).

In my case I have a virtual machine (on VMware) running a kernel with
this patch, where NGINX acting as reverse proxy misses the end of the
payload from its upstream and times out on the upstream connection
(while according to tcpdump all packets, including the upstream's FIN,
were sent and the upstream did get ACKs from the VM).

Judging from what browsers receive from NGINX, it looks as if at some
point reading from the socket, or waiting for data using select(), never
returned data that had arrived, since more than just the EOF is missing.

The upstream is a hardware machine in the same subnet.

My VM is using VMware VMXNET3 Ethernet Controller [15ad:07b0] (rev 01)
as network adapter which lists the following features:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
I can reproduce the issue with kernels 5.0.x and as recent as 5.1-rc6.
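
For illustration only, the read pattern described above (wait for readability, then read until EOF) can be sketched as follows. This is a minimal sketch and not a confirmed reproducer: it runs over loopback, so it does not exercise the VMXNET3 receive path. The payload size, the timeout and the helper names (upstream, read_until_eof, run_once) are hypothetical, and select() here merely stands in for whatever event loop the proxy really uses.

```python
import select
import socket
import threading

PAYLOAD = b"x" * 200_000  # hypothetical upstream response body


def upstream(listener):
    """Send the whole payload, then close so the peer sees a FIN."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(PAYLOAD)


def read_until_eof(sock, timeout=5.0):
    """Wait with select() and recv() until EOF; return (data, saw_eof)."""
    chunks = []
    while True:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            # Timed out without EOF: the symptom described in this report.
            return b"".join(chunks), False
        chunk = sock.recv(65536)
        if not chunk:
            # Empty read means the peer closed the connection (EOF).
            return b"".join(chunks), True
        chunks.append(chunk)


def run_once():
    """Run one upstream/client exchange over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    thread = threading.Thread(target=upstream, args=(listener,))
    thread.start()
    client = socket.create_connection(listener.getsockname())
    try:
        return read_until_eof(client)
    finally:
        client.close()
        thread.join()
        listener.close()


if __name__ == "__main__":
    data, saw_eof = run_once()
    print(len(data), "of", len(PAYLOAD), "bytes received, EOF seen:", saw_eof)
```

A failing session would show up as fewer bytes than were sent together with no EOF before the timeout; running run_once() in a loop and counting such cases would give a rough failure rate.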
Cheers,
Bruno
On Sunday, April 7, 2019 11:28:30 PM CEST, richard.purdie@...uxfoundation.org wrote:
> Hi,
>
> I've been chasing down why a python test from the python3 testsuite
> started failing and it seems to point to this kernel change in the
> networking stack.
>
> In kernels beyond commit 4f693b55c3d2d2239b8a0094b518a1e533cf75d5 the
> test hangs about 90% of the time (I've reproduced with 5.1-rc3, 5.0.7,
> 5.0-rc1 but not 4.18, 4.19 or 4.20). The reproducer is:
>
> $ python3 -m test test_httplib -v
> == CPython 3.7.2 (default, Apr 5 2019, 15:17:15) [GCC 8.3.0]
> == Linux-5.0.0-yocto-standard-x86_64-with-glibc2.2.5 little-endian
> == cwd: /var/volatile/tmp/test_python_288
> == CPU count: 1
> == encodings: locale=UTF-8, FS=utf-8
> [...]
> test_response_fileno (test.test_httplib.BasicTest) ...
>
> and it hangs in test_response_fileno.
>
> The test in question comes from Lib/test/test_httplib.py in the python
> source tree and the code is:
>
>     def test_response_fileno(self):
>         # Make sure fd returned by fileno is valid.
>         serv = socket.socket(
>             socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
>         self.addCleanup(serv.close)
>         serv.bind((HOST, 0))
>         serv.listen()
>
>         result = None
>         def run_server():
>             [conn, address] = serv.accept()
>             with conn, conn.makefile("rb") as reader:
>                 # Read the request header until a blank line
>                 while True:
>                     line = reader.readline()
>                     if not line.rstrip(b"\r\n"):
>                         break
>                 conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
>                 nonlocal result
>                 result = reader.read()
>
>         thread = threading.Thread(target=run_server)
>         thread.start()
>         self.addCleanup(thread.join, float(1))
>         conn = client.HTTPConnection(*serv.getsockname())
>         conn.request("CONNECT", "dummy:1234")
>         response = conn.getresponse()
>         try:
>             self.assertEqual(response.status, client.OK)
>             s = socket.socket(fileno=response.fileno())
>             try:
>                 s.sendall(b"proxied data\n")
>             finally:
>                 s.detach()
>         finally:
>             response.close()
>             conn.close()
>         thread.join()
>         self.assertEqual(result, b"proxied data\n")
>
> I was hoping someone with more understanding of the networking stack
> could look at this and tell whether it's a bug in the python test, the
> kernel change or otherwise give a pointer to where the problem might
> be? I'll freely admit this is not an area I know much about.
>
> Cheers,
>
> Richard
>
>
>