Date:   Tue, 23 Jan 2018 15:21:24 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Ben Greear <greearb@...delatech.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote:
> On 01/23/2018 02:29 PM, Eric Dumazet wrote:
> > On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:
> > > On 01/23/2018 02:07 PM, Eric Dumazet wrote:
> > > > On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:
> > > > > On 01/22/2018 10:16 AM, Eric Dumazet wrote:
> > > > > > On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
> > > > > > > My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other
> > > > > > > on a system with 16GB RAM and send slow-speed data.  This works fine on a 4.7 kernel, but
> > > > > > > will not work at all on a 4.13.  The 4.13 first complains about running out of tcp memory,
> > > > > > > but even after forcing those values higher, the max connections we can get is around 15k.
> > > > > > > 
> > > > > > > Both kernels have my out-of-tree patches applied, so it is possible it is my fault
> > > > > > > at this point.
> > > > > > > 
> > > > > > > Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels?
> > > > > > > 
> > > > > > > I will start bisecting in the meantime...
> > > > > > > 
> > > > > > 
> > > > > > Hi Ben
> > > > > > 
> > > > > > Unfortunately I have no idea.
> > > > > > 
> > > > > > Are you using loopback flows, or have I misunderstood you?
> > > > > > 
> > > > > > How can loopback connections be slow-speed?
> > > > > > 
> > > > > 
> > > > > Hello Eric, looks like it is one of your commits that causes the issue
> > > > > I see.
> > > > > 
> > > > > Here are some more details on my specific test case I used to bisect:
> > > > > 
> > > > > I have two ixgbe ports looped back, configured on same subnet, but with different IPs.
> > > > > Routing table rules, SO_BINDTODEVICE, binding to specific IPs on both client and server
> > > > > side let me send-to-self over the external looped cable.
> > > > > 
> > > > > I have 2 mac-vlans on each physical interface.
> > > > > 
> > > > > I created 5 server-side endpoints (listeners) on one physical port, and two more on one of the mac-vlans.
> > > > > 
> > > > > On the client-side, I create a process that spawns 5000 connections to the corresponding server side.
> > > > > 
> > > > > End result is 25,000 connections on one pair of real interfaces, and 10,000 connections on the
> > > > > mac-vlan ports.
> > > > > 
> > > > > In the passing case, I get very close to all 5000 connections on all endpoints quickly.
> > > > > 
> > > > > In the failing case, I get a max of around 16k connections on the two physical ports.  The two mac-vlans have 10k connections
> > > > > across them working reliably.  It seems to be an issue with 'connect' failing.
> > > > > 
> > > > > connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
> > > > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
> > > > > fcntl(2075, F_GETFD)                    = 0
> > > > > fcntl(2075, F_SETFD, FD_CLOEXEC)        = 0
> > > > > setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
> > > > > setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > > > bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
> > > > > getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
> > > > > getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
> > > > > setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
> > > > > fcntl(2075, F_GETFL)                    = 0x2 (flags O_RDWR)
> > > > > fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
> > > > > connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
> > > > > socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
> > > > > fcntl(2076, F_GETFD)                    = 0
> > > > > fcntl(2076, F_SETFD, FD_CLOEXEC)        = 0
> > > > > setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
> > > > > setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > > > > bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
> > > > > getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
> > > > > getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
> > > > > setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
> > > > > fcntl(2076, F_GETFL)                    = 0x2 (flags O_RDWR)
> > > > > fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
> > > > > connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
> > > > > ....
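> > > > > 
> > > > > For reference, that traced sequence corresponds roughly to the C sketch
> > > > > below (an illustrative sketch only, not the actual tool code: the helper
> > > > > name is made up, the device name and addresses are taken from the trace,
> > > > > and error handling is omitted).
> > > > > 
> > > > > #include <arpa/inet.h>
> > > > > #include <fcntl.h>
> > > > > #include <netinet/in.h>
> > > > > #include <netinet/tcp.h>
> > > > > #include <sys/socket.h>
> > > > > 
> > > > > static int open_client_socket(void)
> > > > > {
> > > > >         struct sockaddr_in local = { 0 }, remote = { 0 };
> > > > >         int one = 1, zero = 0;
> > > > >         int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
> > > > > 
> > > > >         fcntl(fd, F_SETFD, FD_CLOEXEC);
> > > > >         /* Force traffic out of a specific interface, as in the trace. */
> > > > >         setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "eth4", 5);
> > > > >         setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
> > > > > 
> > > > >         /* Bind to a specific local IP with port 0: the kernel picks and
> > > > >          * reserves an ephemeral port here, at bind() time. */
> > > > >         local.sin_family = AF_INET;
> > > > >         local.sin_port = htons(0);
> > > > >         local.sin_addr.s_addr = inet_addr("10.1.1.4");
> > > > >         bind(fd, (struct sockaddr *)&local, sizeof(local));
> > > > > 
> > > > >         /* Leave Nagle enabled, as in the trace (IPPROTO_TCP == SOL_TCP). */
> > > > >         setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &zero, sizeof(zero));
> > > > >         fcntl(fd, F_SETFL, O_RDWR | O_NONBLOCK);
> > > > > 
> > > > >         /* Non-blocking connect: EINPROGRESS is the normal result here,
> > > > >          * EADDRNOTAVAIL is the failure shown above. */
> > > > >         remote.sin_family = AF_INET;
> > > > >         remote.sin_port = htons(33012);
> > > > >         remote.sin_addr.s_addr = inet_addr("10.1.1.5");
> > > > >         connect(fd, (struct sockaddr *)&remote, sizeof(remote));
> > > > >         return fd;
> > > > > }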
> > > > > 
> > > > > 
> > > > > ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
> > > > > commit ea8add2b190395408b22a9127bed2c0912aecbc8
> > > > > Author: Eric Dumazet <edumazet@...gle.com>
> > > > > Date:   Thu Feb 11 16:28:50 2016 -0800
> > > > > 
> > > > >      tcp/dccp: better use of ephemeral ports in bind()
> > > > > 
> > > > >      Implement strategy used in __inet_hash_connect() in opposite way :
> > > > > 
> > > > >      Try to find a candidate using odd ports, then fallback to even ports.
> > > > > 
> > > > >      We no longer disable BH for whole traversal, but one bucket at a time.
> > > > >      We also use cond_resched() to yield cpu to other tasks if needed.
> > > > > 
> > > > >      I removed one indentation level and tried to mirror the loop we have
> > > > >      in __inet_hash_connect() and variable names to ease code maintenance.
> > > > > 
> > > > >      Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > > >      Signed-off-by: David S. Miller <davem@...emloft.net>
> > > > > 
> > > > > :040000 040000 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M	net
> > > > > 
> > > > > 
> > > > > I will be happy to test patches or try to get any other results that might help diagnose
> > > > > this problem better.
> > > > 
> > > > Problem is I do not see anything obvious here.
> > > > 
> > > > Please provide /proc/sys/net/ipv4/ip_local_port_range
> > > 
> > > [root@...003-e3v2-13100124-f20x64 ~]# cat /proc/sys/net/ipv4/ip_local_port_range
> > > 10000	61001
> > > 
> > > > 
> > > > Also, you could probably use the IP_BIND_ADDRESS_NO_PORT socket option
> > > > before the bind().
> > > 
> > > I'll read up on that to see what it does...
> > 
> > man 7 ip
> > 
> >        IP_BIND_ADDRESS_NO_PORT (since Linux 4.2)
> >               Inform the kernel to not reserve an ephemeral port
> >               when using bind(2) with a port number of 0.  The port
> >               will later be automatically chosen at connect(2) time,
> >               in a way that allows sharing a source port as long as
> >               the 4-tuple is unique.
> > 
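> > A minimal sketch of how that could look (the helper name is made up; the
> > option needs Linux >= 4.2, and the fallback define covers older userspace
> > headers that do not expose the constant yet):
> > 
> > #include <arpa/inet.h>
> > #include <netinet/in.h>
> > #include <sys/socket.h>
> > 
> > #ifndef IP_BIND_ADDRESS_NO_PORT
> > #define IP_BIND_ADDRESS_NO_PORT 24      /* value from <linux/in.h> */
> > #endif
> > 
> > /* Bind to a specific local IP but let connect() pick the source port,
> >  * so a port only has to be unique per 4-tuple, not per local address. */
> > static int bind_local_ip_no_port(int fd, const char *local_ip)
> > {
> >         struct sockaddr_in local = { 0 };
> >         int one = 1;
> > 
> >         /* Must be set before bind(); sin_port stays 0 until connect(). */
> >         setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));
> > 
> >         local.sin_family = AF_INET;
> >         local.sin_port = htons(0);
> >         local.sin_addr.s_addr = inet_addr(local_ip);
> >         return bind(fd, (struct sockaddr *)&local, sizeof(local));
> > }
> > 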
> 
> Yes, I found that.
> 
> It appears this option works well for my case, and I see 30k connections across my pair of e1000e
> ports (though the NIC is acting up again, so I guess its issues are not fully resolved).
> 
> I tested this on my 4.13.16+ kernel.
> 
> But that said, maybe there is still some issue with the patch I bisected to, so if you have
> other suggestions, I can back out this IP_BIND_ADDRESS_NO_PORT feature and re-test.
> 
> Also, I had to increase /proc/sys/net/ipv4/tcp_mem to get 30k connections to work without
> the kernel spamming:
> 
> Jan 23 15:02:41 lf1003-e3v2-13100124-f20x64 kernel: TCP: out of memory -- consider tuning tcp_mem
> Jan 23 15:02:41 lf1003-e3v2-13100124-f20x64 kernel: TCP: out of memory -- consider tuning tcp_mem
> 
> This is a 16 GB RAM system, and I did not have to tune this on the 4.5.0-rc2+ (good) kernels
> to get similar performance.  I was testing on ixgbe there, though, so possibly that is part
> of it, or maybe I just need to force tcp_mem to be larger on more recent kernels?


Since linux-4.2, the tcp_mem[0..2] defaults are 4.68%, 6.25%, and 9.37% of
physical memory.

It used to be twice that in older kernels.

It is also possible that some change in TCP congestion control or autotuning
allows each of your TCP flows to store more data in its write queue,
if your application is pushing bulk data as fast as it can.

It is virtually impossible to change anything in the kernel with
zero impact on very pathological use cases.

tcp_wmem[2] is 4MB.

30,000 * 4MB = 120 GB

Definitely more than your physical memory.
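
For what it's worth, here is a throwaway sketch (not something from this
thread) that reads those two sysctls and prints the worst case for a given
connection count; note tcp_wmem is in bytes while tcp_mem is in pages:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return the third field of a "min pressure max" style sysctl file. */
static long third_field(const char *path)
{
        long a = 0, b = 0, c = -1;
        FILE *f = fopen(path, "r");

        if (f && fscanf(f, "%ld %ld %ld", &a, &b, &c) != 3)
                c = -1;
        if (f)
                fclose(f);
        return c;
}

int main(int argc, char **argv)
{
        long conns = argc > 1 ? atol(argv[1]) : 30000;
        long wmem_max = third_field("/proc/sys/net/ipv4/tcp_wmem");  /* bytes */
        long mem_max = third_field("/proc/sys/net/ipv4/tcp_mem");    /* pages */
        long page = sysconf(_SC_PAGESIZE);

        printf("worst-case write queues: %ld conns * %ld bytes = %.1f GB\n",
               conns, wmem_max, conns * (double)wmem_max / (1 << 30));
        printf("tcp_mem[2] ceiling: %ld pages = %.1f GB\n",
               mem_max, mem_max * (double)page / (1 << 30));
        return 0;
}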
