Date: Tue, 23 Jan 2018 16:05:50 -0800
From: Ben Greear <greearb@...delatech.com>
To: Eric Dumazet <eric.dumazet@...il.com>, netdev <netdev@...r.kernel.org>
Subject: Re: TCP many-connection regression (bisected to 4.5.0-rc2+)

On 01/23/2018 03:27 PM, Ben Greear wrote:
> On 01/23/2018 03:21 PM, Eric Dumazet wrote:
>> On Tue, 2018-01-23 at 15:10 -0800, Ben Greear wrote:
>>> On 01/23/2018 02:29 PM, Eric Dumazet wrote:
>>>> On Tue, 2018-01-23 at 14:09 -0800, Ben Greear wrote:
>>>>> On 01/23/2018 02:07 PM, Eric Dumazet wrote:
>>>>>> On Tue, 2018-01-23 at 13:49 -0800, Ben Greear wrote:
>>>>>>> On 01/22/2018 10:16 AM, Eric Dumazet wrote:
>>>>>>>> On Mon, 2018-01-22 at 09:28 -0800, Ben Greear wrote:
>>>>>>>>> My test case is to have 6 processes each create 5000 TCP IPv4 connections to each other
>>>>>>>>> on a system with 16GB RAM and send slow-speed data. This works fine on a 4.7 kernel, but
>>>>>>>>> will not work at all on a 4.13. The 4.13 kernel first complains about running out of TCP
>>>>>>>>> memory, but even after forcing those values higher, the max connections we can get is
>>>>>>>>> around 15k.
>>>>>>>>>
>>>>>>>>> Both kernels have my out-of-tree patches applied, so it is possible it is my fault
>>>>>>>>> at this point.
>>>>>>>>>
>>>>>>>>> Any suggestions as to what this might be caused by, or if it is fixed in more recent kernels?
>>>>>>>>>
>>>>>>>>> I will start bisecting in the meantime...
>>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> Unfortunately I have no idea.
>>>>>>>>
>>>>>>>> Are you using loopback flows, or have I misunderstood you?
>>>>>>>>
>>>>>>>> How can loopback connections be slow-speed?
>>>>>>>
>>>>>>> Hello Eric, it looks like one of your commits causes the issue I see.
>>>>>>>
>>>>>>> Here are some more details on the specific test case I used to bisect:
>>>>>>>
>>>>>>> I have two ixgbe ports looped back, configured on the same subnet but with different IPs.
>>>>>>> Routing table rules, SO_BINDTODEVICE, and binding to specific IPs on both the client and
>>>>>>> server side let me send-to-self over the external looped cable.
>>>>>>>
>>>>>>> I have 2 mac-vlans on each physical interface.
>>>>>>>
>>>>>>> I created 5 server-side endpoints on one physical port, and two more on one of the mac-vlans.
>>>>>>>
>>>>>>> On the client side, I create a process that spawns 5000 connections to the corresponding
>>>>>>> server side.
>>>>>>>
>>>>>>> The end result is 25,000 connections on one pair of real interfaces, and 10,000 connections
>>>>>>> on the mac-vlan ports.
>>>>>>>
>>>>>>> In the passing case, I quickly get very close to all 5000 connections on all endpoints.
>>>>>>>
>>>>>>> In the failing case, I get a max of around 16k connections on the two physical ports. The
>>>>>>> two mac-vlans have 10k connections across them working reliably. It seems to be an issue
>>>>>>> with 'connect' failing:
>>>>>>>
>>>>>>> connect(2074, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
>>>>>>> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2075
>>>>>>> fcntl(2075, F_GETFD) = 0
>>>>>>> fcntl(2075, F_SETFD, FD_CLOEXEC) = 0
>>>>>>> setsockopt(2075, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
>>>>>>> setsockopt(2075, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>>>>>>> bind(2075, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
>>>>>>> getsockopt(2075, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
>>>>>>> getsockopt(2075, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
>>>>>>> setsockopt(2075, SOL_TCP, TCP_NODELAY, [0], 4) = 0
>>>>>>> fcntl(2075, F_GETFL) = 0x2 (flags O_RDWR)
>>>>>>> fcntl(2075, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
>>>>>>> connect(2075, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)
>>>>>>> socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 2076
>>>>>>> fcntl(2076, F_GETFD) = 0
>>>>>>> fcntl(2076, F_SETFD, FD_CLOEXEC) = 0
>>>>>>> setsockopt(2076, SOL_SOCKET, SO_BINDTODEVICE, "eth4\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
>>>>>>> setsockopt(2076, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>>>>>>> bind(2076, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.1.1.4")}, 16) = 0
>>>>>>> getsockopt(2076, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
>>>>>>> getsockopt(2076, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
>>>>>>> setsockopt(2076, SOL_TCP, TCP_NODELAY, [0], 4) = 0
>>>>>>> fcntl(2076, F_GETFL) = 0x2 (flags O_RDWR)
>>>>>>> fcntl(2076, F_SETFL, O_ACCMODE|O_NONBLOCK) = 0
>>>>>>> connect(2076, {sa_family=AF_INET, sin_port=htons(33012), sin_addr=inet_addr("10.1.1.5")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address)
>>>>>>> ....
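For reference, each iteration of that trace boils down to the following client-side setup (a
minimal sketch reconstructed from the strace above; the device name, addresses, and port are
the ones shown in the trace, and all error handling is omitted):

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* One client socket as in the strace: bind to a specific device and
 * source IP with an ephemeral port, then start a non-blocking connect.
 * A sketch only -- every call here should be error-checked in real code. */
static int open_client(const char *dev, const char *src_ip,
                       const char *dst_ip, unsigned short dst_port)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    int one = 1;
    struct sockaddr_in sa;

    /* Send-to-self setup: force traffic out one specific interface even
     * though both ports sit on the same subnet (needs CAP_NET_RAW). */
    setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev) + 1);
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    /* bind() with sin_port = 0: the kernel reserves an ephemeral port
     * here -- the step whose behavior the bisected commit changed. */
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(0);
    inet_pton(AF_INET, src_ip, &sa.sin_addr);
    bind(fd, (struct sockaddr *)&sa, sizeof(sa));

    /* Non-blocking, as in the fcntl() calls above. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);

    sa.sin_port = htons(dst_port);
    inet_pton(AF_INET, dst_ip, &sa.sin_addr);
    connect(fd, (struct sockaddr *)&sa, sizeof(sa)); /* EINPROGRESS expected */
    return fd;
}

Called as open_client("eth4", "10.1.1.4", "10.1.1.5", 33012), this matches the trace: the
early sockets get EINPROGRESS from connect(), while the failing ones get EADDRNOTAVAIL.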
>>>>>>>
>>>>>>> ea8add2b190395408b22a9127bed2c0912aecbc8 is the first bad commit
>>>>>>> commit ea8add2b190395408b22a9127bed2c0912aecbc8
>>>>>>> Author: Eric Dumazet <edumazet@...gle.com>
>>>>>>> Date:   Thu Feb 11 16:28:50 2016 -0800
>>>>>>>
>>>>>>>     tcp/dccp: better use of ephemeral ports in bind()
>>>>>>>
>>>>>>>     Implement strategy used in __inet_hash_connect() in opposite way :
>>>>>>>
>>>>>>>     Try to find a candidate using odd ports, then fallback to even ports.
>>>>>>>
>>>>>>>     We no longer disable BH for whole traversal, but one bucket at a time.
>>>>>>>     We also use cond_resched() to yield cpu to other tasks if needed.
>>>>>>>
>>>>>>>     I removed one indentation level and tried to mirror the loop we have
>>>>>>>     in __inet_hash_connect() and variable names to ease code maintenance.
>>>>>>>
>>>>>>>     Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>>>>>>>     Signed-off-by: David S. Miller <davem@...emloft.net>
>>>>>>>
>>>>>>> :040000 040000 3af4595c6eb6d331e1cba78a142d44e00f710d81 e0c014ae8b7e2867256eff60f6210821d36eacef M  net
>>>>>>>
>>>>>>> I will be happy to test patches or try to get any other results that might help diagnose
>>>>>>> this problem better.
>>>>>>
>>>>>> The problem is that I do not see anything obvious here.
>>>>>>
>>>>>> Please provide /proc/sys/net/ipv4/ip_local_port_range
>>>>>
>>>>> [root@...003-e3v2-13100124-f20x64 ~]# cat /proc/sys/net/ipv4/ip_local_port_range
>>>>> 10000   61001
>>>>>
>>>>>> Also, you could probably use the IP_BIND_ADDRESS_NO_PORT socket option
>>>>>> before the bind().
>>>>>
>>>>> I'll read up on that to see what it does...
>>>>
>>>> man 7 ip
>>>>
>>>>        IP_BIND_ADDRESS_NO_PORT (since Linux 4.2)
>>>>               Inform the kernel to not reserve an ephemeral port when using
>>>>               bind(2) with a port number of 0. The port will later be
>>>>               automatically chosen at connect(2) time, in a way that allows
>>>>               sharing a source port as long as the 4-tuple is unique.
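Concretely, relative to the open_client() sketch earlier on this page, the suggestion amounts
to one extra setsockopt() before the bind(). Again a sketch, not a tested patch; the fallback
#define covers pre-4.2 userspace headers, and the constant's value comes from <linux/in.h>:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24   /* from <linux/in.h>, kernel >= 4.2 */
#endif

/* Replacement for the bind step of the open_client() sketch above: ask
 * the kernel NOT to reserve an ephemeral port at bind() time. The port
 * is picked at connect() time instead, so one source port can be shared
 * by many sockets as long as each 4-tuple stays unique. */
static void bind_src_no_port(int fd, const char *src_ip)
{
    int one = 1;
    struct sockaddr_in sa;

    setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &one, sizeof(one));

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(0);   /* port 0 is what makes the option apply */
    inet_pton(AF_INET, src_ip, &sa.sin_addr);
    bind(fd, (struct sockaddr *)&sa, sizeof(sa));
}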
>>>
>>> Yes, I found that.
>>>
>>> It appears this option works well for my case, and I see 30k connections across my pair of
>>> e1000e ports (though the NIC is retching again, so I guess its issues are not fully resolved).
>>>
>>> I tested this on my 4.13.16+ kernel.
>>>
>>> That said, maybe there is still some issue with the patch I bisected to, so if you have
>>> other suggestions, I can back out this IP_BIND_ADDRESS_NO_PORT feature and re-test.
>>>
>>> Also, I had to increase /proc/sys/net/ipv4/tcp_mem to get 30k connections to work without
>>> the kernel spamming:
>>>
>>> Jan 23 15:02:41 lf1003-e3v2-13100124-f20x64 kernel: TCP: out of memory -- consider tuning tcp_mem
>>> Jan 23 15:02:41 lf1003-e3v2-13100124-f20x64 kernel: TCP: out of memory -- consider tuning tcp_mem
>>>
>>> This is a 16 GB RAM system, and I did not have to tune this on the 4.5.0-rc2+ (good) kernels
>>> to get similar performance. I was testing on ixgbe there, though, so possibly that is part
>>> of it; or maybe I just need to force tcp_mem to be larger on more recent kernels?
>>
>> Since linux-4.2, the tcp_mem[0,1,2] defaults are 4.68%, 6.25%, and 9.37% of
>> physical memory.
>>
>> It used to be twice that in older kernels.
>>
>> It is also possible that some change in TCP congestion control or autotuning
>> allows each of your TCP flows to store more data in its write queue,
>> if your application is pushing bulk data as fast as it can.
>>
>> It is virtually impossible to change anything in the kernel with zero impact
>> on very pathological use cases.
>
> Yes, but pathological use cases may also uncover a real issue that normal
> users will not often notice... Based on the commit message, it seems you
> expected no real regressions with that patch, but at least in my case I see
> large ones, so something might be off with it.
>
>> tcp_wmem[2] is 4MB.
>>
>> 30,000 * 4MB = 120 GB
>>
>> Definitely more than your physical memory.
>
> I'll spend some time looking at the tcp_mem issue now that I have a work-around for
> the many-connection issue...
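For scale, that arithmetic works out as follows on the 16 GB machine discussed here (a rough
back-of-the-envelope sketch; the percentages and the tcp_wmem[2] figure are the ones quoted
above, and tcp_mem is accounted in pages, assumed 4 KB):

#include <stdio.h>

/* Back-of-the-envelope for the numbers above: tcp_mem is accounted in
 * pages, and since linux-4.2 the defaults are roughly 4.68% / 6.25% /
 * 9.37% of physical memory. 16 GB is the machine from this thread. */
int main(void)
{
    long long pages = 16LL * 1024 * 1024 * 1024 / 4096;  /* 4,194,304 pages */

    printf("tcp_mem defaults (pages): min=%lld pressure=%lld max=%lld\n",
           (long long)(pages * 0.0468),    /* ~196k pages ~= 0.75 GB */
           (long long)(pages * 0.0625),    /* ~262k pages ~= 1.0 GB  */
           (long long)(pages * 0.0937));   /* ~393k pages ~= 1.5 GB  */

    /* Worst-case write-queue demand: 30,000 sockets, each allowed up to
     * tcp_wmem[2] = 4 MB, is ~120 GB -- far beyond what tcp_mem permits. */
    printf("worst-case wmem: %lld GB\n", 30000LL * 4 / 1024);
    return 0;
}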
Looks like when I use the ixgbe ports, I can do 70k connections (using 2 physical ports, plus
two mac-vlans on each to ensure no more than 30k connections per IP pair). It runs solid, with
3 GB of RAM free and no tcp_mem warnings.

e1000e has lots of tx-hangs, and it seems that exacerbates TCP memory pressure.

So it seems I'm good to go on this as long as I stay away from e1000e.

Thanks,
Ben

--
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com