netdev - Re: Performance regression on kernels 3.10 and newer


Message-ID: <1408041962.6804.31.camel@edumazet-glaptop2.roam.corp.google.com>
Date:	Thu, 14 Aug 2014 11:46:02 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Alexander Duyck <alexander.h.duyck@...el.com>
Cc:	David Miller <davem@...emloft.net>, netdev <netdev@...r.kernel.org>
Subject: Re: Performance regression on kernels 3.10 and newer

On Thu, 2014-08-14 at 11:19 -0700, Alexander Duyck wrote:
> Yesterday I tripped over a bit of an issue and it seems like we are
> seeing significant cache thrash on kernels 3.10 and newer when running
> multiple streams of small packet stress on multiple NUMA nodes for a
> single NIC.
> 
> I did some bisection and found that I was able to trace it back to
> upstream commit 093162553c33e9479283e107b4431378271c735d (tcp: force a
> dst refcount when prequeue packet).
> 
> Recreating this issue is pretty straightforward.  All I did was set up 2
> dual socket Xeon systems connected back to back with ixgbe and ran the
> following script after disabling tcp_autocork on the transmitting system:
>   for i in `seq 0 19`
>   do
>     for j in `seq 0 2`
>     do
>       netperf -H 192.168.10.1 -t TCP_STREAM \
>               -l 10 -c -C -T $i,$i -P 0 -- \
>               -m 64 -s 64K -D
>     done
>   done
> 
> The current net tree as-is will give me about 2Gb/s of data w/ 100% CPU
> utilization on the receiving system, and with the patch above reverted
> on that system it gives me about 4Gb/s with only 21% CPU utilization.
> If I set tcp_low_latency=1 I can get the CPU utilization down to about
> 12% on the same test with about 4Gb/s of throughput.
> 
> I'm still working on determining the exact root cause but it looks to me
> like there is some significant cache thrash going on in regards to the
> dst entries.
> 
> Below is a quick breakdown of the top CPU users for tcp_low_latency
> on/off using perf top:
> 
> tcp_low_latency = 0
>
> tcp_low_latency = 1
>
> Any input/advice on where I should look or patches to possibly test
> would be appreciated.


I believe you answered your own question: prequeue mode does not work
very well when one host has hundreds of active TCP flows to another.

In real life, applications do not use prequeue, because nobody wants one
thread per flow.

Each socket has its own dst now that the route cache has been removed,
but if your netperf migrates to another cpu (and NUMA node), we do not
detect that the dst should be re-created on the new NUMA node.

But really, I am not sure we want to care about prequeue, as modern
applications use epoll()/poll()/select() instead of blocking on
recvmsg().
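
For illustration only, here is a minimal sketch of the readiness-driven
pattern referred to above: one thread waits on many sockets and reads
only from those that are ready, rather than keeping one thread per flow
blocked in recvmsg(). It is not taken from netperf or from any kernel
code. The names drain() and demo() are hypothetical, and local
socketpair() sockets stand in for the TCP flows, so it shows the
application-side structure and says nothing about prequeue behaviour
itself.

```python
import selectors
import socket


def drain(socks, total_expected):
    """Read from many sockets in one thread using readiness notification.

    Returns a dict mapping each socket's file descriptor to the bytes read.
    """
    sel = selectors.DefaultSelector()  # picks epoll on Linux when available
    received = {s.fileno(): b"" for s in socks}
    for s in socks:
        s.setblocking(False)  # never block inside recv()
        sel.register(s, selectors.EVENT_READ)

    remaining = total_expected
    while remaining > 0:
        events = sel.select(timeout=1.0)
        if not events:
            break  # nothing became readable in time; stop waiting
        for key, _mask in events:
            try:
                data = key.fileobj.recv(4096)
            except BlockingIOError:
                continue  # spurious wakeup; retry on the next select()
            if not data:
                sel.unregister(key.fileobj)  # peer closed this flow
                continue
            received[key.fd] += data
            remaining -= len(data)

    sel.close()
    return received


def demo(n_flows=4, payload=b"x" * 64):
    """Send one small message per flow and collect them all in one thread."""
    # Assumption: local socket pairs stand in for the TCP connections.
    pairs = [socket.socketpair() for _ in range(n_flows)]
    for tx, _rx in pairs:
        tx.sendall(payload)  # 64 bytes by default, like netperf -m 64
    got = drain([rx for _tx, rx in pairs], n_flows * len(payload))
    sizes = sorted(len(chunk) for chunk in got.values())
    for tx, rx in pairs:
        tx.close()
        rx.close()
    return sizes


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the number of threads is independent of
the number of flows, and no thread ever sits blocked in a receive call
on a single socket.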


