netdev - Re: Performance regression on kernels 3.10 and newer


Date:	Fri, 15 Aug 2014 10:15:15 -0700
From:	Alexander Duyck <alexander.h.duyck@...el.com>
To:	David Miller <davem@...emloft.net>
CC:	eric.dumazet@...il.com, netdev@...r.kernel.org
Subject: Re: Performance regression on kernels 3.10 and newer

On 08/14/2014 04:20 PM, David Miller wrote:
> From: Alexander Duyck <alexander.h.duyck@...el.com>
> Date: Thu, 14 Aug 2014 16:16:36 -0700
> 
>> Are you sure about each socket having it's own DST?  Everything I see
>> seems to indicate it is somehow associated with IP.
> 
> Right it should be, unless you have exception entries created by path
> MTU or redirects.
> 
> WRT prequeue, it does the right thing for dumb apps that block in
> receive.  But because it causes the packet to cross domains as it
> does, we can't do a lot of tricks which we normally can do, and that's
> why the refcounting on the dst is there now.
> 
> Perhaps we can find a clever way to elide that refcount, who knows.

Actually I would consider the refcount issue just the final nail in the
coffin here.  It seems like there are multiple issues that have been
there for some time, and they are just getting worse with the refcount
change in 3.10.

With the prequeue disabled what happens is that the frames are making it
up and hitting tcp_rcv_established before being pushed into the backlog
queues and coalesced there.  I believe the lack of coalescing on the
prequeue path is one of the reasons why it is twice as expensive as the
non-prequeue path CPU-wise, even if you eliminate the refcount issue.

I realize most of my data is anecdotal as I only have the ixgbe/igb
adapters and netperf to work with.  This is one of the reasons why I
keep asking if someone can tell me what the use case is for this where
it performs well.  From what I can tell it might have had some value
back in the day before the introduction of things such as RPS/RFS where
some of the socket processing would be offloaded to other CPUs for a
single queue device, but even that use case is now deprecated since
RPS/RFS are there and function better than this.  What I am basically
looking for is a way to weigh the gains against the penalties to
determine if this code is even viable anymore.

In the meantime I think I will put together a patch to default
tcp_low_latency to 1 for net and stable, and if we cannot find a good
reason for keeping it then I can submit a patch to net-next that will
strip it out since I don't see any benefit to having this code.
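
For anyone who wants to compare the two receive paths on their own
hardware, the knob can be flipped at run time.  Below is a minimal
sketch of doing that from C through procfs.  The helper names
read_sysctl_int() and write_sysctl_int() are made up for illustration,
and the path assumes the usual procfs location of the
net.ipv4.tcp_low_latency sysctl.

```c
#include <stdio.h>

/* Assumed procfs location of the net.ipv4.tcp_low_latency sysctl. */
#define TCP_LOW_LATENCY_PATH "/proc/sys/net/ipv4/tcp_low_latency"

/*
 * Hypothetical helper: read an integer from a procfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_sysctl_int(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: write an integer to a procfs-style file.
 * Writing the real sysctl file requires root.
 * Returns 0 on success, -1 on any open/write/close failure.
 */
int write_sysctl_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling write_sysctl_int(TCP_LOW_LATENCY_PATH, 1) as root should have
the same effect as setting the sysctl to 1 by hand, which is the
setting that keeps the prequeue out of the receive path.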

Thanks,

Alex
