netdev - Re: FlameGraph of mlx4 early drop with order-0 pages

Message-ID: <20160417132357.GB11792@techsingularity.net>
Date:	Sun, 17 Apr 2016 14:23:57 +0100
From:	Mel Gorman <mgorman@...hsingularity.net>
To:	Jesper Dangaard Brouer <brouer@...hat.com>
Cc:	linux-mm <linux-mm@...ck.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Brenden Blanco <bblanco@...mgrid.com>, tom@...bertland.com,
	alexei.starovoitov@...il.com, ogerlitz@...lanox.com,
	daniel@...earbox.net, eric.dumazet@...il.com, ecree@...arflare.com,
	john.fastabend@...il.com, tgraf@...g.ch, johannes@...solutions.net,
	eranlinuxmellanox@...il.com
Subject: Re: FlameGraph of mlx4 early drop with order-0 pages

On Fri, Apr 15, 2016 at 09:40:34PM +0200, Jesper Dangaard Brouer wrote:
> Hi Mel,
> 
> I did an experiment that you might find interesting.  Using Brenden's
> early drop with eBPF in the mlx4 driver.  I changed the mlx4 driver to
> use order-0 pages.  It usually uses order-3 pages to amortize the cost
> of calling the page allocator (which is problematic for other reasons,
> like memory pin-down, latency spikes and multi CPU scalability).
> 
> With this change I could do around 12Mpps (million packets per second)
> drops; it usually does 14.5Mpps (limited due to a HW setup/limit, with
> idle cycles).
> 
> Looking at the perf report as a FlameGraph, the page allocator clearly
> shows up as the bottleneck:
> 

Yeah, it's very obvious there. You didn't say if this had the optimisations
included or not but it doesn't really matter. Even halving the cost would
still be a lot.

FWIW, the latest series included an optimisation around the debugging
check. I also have an extreme patch that creates a special fast path for
order-0 pages only when there is plenty of free memory. It halved the
cost of the allocation side even on top of the current optimisations. I'm
not super-happy with it though as it duplicates some code and it requires
node-lru to be merged. Right now, node-lru is colliding very badly with
what's in mmotm so there is legwork required.
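
For readers following along, the shape of such a fast path can be shown
with a toy userspace model. This is NOT the patch described above: the
struct, the function names and the single "plenty of free memory"
threshold are all made up for illustration, and the slow path merely
stands in for the watermark checks, debug checks and fallbacks of the
real allocator.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11	/* assumed upper bound, illustration only */

/* Toy model only: none of these names exist in the kernel. */
struct toy_zone {
	unsigned long free_pages;	/* base pages currently free */
	unsigned long high_wmark;	/* assumed "plenty of memory" threshold */
	unsigned long fast_hits;	/* allocations served by the fast path */
	unsigned long slow_hits;	/* allocations served by the full path */
};

/* Full path: stands in for all the checks the real allocator performs. */
static bool toy_alloc_slow(struct toy_zone *z, unsigned int order)
{
	unsigned long need;

	assert(order < TOY_MAX_ORDER);
	need = 1UL << order;
	if (z->free_pages < need)
		return false;
	z->free_pages -= need;
	z->slow_hits++;
	return true;
}

/* Order-0 requests skip the full path while memory is plentiful. */
static bool toy_alloc(struct toy_zone *z, unsigned int order)
{
	if (order == 0 && z->free_pages > z->high_wmark) {
		z->free_pages--;
		z->fast_hits++;
		return true;
	}
	return toy_alloc_slow(z, order);
}
```

Any higher-order request, or an order-0 request once free memory is at
or below the threshold, falls through to the full path, which is why a
scheme like this ends up duplicating some of the allocation code.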

I also prototyped something that caches high-order pages on the per-cpu
lists on the flight over. It is at the "it builds so it must be ok"
stage. It's a horrible hack and the accounting is questionable but
something like it may be justified for SLUB even if network drivers move
away from high-order pages.
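
The general idea of keeping high-order blocks on a per-cpu list can be
modelled in a few lines of userspace C. Again, this is only a sketch
under stated assumptions and not the prototype mentioned above: the
names, the number of cached orders and the batch size are hypothetical,
and the refill function stands in for taking the zone lock and pulling
blocks from the shared free lists.

```c
#include <assert.h>

#define TOY_NR_ORDERS	4	/* orders 0..3 cached, assumed */
#define TOY_BATCH	4	/* blocks pulled per refill, assumed */

/* Toy per-cpu cache: one counter per order instead of real page lists. */
struct toy_pcp {
	unsigned int count[TOY_NR_ORDERS];	/* cached blocks per order */
	unsigned long refills;			/* trips to the shared allocator */
};

/* Stands in for the locked, expensive path into the shared free lists. */
static void toy_refill(struct toy_pcp *pcp, unsigned int order)
{
	pcp->count[order] += TOY_BATCH;
	pcp->refills++;
}

/* Hand out one block of the given order; returns its size in base pages. */
static unsigned long toy_pcp_alloc(struct toy_pcp *pcp, unsigned int order)
{
	assert(order < TOY_NR_ORDERS);
	if (pcp->count[order] == 0)
		toy_refill(pcp, order);
	pcp->count[order]--;
	return 1UL << order;
}

/* Freed blocks go back on the per-cpu list so the next alloc is cheap. */
static void toy_pcp_free(struct toy_pcp *pcp, unsigned int order)
{
	assert(order < TOY_NR_ORDERS);
	pcp->count[order]++;
}
```

In this model eight order-3 allocations cost two refills rather than
eight trips to the shared allocator, and a free followed by an
allocation of the same order costs none. The model ignores draining and
limits on how much memory sits idle per CPU, which is where the
accounting concerns would come in.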

> Signing off, heading for the plane soon... see you at MM-summit!

Indeed and we'll slap some sort of plan together. If there is a slot free,
we might spend 15-30 minutes on it. Failing that, we'll grab a table
somewhere. We'll see how far we can get before considering a page-recycle
layer that preserves cache coherent state.

-- 
Mel Gorman
SUSE Labs