Message-ID: <1372022003.3301.47.camel@edumazet-glaptop>
Date: Sun, 23 Jun 2013 14:13:23 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Or Gerlitz <or.gerlitz@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
Or Gerlitz <ogerlitz@...lanox.com>,
Eugenia Emantayev <eugenia@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>
Subject: Re: [PATCH net-next] mlx4: allow order-0 memory allocations in RX
path
On Sun, 2013-06-23 at 23:17 +0300, Or Gerlitz wrote:
> On Sun, Jun 23, 2013 at 6:17 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> >
> > mlx4 exclusively uses order-2 allocations in RX path, which are
> > likely to fail under memory pressure.
> >
> > We therefore drop frames more often than needed.
> >
> > This patch tries order-3, order-2, order-1 and finally order-0
> > allocations to keep good performance, yet allow allocations if/when
> > memory gets fragmented.
> >
> > By using larger pages, and avoiding unnecessary get_page()/put_page()
> > on compound pages, this patch improves performance as well, lowering
> > false sharing on struct page.
>
> Hi Eric, thanks for the patch. Both Amir and Yevgeny are OOO, so it
> will take us a bit more time to conduct the review... but let's start:
> could you explain a little further what exactly you refer to by
> "false sharing" in this context?
Every time mlx4 prepared a page frag into a skb, it did:
- a get_page() in mlx4_en_alloc_frags()
- a get_page() in mlx4_en_complete_rx_desc()
- a put_page() in mlx4_en_free_frag()
-> lots of changes to page->_count

When this skb is consumed, the frag is freed -> put_page()
-> another decrement of page->_count

If the consumer is on a different cpu, this adds false sharing on
"struct page".

After my patch, the mlx4 driver touches this "struct page" only once,
and the consumers do their get_page() without being slowed down by the
mlx4 driver/cpu. This reduces latencies.
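Schematically, the refcount traffic looks like this (a userspace sketch
with a stub refcount; the function names and the exact batching below are
illustrative, not the literal patch code -- the idea is simply that the
driver charges page->_count once for all frags instead of touching it per
frag):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stub standing in for struct page; _count mimics the page refcount. */
struct page_stub {
	atomic_int _count;
};

/* Old scheme: per frag, two get_page() and one put_page() on the driver
 * side.  Returns the number of atomic ops performed on page->_count. */
static int refcount_per_frag(struct page_stub *p, int nfrags)
{
	int ops = 0;

	for (int i = 0; i < nfrags; i++) {
		atomic_fetch_add(&p->_count, 1); ops++; /* mlx4_en_alloc_frags() */
		atomic_fetch_add(&p->_count, 1); ops++; /* mlx4_en_complete_rx_desc() */
		atomic_fetch_sub(&p->_count, 1); ops++; /* mlx4_en_free_frag() */
	}
	return ops;
}

/* New scheme: the driver takes references for all frags it will hand out
 * in one shot; consumers still drop their references one by one, but on
 * their own cpus, without bouncing the cache line back to the driver. */
static int refcount_batched(struct page_stub *p, int nfrags)
{
	atomic_fetch_add(&p->_count, nfrags); /* single driver-side touch */
	return 1;
}
```

With 21 frags per order-3 page, the old scheme performs 63 driver-side
atomic ops on the same cache line; the batched one performs a single op.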
>
> Also, I am not fully sure, but I think the current driver code doesn't
> support splice, and this somehow relates to how RX skbs are spread over
> pages. In that respect, I wonder whether this patch goes in a direction
> that would allow supporting splice, or maybe takes us a bit back, given
> the move to order-3 allocations?
splice is supported by core networking, no worries ;)
It doesn't depend on order-whatever allocations.
BTW, splice() works well for TCP over loopback, and TX already uses
fragments in order-3 pages.
>
> You've mentioned a performance improvement, could you be more specific?
> What's the scheme under which you saw the improvement, and what was
> the improvement?
A cpu might be fully dedicated to softirq handling, with skbs consumed on
other cpus.
My patch removes ~60 atomic operations per allocated page
(21 frags, and for each frag, two get_page() and one put_page()).
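The "~60" figure is just the frag count times the per-frag cost:

```c
#include <assert.h>

/* 21 frags per order-3 page; each frag formerly cost two get_page()
 * and one put_page() in the driver. */
enum { FRAGS_PER_PAGE = 21, OPS_PER_FRAG = 2 + 1 };

static int atomic_ops_removed(void)
{
	return FRAGS_PER_PAGE * OPS_PER_FRAG; /* 63, i.e. roughly 60 */
}
```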
>
> Last, as Amir wrote you, we're looking at re-using skbs on the RX
> path to avoid severe performance hits when IOMMU is enabled. The team
> has not provided me the patch yet, but basically, if you look at the
> ixgbe patch that was made largely for that very same purpose
> (improving perf under IOMMU), f800326dca7bc158f4c886aa92f222de37993c80
> "ixgbe: Replace standard receive path with a page based receive",
> they use order-0 or order-1 allocations there, but not order-2 or
> order-3. Here too I have some more catching up to do, so we'll
> see...
ixgbe does not support a frag_size of 1536 bytes, only 2048 or 4096
bytes, so using order-3 pages is no win for it.
But for mlx4, we gain 5% occupancy using order-3 pages (21 frags per
32K) over order-2 pages (10 frags per 16K), and 30% over order-0 pages
(2 frags per 4K).
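The occupancy numbers fall out of simple division (userspace sketch,
assuming a 4K base page; helper names are mine, not driver code):

```c
#include <assert.h>
#include <stddef.h>

/* Frags of 1536 bytes packed into a page of the given order.
 * Occupancy = bytes handed out as frags / bytes allocated. */
enum { PAGE_SZ = 4096, FRAG_SZ = 1536 };

static int frags_per_page(int order)
{
	return ((size_t)PAGE_SZ << order) / FRAG_SZ;
}

static double occupancy(int order)
{
	size_t total = (size_t)PAGE_SZ << order;

	return (double)(frags_per_page(order) * FRAG_SZ) / (double)total;
}
```

This gives 21 frags at 98.4% occupancy for order-3, 10 frags at 93.75%
for order-2 (the ~5% gap), and 2 frags at 75% for order-0 (roughly 30%
less usable space per allocated byte than order-3, relatively speaking).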
Frankly, the current mlx4 driver is barely usable as is, unless you
make sure the host has enough memory, with plenty of order-2 pages.
And unless you run really specialized applications, there is never
enough memory.
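The try-order-3-then-fall-back strategy from the patch description can
be sketched like this (userspace stand-in: malloc() takes the place of
the kernel's alloc_pages(), and alloc_rx_page() is a hypothetical name,
not the driver's):

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SZ 4096 /* 4K base page assumed */

/* Try order-3 first for best occupancy, then fall back to order-2,
 * order-1 and finally order-0 when memory is fragmented. */
static void *alloc_rx_page(int max_order, int *order_out)
{
	for (int order = max_order; order >= 0; order--) {
		void *page = malloc((size_t)PAGE_SZ << order);

		if (page) {
			*order_out = order; /* caller slices it into 1536-byte frags */
			return page;
		}
	}
	return NULL; /* even an order-0 page was unavailable */
}
```

In the kernel the higher-order attempts would typically add __GFP_NOWARN,
since their failure is expected and handled by the fallback.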