Date:   Fri, 16 Mar 2018 14:05:00 -0700
From:   Matthew Wilcox <willy@...radead.org>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Alexander Duyck <alexander.h.duyck@...el.com>,
        linux-mm@...r.kernel.org, Netdev <netdev@...r.kernel.org>,
        Matthew Wilcox <mawilcox@...rosoft.com>
Subject: Re: [RFC 2/2] page_frag_cache: Store metadata in struct page

On Thu, Mar 15, 2018 at 02:26:28PM -0700, Alexander Duyck wrote:
> On Thu, Mar 15, 2018 at 12:53 PM, Matthew Wilcox <willy@...radead.org> wrote:
> > From: Matthew Wilcox <mawilcox@...rosoft.com>
> >
> > Shrink page_frag_cache from 24 to 8 bytes (a single pointer to the
> > currently-in-use struct page) by using the page's refcount directly
> > (instead of maintaining a bias) and storing our current progress through
> > the page in the same bits currently used for page->index.  We no longer
> > need to reflect the page pfmemalloc state if we're storing the page
> > directly.
> >
> > On the downside, we now call page_address() on every allocation, and we
> > do an atomic_inc() rather than a non-atomic decrement, but we should
> > touch the same number of cachelines and there is far less code (and
> > the code is less complex).
> >
> > Signed-off-by: Matthew Wilcox <mawilcox@...rosoft.com>
> 
> So my concern with this patch is that it is essentially trading off
> CPU performance for reduced size. One of the reasons for getting away
> from using the page pointer is that it is expensive to access the
> page when the ref_count is bouncing on multiple cache lines.

I'm not really interested in trading off SMP scalability for reduced
memory usage; if we see more than a trivial penalty then I'll withdraw
this RFC.  I read & understand the commits that you made to get us to
the current codebase, but there's no history of this particular variation
in the tree.

0e39250845c0f91acc64264709b25f7f9b85c2c3
9451980a6646ed487efce04a9df28f450935683e
540eb7bf0bbedb65277d68ab89ae43cdec3fd6ba

I think that by moving the 'offset' into the struct page, we end up updating
only one cacheline instead of two, which should be a performance win.
I understand your concern about the cacheline bouncing between the freeing
and allocating CPUs.  Is cross-CPU freeing a frequent occurrence?  From
looking at its current usage, it seemed like the allocation and freeing
were usually on the same CPU.
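
For concreteness, here's a sketch of the before/after layouts (the
"before" fields are paraphrased from the existing page_frag_cache, so
the names are approximate; the "after" is just what the patch
description says):

/* Before: ~24 bytes; the offset/bias bookkeeping lives here while
 * the refcount lives over in struct page. */
struct page_frag_cache {
	void		*va;		/* mapped address of the page */
	__u32		offset;		/* bump pointer within the page */
	unsigned int	pagecnt_bias;	/* amortises refcount updates */
	bool		pfmemalloc;	/* copied from the page */
};

/* After: a single pointer; the offset moves into page->index and
 * the page's refcount is used directly. */
struct page_frag_cache {
	struct page	*page;
};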

> In
> addition converting from a page to a virtual address is actually more
> expensive than you would think it should be. I went through and
> optimized that the best I could, but there is still a pretty
> significant cost to the call.

Were you able to break down where that cost occurred?  It looks
pretty cheap on x86-32 nohighmem:

    343e:       a1 00 00 00 00          mov    0x0,%eax
                        343f: R_386_32  mem_map
    3443:       29 f3                   sub    %esi,%ebx
    3445:       89 5a 08                mov    %ebx,0x8(%edx)
    3448:       29 c2                   sub    %eax,%edx
    344a:       c1 fa 05                sar    $0x5,%edx
    344d:       c1 e2 0c                shl    $0xc,%edx
    3450:       8d 84 13 00 00 00 c0    lea    -0x40000000(%ebx,%edx,1),%eax
    3457:       8b 5d f8                mov    -0x8(%ebp),%ebx
    345a:       8b 75 fc                mov    -0x4(%ebp),%esi
    345d:       89 ec                   mov    %ebp,%esp
    345f:       5d                      pop    %ebp
    3460:       c3                      ret    
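
(In C terms, that sequence is roughly the following -- a paraphrase of
the flatmem page_to_virt()/__va() arithmetic, not the exact source.
The sar $0x5 is the pointer subtraction against mem_map with
sizeof(struct page) == 32, the shl $0xc is PAGE_SHIFT, and the lea's
-0x40000000 is PAGE_OFFSET (0xc0000000) taken mod 2^32, with the
"+ offset" folded into the same instruction.)

static void *page_to_virt_sketch(struct page *page)
{
	unsigned long pfn = page - mem_map;	/* page_to_pfn() */

	return (void *)(PAGE_OFFSET + (pfn << PAGE_SHIFT));
}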

I did code up a variant which stored the virtual address corresponding
to the offset in page->index, but I threw that away after seeing how
much more code it turned into.  Fortunately I had a better idea of how
to implement it early this morning.  I haven't even tested it yet, but
the generated code looks better:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 28d9a4c9c5fd..ec38204eb903 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4357,8 +4357,8 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 		unsigned int offset = PAGE_SIZE << compound_order(old);
 
 		if (offset > size) {
-			old->offset = offset;
-			return old;
+			page = old;
+			goto reset;
 		}
 	}
 
@@ -4379,7 +4379,10 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	if (old)
 		put_page(old);
 	nc->page = page;
-	page->offset = PAGE_SIZE << compound_order(page);
+	page->private = (PAGE_SIZE << compound_order(page)) - 1;
+reset:
+	page->index = (unsigned long)page_to_virt(page) +
+					(PAGE_SIZE << compound_order(page));
 	return page;
 }
 
@@ -4402,20 +4405,26 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		      unsigned int size, gfp_t gfp_mask)
 {
 	struct page *page = nc->page;
-	unsigned int offset = page->offset;
+	unsigned long addr = addr;
+	unsigned int offset = 0;
 
-	if (unlikely(!page || offset < size)) {
+	if (page) {
+		addr = page->index;
+		offset = addr & page->private;
+	}
+
+	if (unlikely(offset < size)) {
 		page = __page_frag_cache_refill(nc, size, gfp_mask);
 		if (!page)
 			return NULL;
-		offset = page->offset;
+		addr = page->index;
 	}
 
 	page_ref_inc(page);
-	offset -= size;
-	page->offset = offset;
+	addr -= size;
+	page->index = addr;
 
-	return page_address(page) + offset;
+	return (void *)addr;
 }
 EXPORT_SYMBOL(page_frag_alloc);
 

Obviously if the problem turns out to be the cacheline thrashing rather
than the call to page_to_virt, then this is pointless to test.
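
(The trick in the diff above: a compound page's base address is
aligned to its size, so 'addr & page->private' recovers the old
'offset' -- the bytes still free below the bump pointer -- from the
stored virtual address alone.  A toy standalone illustration, with
invented numbers:)

#include <stdio.h>

int main(void)
{
	unsigned long mask = (4096UL << 3) - 1;	/* order-3 page: 32KiB - 1 */
	unsigned long base = 0x10000000;	/* 32KiB-aligned base address */
	unsigned long addr = base + 0x7000;	/* current bump pointer */

	/* Prints "offset = 0x7000": the space remaining below addr,
	 * i.e. the old page->offset, with no second field needed. */
	printf("offset = %#lx\n", addr & mask);
	return 0;
}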

> I won't be able to test the patches until next week, but I expect I
> will probably see a noticeable regression when performing a small
> packet routing test.

I really appreciate you being willing to try this for me.  I need to
get myself a dual-socket machine to test things like this.
