Message-ID: <20180316210500.GH27498@bombadil.infradead.org>
Date: Fri, 16 Mar 2018 14:05:00 -0700
From: Matthew Wilcox <willy@...radead.org>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: Alexander Duyck <alexander.h.duyck@...el.com>,
linux-mm@...r.kernel.org, Netdev <netdev@...r.kernel.org>,
Matthew Wilcox <mawilcox@...rosoft.com>
Subject: Re: [RFC 2/2] page_frag_cache: Store metadata in struct page

On Thu, Mar 15, 2018 at 02:26:28PM -0700, Alexander Duyck wrote:
> On Thu, Mar 15, 2018 at 12:53 PM, Matthew Wilcox <willy@...radead.org> wrote:
> > From: Matthew Wilcox <mawilcox@...rosoft.com>
> >
> > Shrink page_frag_cache from 24 to 8 bytes (a single pointer to the
> > currently-in-use struct page) by using the page's refcount directly
> > (instead of maintaining a bias) and storing our current progress through
> > the page in the same bits currently used for page->index. We no longer
> > need to reflect the page pfmemalloc state if we're storing the page
> > directly.
> >
> > On the downside, we now call page_address() on every allocation, and we
> > do an atomic_inc() rather than a non-atomic decrement, but we should
> > touch the same number of cachelines and there is far less code (and
> > the code is less complex).
> >
> > Signed-off-by: Matthew Wilcox <mawilcox@...rosoft.com>
>
> So my concern with this patch is that it is essentially trading off
> CPU performance for reduced size. One of the reasons for getting away
> from using the page pointer is because it is expensive to access the
> page when the ref_count is bouncing on multiple cache lines.

I'm not really interested in trading off SMP scalability for reduced
memory usage; if we see more than a trivial penalty then I'll withdraw
this RFC. I read & understand the commits that you made to get us to
the current codebase, but there's no history of this particular variation
in the tree.
0e39250845c0f91acc64264709b25f7f9b85c2c3
9451980a6646ed487efce04a9df28f450935683e
540eb7bf0bbedb65277d68ab89ae43cdec3fd6ba
I think that by moving the 'offset' into the struct page, we end up updating
only one cacheline instead of two, which should be a performance win.
I understand your concern about the cacheline bouncing between the freeing
and allocating CPUs. Is cross-CPU freeing a frequent occurrence? From
looking at its current usage, it seemed like the allocation and freeing
were usually on the same CPU.
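
For reference, here's roughly the before/after I have in mind (the
"before" layout is simplified from memory of mm_types.h rather than
pasted verbatim, and the struct names are only for illustration):

/* today: about 24 bytes on 64-bit */
struct page_frag_cache_before {
        void *va;                       /* mapped address of the current page */
        __u16 offset;                   /* progress through the page */
        __u16 size;                     /* size of the compound page */
        unsigned int pagecnt_bias;      /* batched refcount ("bias") */
        bool pfmemalloc;                /* page came from the reserves */
};

/* this RFC: a single pointer */
struct page_frag_cache_after {
        struct page *page;
};

With the second form, the offset, the remaining-space mask and the
refcount all live in the struct page we're already touching anyway.
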
> In
> addition converting from a page to a virtual address is actually more
> expensive then you would think it should be. I went through and
> optimized that the best I could, but there is still a pretty
> significant cost to the call.

Were you able to break down where that cost occurred? It looks
pretty cheap on x86-32 nohighmem:
    343e:  a1 00 00 00 00          mov    0x0,%eax
                         343f: R_386_32  mem_map
    3443:  29 f3                   sub    %esi,%ebx
    3445:  89 5a 08                mov    %ebx,0x8(%edx)
    3448:  29 c2                   sub    %eax,%edx
    344a:  c1 fa 05                sar    $0x5,%edx
    344d:  c1 e2 0c                shl    $0xc,%edx
    3450:  8d 84 13 00 00 00 c0    lea    -0x40000000(%ebx,%edx,1),%eax
    3457:  8b 5d f8                mov    -0x8(%ebp),%ebx
    345a:  8b 75 fc                mov    -0x4(%ebp),%esi
    345d:  89 ec                   mov    %ebp,%esp
    345f:  5d                      pop    %ebp
    3460:  c3                      ret
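
To spell out what that code is doing: with FLATMEM and no highmem,
page_to_virt() reduces to simple arithmetic, roughly the sketch below
(frag_address() is just an illustrative name, not a real helper; the
'sar $0x5' is the divide by sizeof(struct page) == 32 on this config):

static void *frag_address(struct page *page, unsigned int offset)
{
        unsigned long pfn = page - mem_map;     /* sub %eax,%edx; sar $0x5 */

        /* shl $0xc scales to bytes; the lea then folds in PAGE_OFFSET
         * (-0x40000000 == +0xc0000000) plus the fragment offset in %ebx */
        return (void *)(PAGE_OFFSET + (pfn << PAGE_SHIFT) + offset);
}
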
I did code up a variant which stored the virtual address of the offset
in page->index, but I threw that away after seeing how much more code
it turned into. Fortunately I had a better idea of how to implement that
early this morning. I haven't even tested it yet, but the generated
code looks better:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 28d9a4c9c5fd..ec38204eb903 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4357,8 +4357,8 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
                 unsigned int offset = PAGE_SIZE << compound_order(old);
 
                 if (offset > size) {
-                        old->offset = offset;
-                        return old;
+                        page = old;
+                        goto reset;
                 }
         }
 
@@ -4379,7 +4379,10 @@ struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
         if (old)
                 put_page(old);
 
         nc->page = page;
-        page->offset = PAGE_SIZE << compound_order(page);
+        page->private = (PAGE_SIZE << compound_order(page)) - 1;
+reset:
+        page->index = (unsigned long)page_to_virt(page) +
+                        (PAGE_SIZE << compound_order(page));
         return page;
 }
@@ -4402,20 +4405,26 @@ void *page_frag_alloc(struct page_frag_cache *nc,
                         unsigned int size, gfp_t gfp_mask)
 {
         struct page *page = nc->page;
-        unsigned int offset = page->offset;
+        unsigned long addr = addr;
+        unsigned int offset = 0;
 
-        if (unlikely(!page || offset < size)) {
+        if (page) {
+                addr = page->index;
+                offset = addr & page->private;
+        }
+
+        if (unlikely(offset < size)) {
                 page = __page_frag_cache_refill(nc, size, gfp_mask);
                 if (!page)
                         return NULL;
-                offset = page->offset;
+                addr = page->index;
         }
 
         page_ref_inc(page);
-        offset -= size;
-        page->offset = offset;
+        addr -= size;
+        page->index = addr;
 
-        return page_address(page) + offset;
+        return (void *)addr;
 }
 EXPORT_SYMBOL(page_frag_alloc);
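
To spell out what the patch relies on (my reading of it; as I said,
it's untested): page->index is a downward-moving allocation cursor
holding a virtual address, starting at the end of the compound page,
and page->private holds the compound page size minus one as a mask.
Because the compound page is naturally aligned, masking the cursor
recovers how many bytes are still free.  (The 'addr = addr'
self-initialization is presumably only there to silence a
maybe-uninitialized warning.)  With made-up numbers for an order-3
(32KB) page:

        base          = page_to_virt(page)     e.g. 0xc5a00000 (32KB-aligned)
        page->private = 32768 - 1              = 0x7fff
        page->index   = base + 32768           = 0xc5a08000 (after refill)

after page_frag_alloc(nc, 1536, ...):

        page->index   = 0xc5a08000 - 1536      = 0xc5a07a00
        index & private                        = 0x7a00, i.e. 31232 bytes free

and the next call refills once (index & private) drops below the
requested size.
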
Obviously if the problem turns out to be the cacheline thrashing rather
than the call to page_to_virt, then this is pointless to test.

> I won't be able to test the patches until next week, but I expect I
> will probably see a noticeable regression when performing a small
> packet routing test.

I really appreciate you being willing to try this for me. I need to
get myself a dual-socket machine to test things like this.