[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YFja/LRC1NI6quL6@cmpxchg.org>
Date: Mon, 22 Mar 2021 13:59:24 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: "Matthew Wilcox (Oracle)" <willy@...radead.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-cachefs@...hat.com,
linux-afs@...ts.infradead.org
Subject: Re: [PATCH v5 00/27] Memory Folios
On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks
> exist which show the benefits of a larger "page size". As an example,
> an earlier iteration of this idea which used compound pages got a 7%
> performance boost when compiling the kernel using kernbench without any
> particular tuning.
>
> Using compound pages or THPs exposes a serious weakness in our type
> system. Functions are often unprepared for compound pages to be passed
> to them, and may only act on PAGE_SIZE chunks. Even functions which are
> aware of compound pages may expect a head page, and do the wrong thing
> if passed a tail page.
>
> There have been efforts to label function parameters as 'head' instead
> of 'page' to indicate that the function expects a head page, but this
> leaves us with runtime assertions instead of using the compiler to prove
> that nobody has mistakenly passed a tail page. Calling a struct page
> 'head' is also inaccurate as they will work perfectly well on base pages.
> The term 'nottail' has not proven popular.
>
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page. Almost every call to PageFoo() contains one or more hidden
> calls to compound_head(). This also happens for get_page(), put_page()
> and many more functions. There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.
>
> This series introduces the 'struct folio' as a replacement for
> head-or-base pages. This initial set reduces the kernel size by
> approximately 6kB, although its real purpose is adding infrastructure
> to enable further use of the folio.
>
> The intent is to convert all filesystems and some device drivers to work
> in terms of folios. This series contains a lot of explicit conversions,
> but it's important to realise it's removing a lot of implicit conversions
> in some relatively hot paths. There will be very few conversions from
> folios when this work is completed; filesystems, the page cache, the
> LRU and so on will generally only deal with folios.
If that is the case, shouldn't there in the long term only be very
few, easy to review instances of things like compound_head(),
PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
never see tail pages and 2) never assume a compile-time page size?
What are the higher-level places that in the long-term should be
dealing with tail pages at all? Are there legit ones besides the page
allocator, THP splitting internals & pte-mapped compound pages?
I do agree that the current confusion around which layer sees which
types of pages is a problem. But I also think a lot of it is the
result of us being in a transitional period where we've added THP in
more places but not all code and data structures are or were fully
native yet, and so we had things leak out or into where maybe they
shouldn't be to make things work in the short term.
But this part is already getting better, and has gotten better, with
the page cache (largely?) going native for example.
Some compound_head() that are currently in the codebase are already
unnecessary. Like the one in activate_page().
And looking at grep, I wouldn't be surprised if only the page table
walkers need the page_compound() that mark_page_accessed() does. We
would be better off if they did the translation once and explicitly in
the outer scope, where it's clear they're dealing with a pte-mapped
compound page, instead of having a series of rather low level helpers
(page flags testing, refcount operations, LRU operations, stat
accounting) all trying to be clever but really just obscuring things
and imposing unnecessary costs on the vast majority of cases.
So I fully agree with the motivation behind this patch. But I do
wonder why it's special-casing the commmon case instead of the rare
case. It comes at a huge cost. Short term, the churn of replacing
'page' with 'folio' in pretty much all instances is enormous.
And longer term, I'm not convinced folio is the abstraction we want
throughout the kernel. If nobody should be dealing with tail pages in
the first place, why are we making everybody think in 'folios'? Why
does a filesystem care that huge pages are composed of multiple base
pages internally? This feels like an implementation detail leaking out
of the MM code. The vast majority of places should be thinking 'page'
with a size of 'page_size()'. Including most parts of the MM itself.
The compile-time check is nice, but I'm not sure it would be that much
more effective at catching things than a few centrally placed warns
inside PageFoo(), get_page() etc. and other things that should not
encounter tail pages in the first place (with __helpers for the few
instances that do). And given the invasiveness of this change, they
ought to be very drastically better at it, and obviously so, IMO.
> Documentation/core-api/mm-api.rst | 7 +
> fs/afs/write.c | 3 +-
> fs/cachefiles/rdwr.c | 19 ++-
> fs/io_uring.c | 2 +-
> include/linux/memcontrol.h | 21 +++
> include/linux/mm.h | 156 +++++++++++++++----
> include/linux/mm_types.h | 52 +++++++
> include/linux/mmdebug.h | 20 +++
> include/linux/netfs.h | 2 +-
> include/linux/page-flags.h | 120 +++++++++++---
> include/linux/pagemap.h | 249 ++++++++++++++++++++++--------
> include/linux/swap.h | 6 +
> include/linux/vmstat.h | 107 +++++++++++++
> mm/Makefile | 2 +-
> mm/filemap.c | 237 ++++++++++++++--------------
> mm/folio-compat.c | 37 +++++
> mm/memory.c | 8 +-
> mm/page-writeback.c | 62 ++++++--
> mm/swapfile.c | 8 +-
> mm/util.c | 30 ++--
> 20 files changed, 857 insertions(+), 291 deletions(-)
> create mode 100644 mm/folio-compat.c
Powered by blists - more mailing lists