Message-ID: <e567ad16-0f2b-940b-a39b-a4d1505bfcb9@redhat.com>
Date: Thu, 23 Sep 2021 11:03:44 +0200
From: David Hildenbrand <david@...hat.com>
To: Kent Overstreet <kent.overstreet@...il.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Cc: Johannes Weiner <hannes@...xchg.org>,
Matthew Wilcox <willy@...radead.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"Darrick J. Wong" <djwong@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
David Howells <dhowells@...hat.com>
Subject: Re: Struct page proposal
On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
>
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
>
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
>
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
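For reference, the 0.4% figure above can be reproduced with a back-of-the-envelope calculation. This is only a sketch, assuming 4 KiB base pages and 8-byte pointers; the names small_page and overhead_bp are made up for illustration and are not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-word descriptor, mirroring the struct proposed in this thread. */
struct small_page {
	unsigned long allocator;
	unsigned long allocatee;
};

/* Two same-typed words, so no padding is expected on any common ABI. */
static_assert(sizeof(struct small_page) == 2 * sizeof(unsigned long),
	      "unexpected padding in struct small_page");

/*
 * Per-page descriptor overhead in hundredths of a percent
 * (integer division, so the result is rounded down).
 */
static unsigned long overhead_bp(size_t desc_size, size_t page_size)
{
	return (unsigned long)(desc_size * 10000 / page_size);
}
```

With these assumptions, overhead_bp(16, 4096) gives 39, i.e. roughly 0.39%. A 64-byte descriptor, which is what struct page typically is on 64-bit configurations today, gives 156, i.e. roughly 1.56%.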
>
>
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
>
> The main thing to note is that in normal operation most folios are going to be
> describing many pages, not just one, so we'll be using _less_ memory overall
> if we allocate them separately. That's cool.
>
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
>
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
>
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
>
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
>
> So if we have this:
>
> struct page {
> unsigned long allocator;
> unsigned long allocatee;
> };
>
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
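To make the buddy case of the allocator field concrete, here is one possible encoding of "order plus free bit" in a single word. This is a sketch under my own assumptions: the bit layout and all BUDDY_* / buddy_* names are hypothetical, and it deliberately leaves open how a slab pointer would be told apart from a buddy encoding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: low 6 bits hold the order, the next bit marks "free". */
#define BUDDY_ORDER_BITS 6UL
#define BUDDY_ORDER_MASK ((1UL << BUDDY_ORDER_BITS) - 1)
#define BUDDY_FREE       (1UL << BUDDY_ORDER_BITS)

/* Pack the allocation order and the free state into one word. */
static unsigned long buddy_encode(unsigned int order, bool is_free)
{
	assert(order <= BUDDY_ORDER_MASK);	/* order must fit in its bit field */
	return (unsigned long)order | (is_free ? BUDDY_FREE : 0UL);
}

/* Extract the allocation order. */
static unsigned int buddy_order(unsigned long allocator)
{
	return (unsigned int)(allocator & BUDDY_ORDER_MASK);
}

/* Is the (compound group of) pages free? */
static bool buddy_is_free(unsigned long allocator)
{
	return (allocator & BUDDY_FREE) != 0;
}

/* Number of base pages covered by an allocation of this order. */
static unsigned long buddy_nr_pages(unsigned long allocator)
{
	return 1UL << buddy_order(allocator);
}
```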
>
> The allocatee field would be used for a type-tagged pointer (tagged using the
> low bits of the pointer) to one of:
> - struct folio
> - struct anon_folio, if that becomes a thing
> - struct network_pool_page
> - struct pte_page
> - struct zone_device_page
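The type tagging described above could look roughly like the following. This is a minimal sketch, assuming the pointed-to objects are at least 8-byte aligned (so three low bits are free) and that unsigned long is pointer-sized, as it is in the kernel. The enum values and the allocatee_encode(), page_get_type() and page_object() helpers are hypothetical names, not existing kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags, one per type in the list above; values are illustrative. */
enum page_type {
	PAGE_TYPE_FOLIO        = 0,
	PAGE_TYPE_ANON_FOLIO   = 1,
	PAGE_TYPE_NETWORK_POOL = 2,
	PAGE_TYPE_PTE          = 3,
	PAGE_TYPE_ZONE_DEVICE  = 4,
};

#define PAGE_TYPE_BITS 3UL
#define PAGE_TYPE_MASK ((1UL << PAGE_TYPE_BITS) - 1)

struct page {
	unsigned long allocator;
	unsigned long allocatee;
};

/* Pack an object pointer and its type tag into one word. */
static unsigned long allocatee_encode(void *obj, enum page_type type)
{
	uintptr_t p = (uintptr_t)obj;

	assert((p & PAGE_TYPE_MASK) == 0);	/* low bits must be free for the tag */
	return (unsigned long)p | (unsigned long)type;
}

/* Recover the type tag from the low bits. */
static enum page_type page_get_type(const struct page *page)
{
	return (enum page_type)(page->allocatee & PAGE_TYPE_MASK);
}

/* Recover the untagged object pointer; the caller casts based on the type. */
static void *page_object(const struct page *page)
{
	return (void *)(uintptr_t)(page->allocatee & ~PAGE_TYPE_MASK);
}
```

A caller going from a generic untyped page would first check page_get_type() and only then cast the result of page_object() to the matching struct.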
>
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
>
> Other notes & potential issues:
> - page->compound_dtor needs to die
>
> - page->rcu_head moves into the types that actually need it, no issues there
>
> - page->refcount has question marks around it. I think we can also just move it
> into the types that need it; with RCU, dereferencing the pointer to the folio or
> whatever and grabbing a ref on folio->refcount can happen under an RCU read
> lock - there's no real question about whether it's technically possible to
> get it out of struct page, and I think it would be cleaner overall that way.
>
> However, depending on how it's used from code paths that go from generic
> untyped pages, I could see it turning into more of a hassle than it's worth.
> More investigation is needed.
>
> - page->memcg_data - I don't know whether that one more properly belongs in
> struct page or in the page subtypes - I'd love it if Johannes could talk
> about that one.
>
> - page->flags - dealing with this is going to be a huge hassle but also where
> we'll find some of the biggest gains in overall sanity and readability of the
> code. Right now, PG_locked is super special and ad hoc and I have run into
> situations multiple times (and Johannes was in vehement agreement on this
> one) where I simply could not figure out the behaviour of the current code re:
> who is responsible for locking pages without instrumenting the code with
> assertions.
>
> Meaning anything we do to create and enforce module boundaries between
> different chunks of code is going to suck, but the end result should be
> really worthwhile.
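On the refcount note in the list above: taking a reference on an object that was found through a lockless lookup generally needs an "increment unless zero" operation, the same idea as the existing get_page_unless_zero(). Below is a userspace sketch with C11 atomics. struct obj, obj_try_get() and obj_put() are hypothetical names, and RCU itself is not modelled here; the sketch only shows the refcount half.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a typed object such as a folio. */
struct obj {
	atomic_int refcount;
};

/*
 * Take a reference only if the object is still live (refcount > 0).
 * A lookup that races with the final put must fail rather than
 * resurrect an object that is about to be freed.
 */
static bool obj_try_get(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old > 0) {
		/* On failure, 'old' is updated to the current value and we retry. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);	/* catch refcount underflow */
	return old == 1;
}
```

In a real implementation the memory holding the object would additionally have to stay valid until the RCU grace period ends, which is what makes the speculative load in obj_try_get() safe.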
>
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
>
Don't get me wrong, but until there are answers to some of the very
basic questions raised above (especially about everything that lives in
page->flags, which holds more than just page flags, and about refcount, ...),
this isn't very tempting to spend more time on from a reviewer's perspective.
--
Thanks,
David / dhildenb