Date:   Thu, 23 Sep 2021 11:03:44 +0200
From:   David Hildenbrand <>
To:     Kent Overstreet <>,,,
Cc:     Johannes Weiner <>,
        Matthew Wilcox <>,
        Linus Torvalds <>,
        Andrew Morton <>,
        "Darrick J. Wong" <>,
        Christoph Hellwig <>,
        David Howells <>
Subject: Re: Struct page proposal

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
> Matthew's helpfully been coming up with a list of page types:
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
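
The 0.4% figure follows from simple arithmetic: two 8-byte words per 4 KiB page is 16/4096, or roughly 0.39%. A minimal sketch of that calculation, assuming a 64-bit system with 4 KiB base pages; `struct small_page`, `ASSUMED_PAGE_SIZE` and `page_overhead_percent()` are names made up for illustration and appear nowhere in the proposal:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-word page descriptor, mirroring the proposed layout. */
struct small_page {
	unsigned long allocator;
	unsigned long allocatee;
};

/* The descriptor really is just two words, with no padding. */
static_assert(sizeof(struct small_page) == 2 * sizeof(unsigned long),
	      "descriptor should be exactly two words");

/* Assumption: 4 KiB base pages. */
#define ASSUMED_PAGE_SIZE 4096UL

/* Share of system memory taken by one descriptor per page, in percent. */
static double page_overhead_percent(size_t desc_size, unsigned long page_size)
{
	return 100.0 * (double)desc_size / (double)page_size;
}
```

With a 16-byte descriptor, `page_overhead_percent(16, ASSUMED_PAGE_SIZE)` gives 0.390625. For comparison, a 64-byte descriptor (a typical size of struct page on 64-bit configurations today) gives 1.5625.
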
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
> The main thing to note is that in normal operation most folios are going to
> be describing many pages, not just one, so we'll be using _less_ memory
> overall if we allocate them separately. That's cool.
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
> So if we have this:
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
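
One way the allocator word could hold either a slab pointer or a buddy order plus a free bit is sketched below. The bit layout is purely an assumption made for illustration (the proposal does not specify one): bit 0 marks a buddy page, bit 1 marks it free, and the order sits above that. `struct slab_state` is a stand-in type, and the sketch assumes `unsigned long` is pointer-sized, as it is in the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding; none of these names come from the proposal. */
#define ALLOC_BUDDY		0x1UL	/* bit 0 set: buddy allocator page */
#define ALLOC_FREE		0x2UL	/* bit 1 set: the page group is free */
#define ALLOC_ORDER_SHIFT	2	/* order is stored above the flag bits */

struct slab_state { int dummy; };	/* stand-in for slab/slub state */

/* Slab page: store the pointer itself; alignment keeps bit 0 clear. */
static unsigned long allocator_from_slab(struct slab_state *s)
{
	assert(((uintptr_t)s & ALLOC_BUDDY) == 0);
	return (unsigned long)(uintptr_t)s;
}

/* Buddy page: pack the order and the free bit, then set the buddy tag. */
static unsigned long allocator_from_buddy(unsigned int order, bool is_free)
{
	return ((unsigned long)order << ALLOC_ORDER_SHIFT) |
	       (is_free ? ALLOC_FREE : 0UL) | ALLOC_BUDDY;
}

static bool allocator_is_buddy(unsigned long a)
{
	return (a & ALLOC_BUDDY) != 0;
}

static bool allocator_is_free(unsigned long a)
{
	return (a & ALLOC_FREE) != 0;
}

static unsigned int allocator_order(unsigned long a)
{
	return (unsigned int)(a >> ALLOC_ORDER_SHIFT);
}

static struct slab_state *allocator_slab(unsigned long a)
{
	return (struct slab_state *)(uintptr_t)a;
}
```

Callers would check `allocator_is_buddy()` first and only then interpret the word as an order or as a slab pointer.
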
> The allocatee field would be used for a type-tagged pointer (the tag being
> stored in the low bits of the pointer) to one of:
>   - struct folio
>   - struct anon_folio, if that becomes a thing
>   - struct network_pool_page
>   - struct pte_page
>   - struct zone_device_page
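
The low-bit tagging described above could look roughly like the following. This is a sketch under stated assumptions, not the proposed implementation: the enum values, the 3-bit tag width, and the helper names are all invented here, and the pointed-to objects are assumed to be at least 8-byte aligned so that the low three bits are free:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags for the page types listed in the proposal. */
enum page_type {
	PT_FOLIO = 0,
	PT_ANON_FOLIO,
	PT_NETWORK_POOL,
	PT_PTE,
	PT_ZONE_DEVICE,
};

/* Assumption: objects are 8-byte aligned, leaving 3 low bits for the tag. */
#define PT_TAG_MASK 0x7UL

static_assert(PT_ZONE_DEVICE <= PT_TAG_MASK, "tag must fit in the low bits");

/* Combine an aligned pointer and its type tag into one word. */
static unsigned long allocatee_encode(void *obj, enum page_type type)
{
	assert(((uintptr_t)obj & PT_TAG_MASK) == 0);
	return (unsigned long)(uintptr_t)obj | (unsigned long)type;
}

/* Recover the type tag from the low bits. */
static enum page_type allocatee_type(unsigned long v)
{
	return (enum page_type)(v & PT_TAG_MASK);
}

/* Recover the pointer by masking the tag off. */
static void *allocatee_ptr(unsigned long v)
{
	return (void *)(uintptr_t)(v & ~PT_TAG_MASK);
}
```

Code walking from an untyped page would switch on `allocatee_type()` and cast the result of `allocatee_ptr()` to the matching struct.
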
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
> Other notes & potential issues:
>   - page->compound_dtor needs to die
>   - page->rcu_head moves into the types that actually need it, no issues there
>   - page->refcount has question marks around it. I think we can also just move it
>     into the types that need it; with RCU derefing the pointer to the folio or
>     whatever and grabbing a ref on folio->refcount can happen under an RCU read
>     lock - there's no real question about whether it's technically possible to
>     get it out of struct page, and I think it would be cleaner overall that way.
>     However, depending on how it's used from code paths that go from generic
>     untyped pages, I could see it turning into more of a hassle than it's worth.
>     More investigation is needed.
>   - page->memcg_data - I don't know whether that one more properly belongs in
>     struct page or in the page subtypes - I'd love it if Johannes could talk
>     about that one.
>   - page->flags - dealing with this is going to be a huge hassle but also where
>     we'll find some of the biggest gains in overall sanity and readability of the
>     code. Right now, PG_locked is super special and ad hoc and I have run into
>     situations multiple times (and Johannes was in vehement agreement on this
>     one) where I simply could not figure out the behaviour of the current code re:
>     who is responsible for locking pages without instrumenting the code with
>     assertions.
>     Meaning anything we do to create and enforce module boundaries between
>     different chunks of code is going to suck, but the end result should be
>     really worthwhile.
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
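
The redundancy argument can be illustrated with a short sketch: if the flag only ever mirrors "the private word is nonzero", the check can be derived from the word itself. `struct folio_like`, its `priv` field and `has_private()` are hypothetical stand-ins, not kernel definitions:

```c
#include <stdbool.h>

/* Hypothetical stand-in for a folio carrying filesystem-private state. */
struct folio_like {
	unsigned long flags;
	unsigned long priv;	/* plays the role of page->private */
};

/* Derived check: no separate flag bit is needed if it only mirrors priv. */
static bool has_private(const struct folio_like *f)
{
	return f->priv != 0;
}
```

This only holds for users where the flag carries no extra meaning beyond "private is set", which is the case the quoted paragraph describes.
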

Don't get me wrong, but before there are answers to some of the very
basic questions raised above (especially regarding everything that
currently lives in page->flags, which holds more than just page flags,
and regarding the refcount, ...), this isn't very tempting to spend
more time on from a reviewer's perspective.


David / dhildenb
