Date: Thu, 23 Sep 2021 11:03:44 +0200
From: David Hildenbrand <david@...hat.com>
To: Kent Overstreet <kent.overstreet@...il.com>, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Johannes Weiner <hannes@...xchg.org>,
 Matthew Wilcox <willy@...radead.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 "Darrick J. Wong" <djwong@...nel.org>,
 Christoph Hellwig <hch@...radead.org>,
 David Howells <dhowells@...hat.com>
Subject: Re: Struct page proposal

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
>
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
>
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
>
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
>
>
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
>
> The main thing to note is that since in normal operation most folios are going
> to be describing many pages, not just one - and we'll be using _less_ memory
> overall if we allocate them separately. That's cool.
>
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
>
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
>
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
>
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
>
> So if we have this:
>
>     struct page {
>         unsigned long allocator;
>         unsigned long allocatee;
>     };
>
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
>
> The allocatee field would be used for a type-tagged pointer (using the low
> bits of the pointer) to one of:
>  - struct folio
>  - struct anon_folio, if that becomes a thing
>  - struct network_pool_page
>  - struct pte_page
>  - struct zone_device_page
>
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
>
> Other notes & potential issues:
>  - page->compound_dtor needs to die
>
>  - page->rcu_head moves into the types that actually need it, no issues there
>
>  - page->refcount has question marks around it. I think we can also just move it
>    into the types that need it; with RCU derefing the pointer to the folio or
>    whatever and grabbing a ref on folio->refcount can happen under an RCU read
>    lock - there's no real question about whether it's technically possible to
>    get it out of struct page, and I think it would be cleaner overall that way.
>
>    However, depending on how it's used from code paths that go from generic
>    untyped pages, I could see it turning into more of a hassle than it's worth.
>    More investigation is needed.
>
>  - page->memcg_data - I don't know whether that one more properly belongs in
>    struct page or in the page subtypes - I'd love it if Johannes could talk
>    about that one.
>
>  - page->flags - dealing with this is going to be a huge hassle but also where
>    we'll find some of the biggest gains in overall sanity and readability of the
>    code. Right now, PG_locked is super special and ad hoc and I have run into
>    situations multiple times (and Johannes was in vehement agreement on this
>    one) where I simply could not figure out the behaviour of the current code re:
>    who is responsible for locking pages without instrumenting the code with
>    assertions.
>
>    Meaning anything we do to create and enforce module boundaries between
>    different chunks of code is going to suck, but the end result should be
>    really worthwhile.
>
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
>

Don't get me wrong, but before there are answers to some of the very
basic questions raised above (especially everything that lives in
page->flags, which are not only page flags, refcount, ...) this isn't
very tempting to spend more time on, from a reviewer perspective.

-- 
Thanks,

David / dhildenb
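
[Archive annotation, not part of the original message.] For readers following
the quoted proposal, here is a minimal userspace sketch of the two-word
struct page, the low-bit type tag on the allocatee field, and the arithmetic
behind the "0.4% of system memory" figure. Only struct page and its two
fields come from the proposal. The enum values, the 3-bit tag width, and the
helper names (page_set_allocatee, page_allocatee_type, page_allocatee,
page_overhead_ppm) are hypothetical and chosen for illustration only; this
is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag values: the proposal lists the candidate types
 * but does not assign numbers to them. */
enum page_type {
	PAGE_TYPE_FOLIO = 0,
	PAGE_TYPE_ANON_FOLIO,
	PAGE_TYPE_NETWORK_POOL,
	PAGE_TYPE_PTE,
	PAGE_TYPE_ZONE_DEVICE,
};

/* Assumption: the pointed-to objects are at least 8-byte aligned,
 * so the low 3 bits of the pointer are free to carry the tag. */
#define PAGE_TYPE_BITS	3
#define PAGE_TYPE_MASK	((1UL << PAGE_TYPE_BITS) - 1)

/* The layout from the proposal: one word per side of the
 * allocator/allocatee boundary. */
struct page {
	unsigned long allocator;
	unsigned long allocatee;
};

/* As in the kernel, this sketch assumes a pointer fits in unsigned long. */
_Static_assert(sizeof(unsigned long) == sizeof(void *),
	       "sketch assumes pointer-sized unsigned long");

/* Store a pointer and its type tag in the allocatee word. */
static inline void page_set_allocatee(struct page *page, void *obj,
				      enum page_type type)
{
	/* The tag bits must be clear in the pointer itself. */
	assert(((uintptr_t)obj & PAGE_TYPE_MASK) == 0);
	page->allocatee = (unsigned long)(uintptr_t)obj | (unsigned long)type;
}

/* Recover the type tag from the low bits. */
static inline enum page_type page_allocatee_type(const struct page *page)
{
	return (enum page_type)(page->allocatee & PAGE_TYPE_MASK);
}

/* Recover the pointer by masking the tag bits off. */
static inline void *page_allocatee(const struct page *page)
{
	return (void *)(uintptr_t)(page->allocatee & ~PAGE_TYPE_MASK);
}

/* Memory overhead of struct page in parts per million of each page.
 * With 8-byte words and 4096-byte pages: 16 * 1000000 / 4096 = 3906,
 * i.e. about 0.39%, which rounds to the 0.4% quoted in the proposal. */
static inline unsigned long page_overhead_ppm(unsigned long page_size)
{
	return (unsigned long)(sizeof(struct page) * 1000000UL / page_size);
}
```

A caller would check page_allocatee_type() first and only then cast the
result of page_allocatee() to the matching struct. What the allocator word
encodes for buddy pages (order, free or not) is left out here because the
proposal does not pin down an encoding.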