Message-ID: <YUIzdTyFBITDIPnj@fedora.tometzki.de>
Date: Wed, 15 Sep 2021 19:55:01 +0200
From: Damian Tometzki <dtometzki@...oraproject.org>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Kent Overstreet <kent.overstreet@...il.com>,
Matthew Wilcox <willy@...radead.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
"Darrick J. Wong" <djwong@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
David Howells <dhowells@...hat.com>
Subject: Re: Folio discussion recap
Hello all,

I am an outsider following the discussion on this subject.
Could we not go upstream with the current state of development?
There will always be further optimizations, and new kernel releases too.
I cannot assess the risk, but I think a decision must be made.

Damian
On Wed, 15. Sep 11:40, Johannes Weiner wrote:
> On Fri, Sep 10, 2021 at 04:16:28PM -0400, Kent Overstreet wrote:
> > One particularly noteworthy idea was having struct page refer to
> > multiple hardware pages, and using slab/slub for larger
> > allocations. In my view, the primary reason for making this change
> > isn't the memory overhead to struct page (though reducing that would
> > be nice);
>
> Don't underestimate this, however.
>
> Picture the near future Willy describes, where we don't bump struct
> page size yet but serve most cache with compound huge pages.
>
> On x86, it would mean that the average page cache entry has 512
> mapping pointers, 512 index members, 512 private pointers, 1024 LRU
> list pointers, 512 dirty flags, 512 writeback flags, 512 uptodate
> flags, 512 memcg pointers etc. - you get the idea.
>
> This is a ton of memory. I think this doesn't get more traction
> because it's memory we've always allocated, and we're simply more
> sensitive to regressions than long-standing pain. But nevertheless
> this is a pretty low-hanging fruit.
>
> The folio makes a great first step moving those into a separate data
> structure, opening the door to one day realizing these savings. Even
> when some MM folks say this was never the intent behind the patches, I
> think this is going to matter significantly, if not more so, later on.
>
> > Fortunately, Matthew made a big step in the right direction by making folios a
> > new type. Right now, struct folio is not separately allocated - it's just
> > unionized/overlayed with struct page - but perhaps in the future they could be
> > separately allocated. I don't think that is a remotely realistic goal for _this_
> > patch series given the amount of code that touches struct page (think: writeback
> > code, LRU list code, page fault handlers!) - but I think that's a goal we could
> > keep in mind going forward.
>
> Yeah, agreed. Not doable out of the gate, but retaining the ability to
> allocate the "cache entry descriptor" bits - mapping, index etc. -
> on-demand would be a huge benefit down the road for the above reason.
>
> For that they would have to be in - and stay in - their own type.
>
> > We should also be clear on what _exactly_ folios are for, so they don't become
> > the new dumping ground for everyone to stash their crap. They're to be a new
> > core abstraction, and we should endeavor to keep our core data structures
> > _small_, and _simple_.
>
> Right. struct page is a lot of things and anything but simple and
> obvious today. struct folio in its current state does a good job
> separating some of that stuff out.
>
> However, when we think about *which* of the struct page mess the folio
> wants to address, I think that bias toward recent pain over much
> bigger long-standing pain strikes again.
>
> The compound page proliferation is new, and we're sensitive to the
> ambiguity it created between head and tail pages. It's added some
> compound_head() in lower-level accessor functions that are not
> necessary for many contexts. The folio type safety will help clean
> that up, and this is great.
>
> However, there is a much bigger, systematic type ambiguity in the MM
> world that we've just gotten used to over the years: anon vs file vs
> shmem vs slab vs ...
>
> - Many places rely on context to say "if we get here, it must be
> anon/file", and then unsafely access overloaded member elements:
> page->mapping, PG_readahead, PG_swapcache, PG_private
>
> - On the other hand, we also have low-level accessor functions that
> disambiguate the type and impose checks on contexts that may or may
> not actually need them - not unlike compound_head() in PageActive():
>
>     struct address_space *folio_mapping(struct folio *folio)
>     {
>         struct address_space *mapping;
>
>         /* This happens if someone calls flush_dcache_page on slab page */
>         if (unlikely(folio_test_slab(folio)))
>             return NULL;
>
>         if (unlikely(folio_test_swapcache(folio)))
>             return swap_address_space(folio_swap_entry(folio));
>
>         mapping = folio->mapping;
>         if ((unsigned long)mapping & PAGE_MAPPING_ANON)
>             return NULL;
>
>         return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
>     }
>
> Then we go identify places that say "we know it's at least not a
> slab page!" and convert them to page_mapping_file() which IS safe to
> use with anon. Or we say "we know this MUST be a file page" and just
> access the (unsafe) mapping pointer directly.
>
> - We have a singular page lock, but what it guards depends on what
> type of page we're dealing with. For a cache page it protects
> uptodate and the mapping. For an anon page it protects swap state.
>
> A lot of us can remember the rules if we try, but the code doesn't
> help and it gets really tricky when dealing with multiple types of
> pages simultaneously. Even mature code like reclaim just serializes
> the operation instead of protecting data - the writeback checks and
> the page table reference tests don't seem to need page lock.
>
> When the cgroup folks wrote the initial memory controller, they just
> added their own page-scope lock to protect page->memcg even though
> the page lock would have covered what it needed.
>
> - shrink_page_list() uses page_mapping() in the first half of the
> function to tell whether the page is anon or file, but halfway
> through we do this:
>
>     /* Adding to swap updated mapping */
>     mapping = page_mapping(page);
>
> and then use PageAnon() to disambiguate the page type.
>
> - At activate_locked:, we check PG_swapcache directly on the page and
> rely on it doing the right thing for anon, file, and shmem pages.
> But this flag is PG_owner_priv_1 and actually used by the filesystem
> for something else. I guess PG_checked pages currently don't make it
> this far in reclaim, or we'd crash somewhere in try_to_free_swap().
>
> I suppose we're also never calling page_mapping() on PageChecked
> filesystem pages right now, because it would return a swap mapping
> before testing whether this is a file page. You know, because shmem.
>
> These are just a few examples from an MM perspective. I'm sure the FS
> folks have their own stories and examples about pitfalls in dealing
> with struct page members.
>
> We're so used to this that we don't realize how much bigger and
> pervasive this lack of typing is than the compound page thing.
>
> I'm not saying the compound page mess isn't worth fixing. It is.
>
> I'm saying if we started with a file page or cache entry abstraction
> we'd solve not only the huge page cache, but also set us up for a MUCH
> more comprehensive cleanup in MM code and MM/FS interaction that makes
> the tailpage cleanup pale in comparison. For the same amount of churn,
> since folio would also touch all of these places.
>