linux-kernel - Re: [GIT PULL] Memory folios for v5.15

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YVyLh+bnZzNeQkyb@cmpxchg.org>
Date:   Tue, 5 Oct 2021 13:29:43 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [GIT PULL] Memory folios for v5.15

On Tue, Oct 05, 2021 at 02:52:01PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> > One one hand, the ambition appears to substitute folio for everything
> > that could be a base page or a compound page even inside core MM
> > code. Since there are very few places in the MM code that expressly
> > deal with tail pages in the first place, this amounts to a conversion
> > of most MM code - including the LRU management, reclaim, rmap,
> > migrate, swap, page fault code etc. - away from "the page".
> > 
> > However, this far exceeds the goal of a better mm-fs interface. And
> > the value proposition of a full MM-internal conversion, including
> > e.g. the less exposed anon page handling, is much more nebulous. It's
> > been proposed to leave anon pages out, but IMO to keep that direction
> > maintainable, the folio would have to be translated to a page quite
> > early when entering MM code, rather than propagating it inward, in
> > order to avoid huge, massively overlapping page and folio APIs.
> 
> Here's an example where our current confusion between "any page"
> and "head page" at least produces confusing behaviour, if not an
> outright bug, isolate_migratepages_block():
> 
>                 page = pfn_to_page(low_pfn);
> ...
>                 if (PageCompound(page) && !cc->alloc_contig) {
>                         const unsigned int order = compound_order(page);
> 
>                         if (likely(order < MAX_ORDER))
>                                 low_pfn += (1UL << order) - 1;
>                         goto isolate_fail;
>                 }
> 
> compound_order() does not expect a tail page; it returns 0 unless it's
> a head page.  I think what we actually want to do here is:
> 
> 		if (!cc->alloc_contig) {
> 			struct page *head = compound_head(page);
> 			if (PageHead(head)) {
> 				const unsigned int order = compound_order(head);
> 
> 				low_pfn |= (1UL << order) - 1;
> 				goto isolate_fail;
> 			}
> 		}
> 
> Not earth-shattering; not even necessarily a bug.  But it's an example
> of the way the code reads is different from how the code is executed,
> and that's potentially dangerous.  Having a different type for tail
> and not-tail pages prevents the muddy thinking that can lead to
> tail pages being passed to compound_order().

Thanks for digging this up. I agree the second version is much better.

My question is still whether the extensive folio whitelisting of
everybody else is the best way to bring those codepaths to light.

The above isn't totally random. That code is a pfn walker which
translates from the basepage address space to an ambiguous struct page
object. There are more of those, but we can easily identify them: all
uses of pfn_to_page() and virt_to_page() indicate that the code needs
an audit for how exactly they're using the returned page.

The above instance of such a walker wants to deal with a higher-level
VM object: a thing that can be on the LRU, can be locked, etc. For
those instances the pattern is clear that the pfn_to_page() always
needs to be paired with a compound_head() before handling the page. I
had mentioned in the other subthread a pfn_to_normal_page() to
streamline this pattern, clarify intent, and mark the finished audit.

Another class are page table walkers resolving to an ambiguous struct
page right now. Those are also easy to identify, and AFAICS they all
want headpages, which is why I had mentioned a central compound_head()
in vm_normal_page().

Are there other such classes that I'm missing? Because it seems to me
there are two and they both have rather clear markers for where the
disambiguation needs to happen - and central helpers to put them in!

And it makes sense: almost nobody *actually* needs to access the tail
members of struct page. This suggests a pushdown and early filtering
in a few central translation/lookup helpers would work to completely
disambiguate remaining struct page usage inside MM code.

There *are* a few weird struct page usages left, like bio and sparse,
and you mentioned vm_fault as well in the other subthread. But it
really seems these want converting away from arbitrary struct page to
either something like phys_addr_t or a proper headpage anyway. Maybe a
tuple of headpage and subpage index in the fault case. Because even
after a full folio conversion of everybody else, those would be quite
weird in their use of an ambiguous struct page! Which struct members
are safe to access? What does it mean to lock a tailpage? Etc.

But it's possible I'm missing something. Are there entry points that
are difficult to identify both conceptually and code-wise? And which
couldn't be pushed down to resolve to headpages quite early? Those I
think would make the argument for the folio in the MM implementation.