linux-kernel - Re: [GIT PULL] Memory folios for v5.15

Message-ID: <YSQeFPTMn5WpwyAa@casper.infradead.org>
Date:   Mon, 23 Aug 2021 23:15:48 +0100
From:   Matthew Wilcox <willy@...radead.org>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [GIT PULL] Memory folios for v5.15

On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> On Mon, Aug 23, 2021 at 08:01:44PM +0100, Matthew Wilcox wrote:
> Just to clarify, I'm only on this list because I acked 3 smaller,
> independent memcg cleanup patches in this series. I have repeatedly
> expressed strong reservations over folios themselves.

I thought I'd addressed all your concerns.  I'm sorry I misunderstood;
I did not intend to misrepresent your position.

> The arguments for a better data interface between mm and filesystem in
> light of variable page sizes are plentiful and convincing. But from an
> MM point of view, it's anything but clear where the delineation between the
> page and folio is, and what the endgame is supposed to look like.
> 
> On one hand, the ambition appears to substitute folio for everything
> that could be a base page or a compound page even inside core MM
> code. Since there are very few places in the MM code that expressly
> deal with tail pages in the first place, this amounts to a conversion
> of most MM code - including the LRU management, reclaim, rmap,
> migrate, swap, page fault code etc. - away from "the page".

I would agree with all of those except the page fault code; I believe
that needs to continue to work in terms of pages in order to support
misaligned mappings.

> However, this far exceeds the goal of a better mm-fs interface. And
> the value proposition of a full MM-internal conversion, including
> e.g. the less exposed anon page handling, is much more nebulous. It's
> been proposed to leave anon pages out, but IMO to keep that direction
> maintainable, the folio would have to be translated to a page quite
> early when entering MM code, rather than propagating it inward, in
> order to avoid huge, massively overlapping page and folio APIs.

I only intend to leave anonymous memory out /for now/.  My hope is
that somebody else decides to work on it (and indeed Google have
volunteered someone for the task).

> It's also not clear to me that using the same abstraction for compound
> pages and the file cache object is future proof. It's evident from
> scalability issues in the allocator, reclaim, compaction, etc. that
> with current memory sizes and IO devices, we're hitting the limits of
> efficiently managing memory in 4k base pages by default. It's also
> clear that we'll continue to have a need for 4k cache granularity for
> quite a few workloads that work with large numbers of small files. I'm
> not sure how this could be resolved other than divorcing the idea of a
> (larger) base page from the idea of cache entries that can correspond,
> if necessary, to memory chunks smaller than a default page.

That sounds to me exactly like folios, except for the naming.  From the
MM point of view, it's less churn to do it your way, but from the
point of view of the rest of the kernel, there are going to be unexpected
consequences.  For example, btrfs didn't support page size != block size
until just recently (and I'm not sure that's entirely fixed yet).

And there's nobody working on your idea, at least nobody who has surfaced
so far.  The folio patch is here now.

Folios are also variable sized.  For files which are small, we still only
allocate 4kB to cache them.  If the file is accessed entirely randomly,
we only allocate 4kB chunks at a time.  We only allocate larger folios
when we think there is an advantage to doing so.
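
The sizing behaviour described above can be sketched as a small userspace
model.  This is an illustration only, not code from the folio patch set:
every name here (toy_pick_order, toy_folio_bytes, TOY_MAX_ORDER) is
hypothetical, and the "sequential" flag is an assumed stand-in for
whatever heuristic decides that a larger folio is worthwhile.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12u                      /* 4 KiB base page, as in the mail */
#define TOY_PAGE_SIZE  ((size_t)1 << TOY_PAGE_SHIFT)
#define TOY_MAX_ORDER  4u                       /* assumed cap: 16 pages = 64 KiB */

/* A folio of order n spans 2^n base pages. */
static size_t toy_folio_bytes(unsigned int order)
{
    assert(order <= TOY_MAX_ORDER);
    return TOY_PAGE_SIZE << order;
}

/*
 * Pick a folio order for caching part of a file.
 * - Random access: always a single base page (order 0).
 * - Sequential access: grow the folio while it still fits inside the
 *   file, so a small file never gets more than one base page.
 */
static unsigned int toy_pick_order(size_t file_size, int sequential)
{
    unsigned int order = 0;

    if (!sequential)
        return 0;

    while (order < TOY_MAX_ORDER && toy_folio_bytes(order + 1) <= file_size)
        order++;
    return order;
}
```

With these assumptions, toy_pick_order(100, 1) and toy_pick_order(1 << 20, 0)
both return 0 (one 4kB page), while toy_pick_order(20000, 1) returns 2
(a 16kB folio).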

This benefit is retained if someone does come along to change PAGE_SIZE
to 16KiB (or whatever).  Folios can still be composed of multiple pages,
no matter what the PAGE_SIZE is.
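
The arithmetic behind that claim, as a minimal sketch: a folio covers
2^order base pages regardless of how large the base page is.  The helper
names below are hypothetical and model the idea in userspace; they are
not the kernel's own accessors.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by a folio of 2^order base pages of size 2^page_shift. */
static size_t sketch_folio_size(unsigned int page_shift, unsigned int order)
{
    assert(page_shift + order < sizeof(size_t) * 8);
    return (size_t)1 << (page_shift + order);
}

/* The number of base pages in a folio depends only on its order. */
static size_t sketch_folio_nr_pages(unsigned int order)
{
    assert(order < sizeof(size_t) * 8);
    return (size_t)1 << order;
}
```

For instance, an order-2 folio is four pages either way:
sketch_folio_size(12, 2) is 16KiB with 4KiB pages, and
sketch_folio_size(14, 2) is 64KiB with 16KiB pages.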

> A longer thread on that can be found here:
> https://lore.kernel.org/linux-fsdevel/YFja%2FLRC1NI6quL6@cmpxchg.org/
> 
> As an MM stakeholder, I don't think folios are the answer for MM code.