Message-ID: <3bf50873-4d1b-a7c7-112e-3a17ac16077f@google.com>
Date: Tue, 15 Jul 2025 13:03:40 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>
cc: Baolin Wang <baolin.wang@...ux.alibaba.com>,
Matthew Wilcox <willy@...radead.org>, hughd@...gle.com, david@...hat.com,
ziy@...dia.com, Liam.Howlett@...cle.com, npache@...hat.com,
ryan.roberts@....com, dev.jain@....com, baohua@...nel.org, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm: fault in complete folios instead of individual
pages for tmpfs
On Wed, 9 Jul 2025, Lorenzo Stoakes wrote:
> On Tue, Jul 08, 2025 at 03:53:56PM +0800, Baolin Wang wrote:
> > On 2025/7/7 21:33, Lorenzo Stoakes wrote:
> > > On Sun, Jul 06, 2025 at 10:02:35AM +0800, Baolin Wang wrote:
> > > > On 2025/7/5 06:18, Andrew Morton wrote:
> > > > > On Fri, 4 Jul 2025 11:19:26 +0800 Baolin Wang <baolin.wang@...ux.alibaba.com> wrote:
> > > > >
> > > > > > After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"),
> > > > > > tmpfs can also support large folio allocation (not just PMD-sized large
> > > > > > folios).
> > > > > >
> > > > > > However, when accessing tmpfs via mmap(), although tmpfs supports large folios,
> > > > > > we still establish mappings at the base page granularity, which is unreasonable.
> > > > > >
> > > > > > We can map multiple consecutive pages of a tmpfs folio at once according to
> > > > > > the size of the large folio. On one hand, this can reduce the overhead of page
> > > > > > faults; on the other hand, it can leverage hardware architecture optimizations
> > > > > > to reduce TLB misses, such as contiguous PTEs on the ARM architecture.
> > > > > >
> > > > > > Moreover, a tmpfs mount uses the 'huge=' option to control large folio
> > > > > > allocation explicitly, so it is understandable that the process's RSS
> > > > > > might increase; I don't think this will have any obvious impact on users.
> > > > > >
> > > > > > Performance test:
> > > > > > I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
> > > > > > sequentially via mmap(). I observed a significant performance improvement:
> > > > >
> > > > > That doesn't sound like a crazy thing to do.
> > > > >
> > > > > > Before the patch:
> > > > > > real 0m0.158s
> > > > > > user 0m0.008s
> > > > > > sys 0m0.150s
> > > > > >
> > > > > > After the patch:
> > > > > > real 0m0.021s
> > > > > > user 0m0.004s
> > > > > > sys 0m0.017s
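
[ A minimal sketch of the kind of test described above, for reference;
  the mount point, file name, pre-population step and 4K-stride write
  loop are assumptions, since the actual test program was not posted. ]

/*
 * Illustrative only: mmap() a 1G file on a tmpfs mount (assumed to be
 * mounted huge=always and pre-populated, e.g. with dd, so the page
 * cache holds 64K large folios), then write to it sequentially, one
 * byte per 4K page, so every base page is touched.  Before the patch
 * each touch is a separate fault; after it, one fault maps a whole
 * large folio's worth of PTEs.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1UL << 30;			/* 1G */
	int fd = open("/mnt/tmpfs/testfile", O_RDWR);	/* path assumed */
	char *p;
	size_t off;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (off = 0; off < len; off += 4096)
		p[off] = 1;
	munmap(p, len);
	close(fd);
	return 0;
}
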
> > > > >
> > > > > And look at that.
> > > > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 0f9b32a20e5b..9944380e947d 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -5383,10 +5383,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > > > > > /*
> > > > > > * Using per-page fault to maintain the uffd semantics, and same
> > > > > > - * approach also applies to non-anonymous-shmem faults to avoid
> > > > > > + * approach also applies to non shmem/tmpfs faults to avoid
> > > > > > * inflating the RSS of the process.
> > > > > > */
> > > > > > - if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > > > > + if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
> > > > > > unlikely(needs_fallback)) {
> > > > > > nr_pages = 1;
> > > > > > } else if (nr_pages > 1) {
> > > > >
> > > > > and that's it?
> > > > >
> > > > > I'm itching to get this into -stable, really. What LTS user wouldn't
> > > > > want this?
> > > >
> > > > This is an improvement rather than a bugfix, so I don't think it needs to go
> > > > into LTS.
> > > >
> > > > > Could it be viewed as correcting an oversight in
> > > > > acd7ccb284b8?
> > > >
> > > > Yes, I should have added this optimization in the series of the commit
> > > > acd7ccb284b8. But obviously, I missed this :(.
> > >
> > > Buuut if this was an oversight for that patch that causes an unnecessary
> > > perf degradation, surely this should have fixes tag + cc stable no?
> >
> > IMO, this commit acd7ccb284b8 won't cause perf degradation, instead it is
> > used to introduce a new feature, while the current patch is a further
> > reasonable optimization. As I mentioned, this is an improvement, not a
> > bugfix or a patch to address performance regression.
>
> Well :) you say yourself it was an oversight, and it very clearly has a perf
> _impact_, which if you compare backwards to acd7ccb284b8 is a degradation, but I
> get your point.
>
> However, since you say 'oversight' this seems to me that you really meant to
> have included it but hadn't noticed, and additionally, since it just seems to be
> an unequivocal good - let's maybe flip this round - why NOT backport it to
> stable?
I strongly agree with Baolin: this patch is good, thank you, but it is
a performance improvement, a new feature, not a candidate for the stable
tree. I'm surprised anyone thinks otherwise: Andrew, please delete that
stable tag before advancing the patch from mm-unstable to mm-stable.
And the Fixee went into 6.14, so it couldn't go to 6.12 LTS anyway.
An unequivocal good? I'm not so sure.
I expect it ought to be limited, by fault_around_bytes (or suchlike).
If I understand all the mTHP versus large folio versus PMD-huge handling
correctly (and of course I do not, I'm still weeks if not months away
from understanding most of it), the old vma_is_anon_shmem() case would
be limited by the shmem mTHP tunables, and one can reasonably argue that
they would already take fault_around_bytes-like considerations into account;
but the newly added file-written cases are governed by huge= mount options
intended for PMD-size, but (currently) permitting all lesser orders.
I don't think that mounting a tmpfs huge=always implies that mapping
256 PTEs for one fault is necessarily a good strategy.
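
A minimal sketch of the kind of limit I mean (the helper name and the
exact clamp are illustrative only, not a worked-out proposal;
fault_around_bytes and finish_fault() both live in mm/memory.c):

/*
 * Illustrative only: cap the number of PTEs mapped per fault from a
 * large shmem/tmpfs folio at roughly fault_around_bytes, the same
 * knob that already bounds read fault-around.
 */
static unsigned int shmem_map_nr_cap(unsigned int nr_pages)
{
	unsigned int cap = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;

	/* Always map at least the faulting page itself. */
	return clamp(nr_pages, 1U, max(cap, 1U));
}
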
But looking in the opposite direction, why is there now a vma_is_shmem()
check in finish_fault() at all? If major filesystems are using
large folios, why aren't they also allowed to benefit from mapping
multiple PTEs at once (in this shared-writable case which the existing
fault-around does not cover - I presume to avoid write amplification,
but that's not an issue when the folio is large already)?
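
To make the question concrete: if the gate were on the folio rather
than on the filesystem, the hunk quoted above might simply become
(untested illustration, not a patch):

	/*
	 * Untested illustration: drop the vma_is_shmem() test and keep
	 * only the uffd and fallback special cases, so any filesystem
	 * whose folio is already large maps multiple PTEs per fault.
	 */
	if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) {
		nr_pages = 1;
	} else if (nr_pages > 1) {
		/* map the large folio's PTEs in one go, as the patch does */
	}
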
It's fine to advance cautiously, keeping this to shmem in the coming release;
but I think it should be extended soon (without any backport to stable),
and consideration given to limiting it.
Hugh