[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aPbFgnW1ewPzpBGz@dread.disaster.area>
Date: Tue, 21 Oct 2025 10:28:02 +1100
From: Dave Chinner <david@...morbit.com>
To: Kiryl Shutsemau <kirill@...temov.name>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Hugh Dickins <hughd@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>, Rik van Riel <riel@...riel.com>,
Harry Yoo <harry.yoo@...cle.com>,
Johannes Weiner <hannes@...xchg.org>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Darrick J. Wong" <djwong@...nel.org>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Kiryl Shutsemau <kas@...nel.org>
Subject: Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
On Mon, Oct 20, 2025 at 05:30:52PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@...nel.org>
>
> I do NOT want the patches in this patchset to be applied. Instead, I
> would like to discuss the semantics of large folios versus SIGBUS.
>
> ## Background
>
> Accessing memory within a VMA, but beyond i_size rounded up to the next
> page size, is supposed to generate SIGBUS.
>
> This definition is simple if all pages are PAGE_SIZE in size, but with
> large folios in the picture, it is no longer the case.
>
> ## Problem
>
> Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
> failed due to missing SIGBUS. This was caused by my recent changes that
> try to fault in the whole folio where possible:
>
> 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> 357b92761d94 ("mm/filemap: map entire large folio faultaround")
>
> These changes did not consider i_size when setting up PTEs, leading to
> xfstest breakage.
>
> However, the problem has been present in the kernel for a long time -
> since huge tmpfs was introduced in 2016. The kernel happily maps
> PMD-sized folios as PMD without checking i_size. And huge=always tmpfs
> allocates PMD-size folios on any writes.
The tmpfs huge=always specific behaviour is not how regular
filesystems have behaved. It is niche, special case functionality
that has weird behaviours and, as such, it most definitely does not
set the standards for how all other filesystems should behave.
> I considered this corner case when I implemented a large tmpfs, and my
> conclusion was that no one in their right mind should rely on receiving
> a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
> could be useful for the workload.
Lacking the imagination or knowledge to understand why a behaviour
exists does not mean that behaviour is unnecessary or that it should
be removed. It just means you didn't ask the people who knew wy the
functionality exists...
> Generic/749 was introduced last year with reference to POSIX, but no
> real workloads were mentioned. It also acknowledged the tmpfs deviation
> from the test case.
>
> POSIX indeed says[3]:
>
> References within the address range starting at pa and
> continuing for len bytes to whole pages following the end of an
> object shall result in delivery of a SIGBUS signal.
>
> Do we care about adhering strictly to this in absence of real workloads
> that relies on this semantics?
We've already told you that we do, because mapping beyond EOF has
implications for the impact on how much stale data exposure occur
when the next set of truncate/mmap() bugs are introduced.
> I think it valuable to allow kernel to map memory with a larger chunks
> -- whole folio -- to get TLB benefits (from both huge pages and TLB
> coalescing). I value TLB hit rate over POSIX wording.
Feel free to do that for tmpfs, but for persistent filesystems the
existing POSIX SIGBUS behaviour needs to remain.
> Any opinions?
There are solid historic reasons for the existing behaviour and for
keeping it unchanged. You aren't allowed to handwave them away
because you don't understand or care about them.
In critical paths like truncate, correctness and safety come first.
Performance is only a secondary consideration. The overlap of
mmap() and truncate() is an area where we have had many, many bugs
and, at minimum, the current POSIX behaviour largely shields us from
serious stale data exposure events when those bugs (inevitably)
occur.
Fundamentally, we really don't care about the mapping/tlb
performance of the PTE fragments at EOF. Anyone using files large
enough to notice the TLB overhead improvements from mapping large
folios is not going to notice that the EOF mapping has a slightly
higher TLB miss overhead than everywhere else in the file.
Please jsut fix the regression.
-Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists