Message-ID: <9c1450ba-ade4-4236-8d3e-c5658a3a26c3@redhat.com>
Date: Fri, 24 Oct 2025 09:43:25 +0200
From: David Hildenbrand <david@...hat.com>
To: Dave Chinner <david@...morbit.com>, Andreas Dilger <adilger@...ger.ca>
Cc: Kiryl Shutsemau <kirill@...temov.name>,
Andrew Morton <akpm@...ux-foundation.org>, Hugh Dickins <hughd@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
<vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
Johannes Weiner <hannes@...xchg.org>, Shakeel Butt <shakeel.butt@...ux.dev>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Darrick J. Wong" <djwong@...nel.org>, linux-mm <linux-mm@...ck.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics

On 24.10.25 08:50, Dave Chinner wrote:
> On Thu, Oct 23, 2025 at 09:48:58AM -0600, Andreas Dilger wrote:
>>> On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@...morbit.com> wrote:
>>> On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
>>>> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
>>>>> In critical paths like truncate, correctness and safety come first.
>>>>> Performance is only a secondary consideration. The overlap of
>>>>> mmap() and truncate() is an area where we have had many, many bugs
>>>>> and, at minimum, the current POSIX behaviour largely shields us from
>>>>> serious stale data exposure events when those bugs (inevitably)
>>>>> occur.
>>>>
>>>> How do you prevent writes via GUP racing with truncate()?
>>>>
>>>> Something like this:
>>>>
>>>>         CPU0                            CPU1
>>>> fd = open("file")
>>>> p = mmap(fd)
>>>> whatever_syscall(p)
>>>>   get_user_pages(p, &page)
>>>>                                         truncate("file");
>>>>   <write to page>
>>>>   put_page(page);
>>>
>>> Forget about truncate, go look at the comment above
>>> writable_file_mapping_allowed() about using GUP this way.
>>>
>>> i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
>>> spent the past 15+ years telling people that it is unfixably broken
>>> and they will crash their kernel or corrupt their data if they do
>>> this.
>>>
>>> This is not supported functionality because real world production
>>> use ends up exposing problems with sync and background writeback
>>> races, truncate races, fallocate() races, writes into holes, writes
>>> into preallocated regions, writes over shared extents that require
>>> copy-on-write, etc, etc, ad nauseam.
>>>
>>> If anyone is using file-backed mappings like this, then when it
>>> breaks they get to keep all the broken pieces to themselves.
>>
>> Should ftruncate("file") return ETXTBSY in this case, so that users
>> and applications know this doesn't work/isn't safe?
>
> No, it is better to block waiting for the GUP to release the
> reference (see below), but the general problem is that we cannot
> reliably discriminate GUP references from other page cache based
> references just by looking at the folio resident in the page cache.

Right. folio_maybe_dma_pinned() can have false positives for small
folios, but also temporarily for large folios (speculative pins from
GUP-fast).

In the future it might get more reliable at least for small folios when
we are able to have a dedicated pincount.

(There is still the issue that some mechanisms that should be using
pin_user_pages() are still using get_user_pages().)
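
The small-folio false positive mentioned above can be illustrated with a
userspace toy model. This is only a sketch, not kernel code: it assumes
that each pin on a small folio adds a fixed bias of 1024 to the ordinary
reference count (mirroring GUP_PIN_COUNTING_BIAS) and that the "maybe
pinned" test is a threshold compare. All model_* names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumption: mirrors the kernel's GUP_PIN_COUNTING_BIAS (1 << 10). */
#define MODEL_PIN_BIAS 1024u

/* Hypothetical stand-in for a small folio; this is not struct folio. */
struct model_small_folio {
	unsigned int refcount;	/* plain references plus one bias per pin */
};

/* Models an ordinary reference (page cache, page table, get_user_pages()). */
static void model_get(struct model_small_folio *f)
{
	assert(f->refcount < UINT_MAX);
	f->refcount += 1;
}

/* Models pin_user_pages() on a small folio: the pin is folded into refcount. */
static void model_pin(struct model_small_folio *f)
{
	assert(f->refcount <= UINT_MAX - MODEL_PIN_BIAS);
	f->refcount += MODEL_PIN_BIAS;
}

/* Threshold test: cannot tell one pin from MODEL_PIN_BIAS plain references. */
static bool model_maybe_dma_pinned(const struct model_small_folio *f)
{
	return f->refcount >= MODEL_PIN_BIAS;
}

/* Reports whether a never-pinned folio with plain_refs references looks pinned. */
static bool model_false_positive_after(unsigned int plain_refs)
{
	struct model_small_folio f = { .refcount = 0 };

	for (unsigned int i = 0; i < plain_refs; i++)
		model_get(&f);
	return model_maybe_dma_pinned(&f);
}
```

In this model, a folio with 1024 plain references and no pins is
reported as maybe-pinned, which is the kind of false positive a
dedicated pincount would avoid.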
>
> However, when FSDAX is being used, truncate does, in fact, block
> waiting for GUP references to be released. fsdax does not use page
> references to track in use pages - the filesystem metadata tracks
> allocated and free pages, not the mm/ subsystem. There are no
> page cache references to the pages, because there is no page
> cache. Hence we can use the difference between the map count and the
> reference count to determine if there are any references we cannot
> forcibly unmap (e.g. GUP) just by looking at the backing store folio
> state.

We can do the same for other folios as well. See folio_expected_ref_count().

Unexpected references can come from GUP, LRU caches, or other temporary
sources such as page migration.

As we document for folio_expected_ref_count(), it's racy for mapped
folios, though: "Calling this function on a mapped folio will not result
in a stable result, because nothing stops additional page table mappings
from coming (e.g., fork()) or going (e.g., munmap())."

It only works reliably on unmapped folios when holding the folio lock.
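
The expected-versus-actual comparison can be sketched as a userspace toy
model. It is a simplification under stated assumptions, not the kernel's
folio_expected_ref_count(): it assumes the expected count is one
reference per page table mapping, one per page while in the page cache,
and one for attached private data. The struct and all model_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for folio state; this is not struct folio. */
struct model_folio {
	int refcount;		/* all references currently held */
	int mapcount;		/* page table mappings */
	int nr_pages;		/* pages in the folio */
	bool in_pagecache;	/* page cache holds one reference per page */
	bool has_private;	/* private data holds one reference */
	bool locked;		/* folio lock held by the caller */
};

/* Assumed accounting: mappings + page cache + private data. */
static int model_expected_ref_count(const struct model_folio *f)
{
	int refs = f->mapcount;

	assert(f->nr_pages >= 1);
	if (f->in_pagecache)
		refs += f->nr_pages;
	if (f->has_private)
		refs += 1;
	return refs;
}

/*
 * Anything beyond the expected count plus the caller's own references is
 * "unexpected": it could be GUP, an LRU cache, migration, and so on.
 */
static bool model_has_unexpected_refs(const struct model_folio *f,
				      int caller_refs)
{
	return f->refcount > model_expected_ref_count(f) + caller_refs;
}

/* The comparison is only trustworthy for a locked, unmapped folio. */
static bool model_result_is_stable(const struct model_folio *f)
{
	return f->locked && f->mapcount == 0;
}
```

A caller in this model would first check model_result_is_stable() and
only then act on model_has_unexpected_refs(); for a mapped folio the
answer can change at any moment as mappings come and go.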
--
Cheers
David / dhildenb