Message-ID: <9c1450ba-ade4-4236-8d3e-c5658a3a26c3@redhat.com>
Date: Fri, 24 Oct 2025 09:43:25 +0200
From: David Hildenbrand <david@...hat.com>
To: Dave Chinner <david@...morbit.com>, Andreas Dilger <adilger@...ger.ca>
Cc: Kiryl Shutsemau <kirill@...temov.name>,
 Andrew Morton <akpm@...ux-foundation.org>, Hugh Dickins <hughd@...gle.com>,
 Matthew Wilcox <willy@...radead.org>,
 Alexander Viro <viro@...iv.linux.org.uk>,
 Christian Brauner <brauner@...nel.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
 <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
 Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
 Johannes Weiner <hannes@...xchg.org>, Shakeel Butt <shakeel.butt@...ux.dev>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 "Darrick J. Wong" <djwong@...nel.org>, linux-mm <linux-mm@...ck.org>,
 linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics

On 24.10.25 08:50, Dave Chinner wrote:
> On Thu, Oct 23, 2025 at 09:48:58AM -0600, Andreas Dilger wrote:
>>> On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@...morbit.com> wrote:
>>> On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
>>>> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
>>>>> In critical paths like truncate, correctness and safety come first.
>>>>> Performance is only a secondary consideration.  The overlap of
>>>>> mmap() and truncate() is an area where we have had many, many bugs
>>>>> and, at minimum, the current POSIX behaviour largely shields us from
>>>>> serious stale data exposure events when those bugs (inevitably)
>>>>> occur.
>>>>
>>>> How do you prevent writes via GUP racing with truncate()?
>>>>
>>>> Something like this:
>>>>
>>>> 	CPU0				CPU1
>>>> fd = open("file")
>>>> p = mmap(fd)
>>>> whatever_syscall(p)
>>>>   get_user_pages(p, &page)
>>>>   				truncate("file");
>>>>   <write to page>
>>>>   put_page(page);
>>>
>>> Forget about truncate, go look at the comment above
>>> writable_file_mapping_allowed() about using GUP this way.
>>>
>>> i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
>>> spent the past 15+ years telling people that it is unfixably broken
>>> and they will crash their kernel or corrupt their data if they do
>>> this.
>>>
>>> This is not supported functionality because real world production
>>> use ends up exposing problems with sync and background writeback
>>> races, truncate races, fallocate() races, writes into holes, writes
>>> into preallocated regions, writes over shared extents that require
>>> copy-on-write, etc, etc, ad nauseam.
>>>
>>> If anyone is using file-backed mappings like this, then when it
>>> breaks they get to keep all the broken pieces to themselves.
>>
>> Should ftruncate("file") return ETXTBUSY in this case, so that users
>> and applications know this doesn't work/isn't safe?
> 
> No, it is better to block waiting for the GUP to release the
> reference (see below), but the general problem is that we cannot
> reliably discriminate GUP references from other page cache based
> references just by looking at the folio resident in the page cache.

Right. folio_maybe_dma_pinned() can have false positives for small 
folios, but also temporarily for large folios (speculative pins from 
GUP-fast).

In the future it might get more reliable at least for small folios when 
we are able to have a dedicated pincount.

(There is still the issue that some mechanisms that should be using
pin_user_pages() still use get_user_pages().)
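
To make the small-folio false positive concrete, here is a rough userspace
toy model, not kernel code. The struct and the toy_*() helpers are made up
for illustration; only the bias value (1024) and the shape of the check are
meant to mirror what folio_maybe_dma_pinned() does today. Locking, atomics
and the speculative GUP-fast path are all ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
#define GUP_PIN_COUNTING_BIAS (1U << 10)	/* 1024, as in the kernel */

struct toy_folio {
	bool large;		/* large folios have a dedicated pincount */
	unsigned int refcount;
	unsigned int pincount;	/* only meaningful when large is true */
};

/* get_user_pages()-style reference: a plain +1 on the refcount. */
static inline void toy_get(struct toy_folio *f)
{
	f->refcount++;
}

/* pin_user_pages()-style pin: small folios encode it in the refcount. */
static inline void toy_pin(struct toy_folio *f)
{
	if (f->large) {
		f->refcount++;
		f->pincount++;
	} else {
		f->refcount += GUP_PIN_COUNTING_BIAS;
	}
}

static inline void toy_unpin(struct toy_folio *f)
{
	if (f->large) {
		assert(f->pincount > 0);
		f->pincount--;
		f->refcount--;
	} else {
		assert(f->refcount >= GUP_PIN_COUNTING_BIAS);
		f->refcount -= GUP_PIN_COUNTING_BIAS;
	}
}

/* Same shape as the folio_maybe_dma_pinned() check. */
static inline bool toy_maybe_dma_pinned(const struct toy_folio *f)
{
	if (f->large)
		return f->pincount > 0;
	return f->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model a small folio with 1024 plain references and no pin at all
reports "maybe pinned", while a large folio with the same number of plain
references does not, because its pins are counted separately.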

> 
> However, when FSDAX is being used, truncate does, in fact, block
> waiting for GUP references to be released. fsdax does not use page
> references to track in use pages - the filesystem metadata tracks
> allocated and free pages, not the mm/ subsystem. There are no
> page cache references to the pages, because there is no page
> cache. Hence we can use the difference between the map count and the
> reference count to determine if there are any references we cannot
> forcibly unmap (e.g. GUP) just by looking at the backing store folio
> state.

We can do the same for other folios as well. See folio_expected_ref_count().

Unexpected references can be from GUP, lru caches or other temporary 
ones from page migration etc.

As we document for folio_expected_ref_count() it's racy for mapped 
folios, though: "Calling this function on a mapped folio will not result 
in a stable result, because nothing stops additional page table mappings 
from coming (e.g.,fork()) or going (e.g., munmap())."

It only works reliably on unmapped folios when holding the folio lock.
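
As a sketch of that accounting for a file-backed folio, again a userspace
toy and not the kernel implementation: the struct, its fields and both
toy_*() helpers are hypothetical, and the model is only loosely based on
what folio_expected_ref_count() counts (one reference per page from the
page cache, one for private data, one per page table mapping). It takes a
single snapshot, so it says nothing about the races described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All names here are hypothetical. */
struct toy_file_folio {
	unsigned int nr_pages;	/* 1 for a small folio */
	bool in_pagecache;	/* folio is attached to a mapping */
	bool has_private;	/* filesystem private data attached */
	unsigned int mapcount;	/* number of page table mappings */
	unsigned int refcount;	/* snapshot of the reference count */
};

/* References we can explain from the folio state alone. */
static inline unsigned int toy_expected_refs(const struct toy_file_folio *f)
{
	unsigned int refs = 0;

	if (f->in_pagecache)
		refs += f->nr_pages;	/* one per page from the page cache */
	if (f->has_private)
		refs += 1;		/* one for the private data */
	return refs + f->mapcount;	/* one per page table mapping */
}

/*
 * Whatever is left over is "unexpected": GUP, LRU caches, migration, ...
 * caller_refs is the number of references the caller itself holds.
 */
static inline unsigned int
toy_unexpected_refs(const struct toy_file_folio *f, unsigned int caller_refs)
{
	unsigned int expected = toy_expected_refs(f) + caller_refs;

	assert(f->refcount >= expected);
	return f->refcount - expected;
}
```

A non-zero result only says that somebody else holds a reference. It cannot
tell a long-term GUP reference from a short-lived one, and on a mapped folio
the mapcount used here can already be stale by the time it is compared.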


-- 
Cheers

David / dhildenb

