Message-ID: <aPshzHsqp2Srau5T@dread.disaster.area>
Date: Fri, 24 Oct 2025 17:50:52 +1100
From: Dave Chinner <david@...morbit.com>
To: Andreas Dilger <adilger@...ger.ca>
Cc: Kiryl Shutsemau <kirill@...temov.name>,
	Andrew Morton <akpm@...ux-foundation.org>,
	David Hildenbrand <david@...hat.com>,
	Hugh Dickins <hughd@...gle.com>,
	Matthew Wilcox <willy@...radead.org>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Michal Hocko <mhocko@...e.com>, Rik van Riel <riel@...riel.com>,
	Harry Yoo <harry.yoo@...cle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>,
	"Darrick J. Wong" <djwong@...nel.org>,
	linux-mm <linux-mm@...ck.org>, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics

On Thu, Oct 23, 2025 at 09:48:58AM -0600, Andreas Dilger wrote:
> > On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@...morbit.com> wrote:
> > On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
> >> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> >>> In critical paths like truncate, correctness and safety come first.
> >>> Performance is only a secondary consideration.  The overlap of
> >>> mmap() and truncate() is an area where we have had many, many bugs
> >>> and, at minimum, the current POSIX behaviour largely shields us from
> >>> serious stale data exposure events when those bugs (inevitably)
> >>> occur.
> >> 
> >> How do you prevent writes via GUP racing with truncate()?
> >> 
> >> Something like this:
> >> 
> >> 	CPU0				CPU1
> >> fd = open("file")
> >> p = mmap(fd)
> >> whatever_syscall(p)
> >>  get_user_pages(p, &page)
> >>  				truncate("file");
> >>  <write to page>
> >>  put_page(page);
> > 
> > Forget about truncate, go look at the comment above
> > writable_file_mapping_allowed() about using GUP this way.
> > 
> > i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
> > spent the past 15+ years telling people that it is unfixably broken
> > and they will crash their kernel or corrupt their data if they do
> > this.
> > 
> > This is not supported functionality because real world production
> > use ends up exposing problems with sync and background writeback
> > races, truncate races, fallocate() races, writes into holes, writes
> > into preallocated regions, writes over shared extents that require
> > copy-on-write, etc, etc, ad nauseam.
> > 
> > If anyone is using filebacked mappings like this, then when it
> > breaks they get to keep all the broken pieces to themselves.
> 
> Should ftruncate("file") return ETXTBSY in this case, so that users
> and applications know this doesn't work/isn't safe?

No, it is better to block waiting for the GUP to release the
reference (see below), but the general problem is that we cannot
reliably discriminate GUP references from other page cache based
references just by looking at the folio resident in the page cache.

However, when FSDAX is being used, truncate does, in fact, block
waiting for GUP references to be released. fsdax does not use page
references to track in use pages - the filesystem metadata tracks
allocated and free pages, not the mm/ subsystem. There are no
page cache references to the pages, because there is no page
cache. Hence we can use the difference between the map count and the
reference count to determine if there are any references we cannot
forcibly unmap (e.g. GUP) just by looking at the backing store folio
state.
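
As a rough user-space illustration of that check (not kernel code: the
struct and helper names below are made up for this sketch, and it
assumes every mapping contributes exactly one reference and that there
are no page cache references at all):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-page state of an fsdax backing page. */
struct model_page {
	int refcount;	/* every reference, including those held by mappings */
	int mapcount;	/* references held by user page table mappings */
};

/* References that unmapping cannot remove, e.g. a GUP pin. */
static int model_extra_refs(const struct model_page *p)
{
	/* Model invariant: each mapping holds one reference. */
	assert(p->refcount >= p->mapcount);
	return p->refcount - p->mapcount;
}

/* A page is "busy" if something other than a mapping still holds it. */
static bool model_page_is_busy(const struct model_page *p)
{
	return model_extra_refs(p) > 0;
}
```

With a page cache in the picture, the cache's own references would also
show up in that difference, which is why the same test cannot single
out GUP references for ordinary page cache folios.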

Hence we can block truncate on non mapcount references via the
layout lease hooks like so:

->setattr
  xfs_vn_setattr
    xfs_break_layouts(BREAK_UNMAP)
      xfs_break_dax_layouts()
        dax_break_layout_inode()
          dax_break_layout()
            page = dax_layout_busy_page_range()
              page = dax_busy_page()
                /* page returned if it is held by GUP */
            wait_page_idle(page)
                /* blocks until extra ref counts go away */

and only when all the non-mapcount page references are gone across
the truncate range is the truncate allowed to proceed.
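
The shape of that loop can be sketched in user space as follows. This
is only a model under stated assumptions: the model_* names are
hypothetical, the wait step is a callback rather than a real sleep, and
the actual kernel functions take different arguments and locks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page state, as in a simplified fsdax model. */
struct model_page {
	int refcount;	/* all references */
	int mapcount;	/* references from page table mappings */
};

/* Hypothetical wait hook; a kernel would sleep here until refs drop. */
typedef void (*model_wait_fn)(struct model_page *page);

/* Return the first page in [start, end) with non-mapcount refs, or NULL. */
static struct model_page *model_busy_page_range(struct model_page *pages,
						size_t start, size_t end)
{
	assert(start <= end);
	for (size_t i = start; i < end; i++)
		if (pages[i].refcount > pages[i].mapcount)
			return &pages[i];
	return NULL;
}

/*
 * Keep waiting until no page in the range is busy; only then may the
 * caller go on to truncate. Returns how many waits were needed. This
 * loops forever if wait_idle() never lets the extra refs go away.
 */
static int model_break_layout(struct model_page *pages, size_t start,
			      size_t end, model_wait_fn wait_idle)
{
	struct model_page *page;
	int waits = 0;

	while ((page = model_busy_page_range(pages, start, end)) != NULL) {
		wait_idle(page);
		waits++;
	}
	return waits;
}

/* Example hook: pretend the pin holder drops one reference per wait. */
static void model_release_one_ref(struct model_page *page)
{
	page->refcount--;
}
```

The rescan after every wait matters in the model for the same reason it
does in practice: another page in the range may have become busy while
the caller was waiting on the first one.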

IIRC, we decided to block truncate and other operations that need
backing store access exclusion rather than return an error because
nobody expects operations like truncate to randomly fail like this.
Such behaviour would likely break applications in unexpected ways,
so it was decided to play it safe and block until the ref goes away.

This is one of the reasons for FOLL_LONGTERM being added - we
can't allow longterm pinning of file-backed fsdax pages (e.g. RDMA
regions using file-backed mappings) because then operations like
truncate can be blocked for hours/days/weeks. This situation is
checked via vma_is_fsdax() in mm/gup.c::check_vma_flags()...
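
A minimal user-space model of that kind of check might look like the
sketch below. The flag value, the struct and the choice of -EOPNOTSUPP
as the error are assumptions made for illustration; see
check_vma_flags() itself for what the kernel really does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; not the kernel's FOLL_LONGTERM value. */
#define MODEL_FOLL_LONGTERM	0x1u

/* Hypothetical stand-in for the part of a VMA this check looks at. */
struct model_vma {
	bool is_fsdax;	/* what vma_is_fsdax() would report */
};

/*
 * Refuse long-term pins on fsdax mappings up front, so that a later
 * truncate can never end up waiting on a pin that lasts for days.
 * Short-term pins and non-fsdax mappings pass this particular check.
 */
static int model_check_vma_flags(const struct model_vma *vma,
				 unsigned int gup_flags)
{
	assert(vma != NULL);
	if ((gup_flags & MODEL_FOLL_LONGTERM) && vma->is_fsdax)
		return -EOPNOTSUPP;	/* assumed error code */
	return 0;
}
```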

> Unfortunately,
> today's application developers barely even know how IO is done, so
> there is little chance that they would understand subtleties like this.

I think that even the experienced developers who know how to do IO
struggle to understand this sort of thing. Most kernel developers
run screaming from GUP before it drives them insane, too. :/

-Dave.
-- 
Dave Chinner
david@...morbit.com
