linux-kernel - Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private

Message-ID: <aL7w4qrJtvKE1cu5@casper.infradead.org>
Date: Mon, 8 Sep 2025 16:06:10 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Hugh Dickins <hughd@...gle.com>
Cc: David Hildenbrand <david@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Will Deacon <will@...nel.org>, Shivank Garg <shivankg@....com>,
	Christoph Hellwig <hch@...radead.org>,
	Keir Fraser <keirf@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>,
	John Hubbard <jhubbard@...dia.com>,
	Frederick Mayle <fmayle@...gle.com>, Peter Xu <peterx@...hat.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@...nel.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	Alexander Krabler <Alexander.Krabler@...a.com>,
	Ge Yang <yangge1116@....com>, Li Zhe <lizhe.67@...edance.com>,
	Chris Li <chrisl@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
	Konstantin Khlebnikov <koct9i@...il.com>,
	David Howells <dhowells@...hat.com>, ceph-devel@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2

On Mon, Sep 08, 2025 at 03:27:47AM -0700, Hugh Dickins wrote:
> On Mon, 1 Sep 2025, David Hildenbrand wrote:
> > On 01.09.25 09:52, David Hildenbrand wrote:
> > > On 01.09.25 03:17, Hugh Dickins wrote:
> > >> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> > >>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> > >>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> > >>>> which (like PG_private, but not in addition to PG_private) counts for
> > >>>> 1 more reference: it needs to be using folio_has_private() in place of
> > >>>> folio_test_private().
> > >>>
> > >>> No, it doesn't.  I know it used to, but no filesystem was actually doing
> > >>> that.  So I changed mm to match how filesystems actually worked.
> 
> I think Matthew may be remembering how he wanted it to behave (? but he
> wanted it to go away completely) rather than how it ended up behaving:
> we've both found that PG_private_2 always goes with refcount increment.

Let me explain that better.  No filesystem followed the documented rule
that the refcount must be incremented by one if either PG_private or
PG_private_2 was set.  And no surprise; that's a very complicated rule
for filesystems to follow.  Many of them weren't even following the rule
to increment the refcount by one when PG_private was set.

So some were incrementing the refcount by one if PG_private was set, but
not bumping the refcount by one if PG_private_2 was set (I think this is
how btrfs worked, and you seem to believe the same thing).  Others were
bumping the refcount by two if both PG_private and PG_private_2 were set
(I think this is how netfs works today).
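
To make the three conventions concrete, here is a toy model in plain
userspace C.  It is only a sketch: struct toy_folio and the
extra_refs_*() helpers are made-up names for illustration, not kernel
code, and the mapping of each convention to a filesystem is only as
certain as the "I think" above.

```c
#include <stdbool.h>

/* Toy stand-in for a folio: only the two flags under discussion. */
struct toy_folio {
	bool private_1;		/* models PG_private   */
	bool private_2;		/* models PG_private_2 */
};

/* Documented rule: one extra reference if either flag is set. */
int extra_refs_documented(const struct toy_folio *f)
{
	return (f->private_1 || f->private_2) ? 1 : 0;
}

/* First convention: only PG_private carries a reference. */
int extra_refs_private_only(const struct toy_folio *f)
{
	return f->private_1 ? 1 : 0;
}

/* Second convention: each flag carries its own reference. */
int extra_refs_per_flag(const struct toy_folio *f)
{
	return (f->private_1 ? 1 : 0) + (f->private_2 ? 1 : 0);
}
```

With both flags set, the documented rule expects 1 extra reference and
the per-flag convention holds 2; with only PG_private_2 set, the
documented rule expects 1 and the PG_private-only convention holds 0.
Neither convention matches the documented rule in every case.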

> > > Now, one problem would be if migration / splitting / ... code where we
> > > use folio_expected_ref_count() cannot deal with that additional
> > > reference properly, in which case this patch would indeed cause harm.
> 
> Yes, that appears to be why Matthew said NAK and "dangerously wrong".
> 
> So far as I could tell, there is no problem with nfs, it has, and has
> all along had, the appropriate release_folio and migrate_folio methods.
> 
> ceph used to have what's needed, but 6.0's changes from page_has_private()
> to folio_test_private() (the change from "has" either bit to "test" just
> the one bit really should have been highlighted) broke the migration of
> ceph's PG_private_2 folios.
> 
> (I think it may have got re-enabled in intervening releases: David
> Howells reinstated folio_has_private() inside fallback_migrate_folio()'s
> filemap_release_folio(), which may have been enough to get ceph's
> PG_private_2s migratable again; but then 6.15's ceph .migrate_folio =
> filemap_migrate_folio will have broken it again.)
> 
> Folio migration does not and never has copied over PG_private_2 from
> src to dst; so my 1/7 patch would have permitted migration of a ceph
> PG_private_2 src folio to a dst folio left with refcount 1 more than
> it should be (plus whatever the consequences of migrating such a
> folio which should have waited for the flag to be cleared first).

But that's another problem.  The current meaning of PG_fscache (and that,
too, has changed over the years!) is that the data in the folio is being
written to the fscache.  So we _shouldn't_ migrate the folio, because some
piece of storage hardware is busy reading from the old folio.  And if
somebody else starts writing to the old folio, we'll have a corrupted
fscache.

So the current behaviour, where we set private_2 and bump the refcount
but don't take the private_2 status into account when working out the
expected refcount, is the safe one, because the elevated refcount means
we'll skip the PG_fscache folio.
Maybe it'd be better to wait for it to clear.  But since Dave Howells
is busy killing it off, I'm just inclined to wait for that to happen.
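
Roughly, the reason the elevated refcount makes us skip the folio can be
sketched like this.  This is a simplified userspace model, not the
kernel's implementation: struct mig_folio, mig_expected_refs() and
mig_may_migrate() are hypothetical names, and the model assumes the
would-be migrator holds exactly one reference of its own.

```c
#include <stdbool.h>

/* Toy folio: a refcount plus the state that contributes references. */
struct mig_folio {
	int  refcount;
	bool in_pagecache;	/* page cache holds one reference   */
	bool private_1;		/* models PG_private                */
	bool private_2;		/* models PG_private_2 / PG_fscache */
};

/* References we can account for; private_2 is deliberately ignored. */
int mig_expected_refs(const struct mig_folio *f)
{
	int refs = 0;

	if (f->in_pagecache)
		refs += 1;
	if (f->private_1)
		refs += 1;
	return refs;
}

/* Migrate only if every reference is accounted for. */
bool mig_may_migrate(const struct mig_folio *f)
{
	/* + 1 for the reference held by the would-be migrator */
	return f->refcount == mig_expected_refs(f) + 1;
}
```

A page cache folio with private_2 set and its refcount bumped for it has
one reference more than the model can explain, so mig_may_migrate()
returns false and the folio is left alone while it is being written to
the fscache.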

> I'm just going to drop this 1/7, and add a (briefer than this!)
> paragraph to 2/7 == 1/6's commit message in v2 later today.

Thank you!