Message-ID: <3973ecd7-d99c-6d38-7b53-2f3fca57b48d@google.com>
Date: Mon, 8 Sep 2025 03:27:47 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: David Hildenbrand <david@...hat.com>
cc: Hugh Dickins <hughd@...gle.com>, Matthew Wilcox <willy@...radead.org>, 
    Andrew Morton <akpm@...ux-foundation.org>, Will Deacon <will@...nel.org>, 
    Shivank Garg <shivankg@....com>, Christoph Hellwig <hch@...radead.org>, 
    Keir Fraser <keirf@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>, 
    John Hubbard <jhubbard@...dia.com>, Frederick Mayle <fmayle@...gle.com>, 
    Peter Xu <peterx@...hat.com>, "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>, 
    Johannes Weiner <hannes@...xchg.org>, Vlastimil Babka <vbabka@...e.cz>, 
    Alexander Krabler <Alexander.Krabler@...a.com>, 
    Ge Yang <yangge1116@....com>, Li Zhe <lizhe.67@...edance.com>, 
    Chris Li <chrisl@...nel.org>, Yu Zhao <yuzhao@...gle.com>, 
    Axel Rasmussen <axelrasmussen@...gle.com>, 
    Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>, 
    Konstantin Khlebnikov <koct9i@...il.com>, 
    David Howells <dhowells@...hat.com>, ceph-devel@...r.kernel.org, 
    linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when
 PG_private_2

On Mon, 1 Sep 2025, David Hildenbrand wrote:
> On 01.09.25 09:52, David Hildenbrand wrote:
> > On 01.09.25 03:17, Hugh Dickins wrote:
> >> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> >>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> >>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> >>>> which (like PG_private, but not in addition to PG_private) counts for
> >>>> 1 more reference: it needs to be using folio_has_private() in place of
> >>>> folio_test_private().
> >>>
> >>> No, it doesn't.  I know it used to, but no filesystem was actually doing
> >>> that.  So I changed mm to match how filesystems actually worked.

I think Matthew may be remembering how he wanted it to behave (? but he
wanted it to go away completely) rather than how it ended up behaving:
we've both found that PG_private_2 always goes with refcount increment.

(Always? Well, until 6.13, btrfs used PG_private_2 without any such
increment: that's gone, so now it's consistently with refcount increment.)

Confusing, given that David Howells removed the deprecated use of
PG_private_2 and later reverted the removal: I've not looked up in which
releases those came and went, but the removal was reverted in stable trees
too, so the story is the same there; and maybe some of Matthew's mods
landed between the removal and the revert.

> >>> I'm not sure if there's still documentation lying around that gets
> >>> this wrong or if you're remembering how things used to be documented,
> >>> but it's never how any filesystem has ever worked.

Not how btrfs used to work, but it is how ceph and nfs work.

> >>>
> >>> We're achingly close to getting rid of PG_private_2.  I think it's just
> >>> ceph and nfs that still use it.
> >>
> >> I knew you were trying to get rid of it (hurrah! thank you), so when I
> >> tried porting my lru_add_drainage to 6.12 I was careful to check whether
> >> folio_expected_ref_count() would need to add it to the accounting there:
> >> apparently yes; but then I was surprised to find that it's still present
> >> in 6.17-rc, I'd assumed it gone long ago.
> >>
> >> I didn't try to read the filesystems (which could easily have been
> >> inconsistent about it) to understand: what convinced me amidst all
> >> the confusion was this comment and code in mm/filemap.c:
> >>
> >> /**
> >>  * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
> >>  * @folio: The folio.
> >>  *
> >>  * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
> >>  * it.  The folio reference held for PG_private_2 being set is released.
> >>  *
> >>  * This is, for example, used when a netfs folio is being written to a local
> >>  * disk cache, thereby allowing writes to the cache for the same folio to be
> >>  * serialised.
> >>  */
> >> void folio_end_private_2(struct folio *folio)
> >> {
> >>  VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
> >>  clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> >>  folio_wake_bit(folio, PG_private_2);
> >>  folio_put(folio);
> >> }
> >> EXPORT_SYMBOL(folio_end_private_2);
> >>
> >> That seems to be clear that PG_private_2 is matched by a folio reference,
> >> but perhaps you can explain it away - worth changing the comment if so.
> >>
> >> I was also anxious to work out whether PG_private with PG_private_2
> >> would mean +1 or +2: I don't think I found any decisive statement,
> >> but traditional use of page_has_private() implied +1; and I expect
> >> there's no filesystem which actually could have both on the same folio.
> > 
> > I think it's "+1", like we used to have.

I've given up worrying about that.  I'm inclined to think it's +2,
since there's no test_private when incrementing and decrementing
for private_2; but I don't need to care any more.
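
To make the +1 versus +2 question concrete, here is a toy userspace
model in C. It is not kernel code: every toy_ name is hypothetical, and
it assumes that PG_private is set together with one reference (as
folio_attach_private() does) and that PG_private_2 is set together with
one reference (as folio_start_private_2()/folio_end_private_2() do),
neither looking at the other flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; the struct and all toy_ helpers are hypothetical. */
struct toy_folio {
	int refcount;
	bool private;    /* stands in for PG_private */
	bool private_2;  /* stands in for PG_private_2 */
};

/* Assumed behaviour: attaching private data takes one reference. */
static void toy_attach_private(struct toy_folio *f)
{
	f->refcount++;
	f->private = true;
}

/* Flag and reference are taken together, with no test of PG_private. */
static void toy_start_private_2(struct toy_folio *f)
{
	assert(!f->private_2);
	f->refcount++;
	f->private_2 = true;
}

/* Flag and reference are dropped together, as in folio_end_private_2(). */
static void toy_end_private_2(struct toy_folio *f)
{
	assert(f->private_2);
	f->private_2 = false;
	f->refcount--;
}

/* "+1" reading: either flag accounts for a single extra reference. */
static int toy_extra_refs_has_private(const struct toy_folio *f)
{
	return (f->private || f->private_2) ? 1 : 0;
}

/* "+2" reading: each flag accounts for its own reference. */
static int toy_extra_refs_per_flag(const struct toy_folio *f)
{
	return (f->private ? 1 : 0) + (f->private_2 ? 1 : 0);
}
```

Under those assumptions, a folio that somehow carried both flags would
hold two extra references, which only the per-flag count matches; with
at most one flag set, the two readings agree.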

> > 
> > I was seriously confused when discovering (iow, concerned about false
> > positives):
> > 
> >  PG_fscache = PG_private_2,
> > 
> > But in the end PG_fscache is only used in comments and e.g.,
> > __fscache_clear_page_bits() calls folio_end_private_2(). So both are
> > really just aliases.
> > 
> > [Either PG_fscache should be dropped and referred to as PG_private_2, or
> > PG_private_2 should be dropped and PG_fscache used instead. It's even
> > inconsistently used in that fscache. file.
> > 
> > Or both should be dropped, of course, once we can actually get rid of it
> > ...]
> > 
> > So PG_private_2 should not be used for any other purpose.

Yes, ghastly the hiding of one behind the other; that, and the
PageFlags versus folio_flags, made it all tiresome to track down.

I have considered adding PG_Spanish_Inquisition = PG_private_2
since folio_expected_ref_count() ignoring PG_private_2 implies that
no-one expects the PG_private_2.

> > 
> > folio_start_private_2() / folio_end_private_2() indeed pair the flag
> > with a reference. There are no other callers that would set/clear the
> > flag without involving a reference.
> > 
> > The usage of private_2 is declared deprecated all over the place. So the
> > question is if we really still care.
> > 
> > The ceph usage is guarded by CONFIG_CEPH_FSCACHE, the NFS one by
> > NFS_FSCACHE, nothing really seems to prevent it from getting configured
> > in easily.
> > 
> > Now, one problem would be if migration / splitting / ... code where we
> > use folio_expected_ref_count() cannot deal with that additional
> > reference properly, in which case this patch would indeed cause harm.

Yes, that appears to be why Matthew said NAK and "dangerously wrong".

So far as I could tell, there is no problem with nfs, it has, and has
all along had, the appropriate release_folio and migrate_folio methods.

ceph used to have what's needed, but 6.0's changes from page_has_private()
to folio_test_private() (the change from "has" either bit to "test" just
the one bit really should have been highlighted) broke the migration of
ceph's PG_private_2 folios.

(I think it may have got re-enabled in intervening releases: David
Howells reinstated folio_has_private() inside fallback_migrate_folio()'s
filemap_release_folio(), which may have been enough to get ceph's
PG_private_2s migratable again; but then 6.15's ceph .migrate_folio =
filemap_migrate_folio will have broken it again.)

Folio migration does not and never has copied over PG_private_2 from
src to dst; so my 1/7 patch would have permitted migration of a ceph
PG_private_2 src folio to a dst folio left with refcount 1 more than
it should be (plus whatever the consequences of migrating such a
folio which should have waited for the flag to be cleared first).
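
The effect of "test" versus "has" on the refcount check can be sketched
with a toy userspace model in C. It is not kernel code: the toy_ names
are hypothetical, and the single base reference is a simplifying
assumption. The model stops at the check itself; it does not attempt to
reproduce what happens to the flag or the reference on the dst folio.

```c
#include <stdbool.h>

/* Toy model only; the struct and all toy_ helpers are hypothetical. */
struct toy_folio {
	int refcount;
	bool private;    /* stands in for PG_private */
	bool private_2;  /* stands in for PG_private_2 */
};

/* Expected count that looks at PG_private only ("test"). */
static int toy_expected_test_private(const struct toy_folio *f)
{
	return 1 + (f->private ? 1 : 0);	/* 1 = assumed base reference */
}

/* Expected count that looks at either flag ("has"). */
static int toy_expected_has_private(const struct toy_folio *f)
{
	return 1 + ((f->private || f->private_2) ? 1 : 0);
}

/* Migration is only attempted when the refcount matches the expectation. */
static bool toy_may_migrate(const struct toy_folio *f, int expected)
{
	return f->refcount == expected;
}
```

For a folio with refcount 2 and only the PG_private_2 stand-in set, the
"test" expectation is 1, so the check refuses migration; the "has"
expectation is 2, so the check passes and migration would go ahead even
though nothing carries PG_private_2 over to dst.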

Earlier, I did intend to add protection against PG_private_2 into
folio_migrate_mapping() and/or whatever else needs it in mm/migrate.c,
as part of the 1/7 patch; and later submit a ceph patch to give it
back the release_folio wait on PG_private_2 it wants.

But (a) I ran out of steam, and (b) I couldn't test it or advise
ceph folks how to test it, and (c) guessed that Matthew would hate
me populating the codebase with further references to PG_private_2,
and (d) realized that this PG_private_2 thing is a transient
condition (more like writeback than private) which probably nobody
cares too much about (its lack of migration has gone unnoticed).

I'm just going to drop this 1/7, and add a (briefer than this!)
paragraph to 2/7 == 1/6's commit message in v2 later today.

> > 
> > If all folio_expected_ref_count() callers can deal with updating that
> > reference, all good.
> > 
> > nfs_migrate_folio(), for example, has folio_test_private_2() handling in
> > there (just wait until it is gone). ceph handles it during
> > ceph_writepages_start(), but uses ordinary filemap_migrate_folio.
> > 
> > Long story short: this patch is problematic if one of the
> > folio_expected_ref_count() users is not aware of how to handle that
> > additional reference.
> > 
> 
> Case in point, I just stumbled over
> 
> commit 682a71a1b6b363bff71440f4eca6498f827a839d
> Author: Matthew Wilcox (Oracle) <willy@...radead.org>
> Date:   Fri Sep 2 20:46:46 2022 +0100
> 
>     migrate: convert __unmap_and_move() to use folios
> 
> and
> 
> commit 8faa8ef5dd11abe119ad0c8ccd39f2064ca7ed0e
> Author: Matthew Wilcox (Oracle) <willy@...radead.org>
> Date:   Mon Jun 6 09:34:36 2022 -0400
> 
>     mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
>     
>     Use a folio throughout.  migrate_page() will be converted to
>     migrate_folio() later.
> 
> 
> where we converted from page_has_private() to folio_test_private(). Maybe
> that's all sane, but it raises the question if migration (and maybe splitting)
> as a whole is now incompatible with PG_private_2

The commit I blamed in my notes was 108ca835, I think that's the one
that changes "has" to "test" in the "expected" calculation; but yes,
8faa8ef5 is significant for skipping the call to folio_release.

Hugh
