Message-ID: <20260130063403.GB863940@zen.localdomain>
Date: Thu, 29 Jan 2026 22:34:03 -0800
From: Boris Burkov <boris@....io>
To: Qu Wenruo <quwenruo.btrfs@....com>
Cc: JP Kobryn <inwardvessel@...il.com>, clm@...com, dsterba@...e.com,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
kernel-team@...a.com,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: [RFC PATCH] btrfs: defer freeing of subpage private state to free_folio

On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>
>
> On 2026/1/30 09:38, JP Kobryn wrote:
> [...]
> > The patch also might have the advantage of being easy to backport to the
> > LTS trees. On that note, it's worth mentioning that we encountered a kernel
> > panic as a result of this sequence on a 6.16-based arm64 host (configured
> > with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
> > window is shown below between points A and B:
> >
> > [mm] page cache reclaim path        [fs] relocation in subpage mode
> > shrink_folio_list()
> >   folio_trylock() /* lock acquired */
> >   filemap_release_folio()
> >     mapping->a_ops->release_folio()
> >       btrfs_release_folio()
> >         __btrfs_release_folio()
> >           clear_folio_extent_mapped()
> >             btrfs_detach_folio_state()
> >               bfs = folio_detach_private(folio)
> >               btrfs_free_folio_state(folio)
> >                 kfree(bfs) /* point A */
> >
> >                                     prealloc_file_extent_cluster()
> >                                       filemap_lock_folio()
>
> Would you mind explaining which function is calling filemap_lock_folio()?
>
> I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
> filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>
I think you may have missed it in the diagram, and some of the function
names may have shifted a bit between kernels, but it is relocation.
On current btrfs/for-next, I think it would be:
relocate_file_extent_cluster()
  relocate_one_folio()
    filemap_lock_folio()
> >                                         folio_try_get() /* inc refcount */
> >                                         folio_lock() /* wait for lock */
>
>
> Another question here: since the folio has already been released in the mm
> path, the folio should not have the dirty flag set.
>
> That means inside extent_write_cache_pages(), the folio_test_dirty() should
> return false, and we should just unlock the folio without touching it
> anymore.
>
> Would you mind explaining why we still continue the writeback of a non-dirty folio?
>
I think this question is answered by the above as well: we aren't in
writeback, we are in relocation.

Thanks,
Boris
> >
> >   __remove_mapping()
> >     if (!folio_ref_freeze(folio, refcount)) /* point B */
> >       goto cannot_free /* folio remains in cache */
> >
> >   folio_unlock(folio) /* lock released */
> >
> >                                     /* lock acquired */
> >                                     btrfs_subpage_clear_uptodate()
>
> Would you mind providing more context on where the
> btrfs_subpage_clear_uptodate() call comes from?
>
> >                                       bfs = folio->priv /* use-after-free */
> >
> > This exact race during relocation should not occur in the latest upstream
> > code, but it's an example of a backport opportunity for this patch.
>
> Would you also mind explaining what is missing in the 6.16 kernel that
> causes the above use-after-free?
>
> >
> > Signed-off-by: JP Kobryn <inwardvessel@...il.com>
> > ---
> > fs/btrfs/extent_io.c | 6 ++++--
> > fs/btrfs/inode.c | 18 ++++++++++++++++++
> > 2 files changed, 22 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 3df399dc8856..d83d3f9ae3af 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
> >  		return;
> >  	fs_info = folio_to_fs_info(folio);
> > -	if (btrfs_is_subpage(fs_info, folio))
> > -		return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
> > +	if (btrfs_is_subpage(fs_info, folio)) {
> > +		/* freeing of private subpage data is deferred to btrfs_free_folio */
> > +		return;
> > +	}
>
> Another question is why only two filesystems (nfs for directory inodes, and
> orangefs) use the free_folio() callback.
>
> Iomap does the same as btrfs and only calls ifs_free() in
> release_folio() and invalidate_folio().
>
> Thus it looks like the free_folio() callback is not the recommended way to
> free the folio->private pointer.
>
> Cc'ing the fsdevel list to ask whether the free_folio() callback should have
> new callers.
>
> > folio_detach_private(folio);
>
> This means that for the regular folio case, we still remove the private flag
> from the folio.
>
> It may be fine in most cases, as we will not touch folio->private anyway,
> but this still looks like inconsistent behavior, especially since the
> free_folio() callback handles both cases.
>
> Thanks,
> Qu
>
> > }
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index b8abfe7439a3..7a832ee3b591 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7565,6 +7565,23 @@ static bool btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
> >  	return __btrfs_release_folio(folio, gfp_flags);
> >  }
> > +/* frees subpage private data if present */
> > +static void btrfs_free_folio(struct folio *folio)
> > +{
> > +	struct btrfs_folio_state *bfs;
> > +
> > +	if (!folio_test_private(folio))
> > +		return;
> > +
> > +	bfs = folio_detach_private(folio);
> > +	if (bfs == (void *)EXTENT_FOLIO_PRIVATE) {
> > +		/* extent map flag is detached in btrfs_folio_release */
> > +		return;
> > +	}
> > +
> > +	btrfs_free_folio_state(bfs);
> > +}
> > +
> > #ifdef CONFIG_MIGRATION
> > static int btrfs_migrate_folio(struct address_space *mapping,
> > struct folio *dst, struct folio *src,
> > @@ -10651,6 +10668,7 @@ static const struct address_space_operations btrfs_aops = {
> >  	.invalidate_folio = btrfs_invalidate_folio,
> >  	.launder_folio = btrfs_launder_folio,
> >  	.release_folio = btrfs_release_folio,
> > +	.free_folio = btrfs_free_folio,
> >  	.migrate_folio = btrfs_migrate_folio,
> >  	.dirty_folio = filemap_dirty_folio,
> >  	.error_remove_folio = generic_error_remove_folio,
>
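
To illustrate the ordering discussed in this thread, here is a minimal userspace C sketch. It is not kernel code, and every toy_* name is hypothetical. It assumes a toy folio with a reference count and a flag standing in for the subpage private state. It models only two things: release_folio() runs before the refcount freeze in __remove_mapping(), and free_folio() runs only once the folio has actually left the page cache. Locking, the reference dropped by folio_detach_private(), and the invalidate paths are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the toy private state gets freed (hypothetical names). */
enum toy_free_policy {
	TOY_FREE_IN_RELEASE,	/* models freeing from release_folio() */
	TOY_FREE_IN_FREE_FOLIO	/* models deferring to free_folio() */
};

/* Toy stand-in for a folio; NOT the kernel's struct folio. */
struct toy_folio {
	int refcount;		/* references currently held */
	bool in_cache;		/* still visible in the page cache */
	bool state_alive;	/* private state not yet freed */
};

/* Models folio_ref_freeze(): succeeds only if no extra reference exists. */
static bool toy_ref_freeze(struct toy_folio *f, int expected)
{
	assert(f->refcount > 0);
	if (f->refcount != expected)
		return false;
	f->refcount = 0;
	return true;
}

/* Reclaim step 1, models release_folio() (point A when freeing eagerly). */
static void toy_release(struct toy_folio *f, enum toy_free_policy p)
{
	if (p == TOY_FREE_IN_RELEASE)
		f->state_alive = false;
}

/* Reclaim step 2, models __remove_mapping() (point B is the freeze). */
static bool toy_remove_mapping(struct toy_folio *f, enum toy_free_policy p,
			       int expected)
{
	if (!toy_ref_freeze(f, expected))
		return false;		/* folio remains in cache */
	f->in_cache = false;
	if (p == TOY_FREE_IN_FREE_FOLIO)
		f->state_alive = false;	/* deferred free happens only here */
	return true;
}

/*
 * Replays the interleaving: a second user takes a reference between
 * points A and B. Returns true if that user would then find a cached
 * folio whose private state is already gone.
 */
bool toy_race_hits_freed_state(enum toy_free_policy p)
{
	struct toy_folio f = { .refcount = 1, .in_cache = true, .state_alive = true };
	int expected = f.refcount;
	bool removed;

	toy_release(&f, p);				/* reclaim: point A */
	f.refcount++;					/* second user: folio_try_get() */
	removed = toy_remove_mapping(&f, p, expected);	/* reclaim: point B */

	return !removed && f.in_cache && !f.state_alive;
}

/* Without a competing reference, both policies end with the state freed. */
bool toy_uncontended_reclaim_frees_state(enum toy_free_policy p)
{
	struct toy_folio f = { .refcount = 1, .in_cache = true, .state_alive = true };

	toy_release(&f, p);
	return toy_remove_mapping(&f, p, 1) && !f.in_cache && !f.state_alive;
}
```

In this model, toy_race_hits_freed_state(TOY_FREE_IN_RELEASE) returns true and toy_race_hits_freed_state(TOY_FREE_IN_FREE_FOLIO) returns false, while an uncontended reclaim frees the state under either policy. Whether deferring to free_folio() is the right fix for btrfs is the question still open in the review above.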