linux-kernel - Re: [RFC PATCH] btrfs: defer freeing of subpage private state to free_folio

Message-ID: <84623da6-3248-437d-9f01-e3fe57e282db@gmx.com>
Date: Fri, 30 Jan 2026 17:59:01 +1030
From: Qu Wenruo <quwenruo.btrfs@....com>
To: Boris Burkov <boris@....io>
Cc: JP Kobryn <inwardvessel@...il.com>, clm@...com, dsterba@...e.com,
 linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
 kernel-team@...a.com,
 "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: [RFC PATCH] btrfs: defer freeing of subpage private state to
 free_folio



On 2026/1/30 17:04, Boris Burkov wrote:
> On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2026/1/30 09:38, JP Kobryn wrote:
>> [...]
>>> The patch also might have the advantage of being easy to backport to the
>>> LTS trees. On that note, it's worth mentioning that we encountered a kernel
>>> panic as a result of this sequence on a 6.16-based arm64 host (configured
>>> with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
>>> window is shown below between points A and B:
>>>
>>> [mm] page cache reclaim path        [fs] relocation in subpage mode
>>> shrink_folio_list()
>>>     folio_trylock() /* lock acquired */
>>>     filemap_release_folio()
>>>       mapping->a_ops->release_folio()
>>>         btrfs_release_folio()
>>>           __btrfs_release_folio()
>>>             clear_folio_extent_mapped()
>>>               btrfs_detach_folio_state()
>>>                 bfs = folio_detach_private(folio)
>>>                 btrfs_free_folio_state(folio)
>>>                   kfree(bfs) /* point A */
>>>
>>>                                      prealloc_file_extent_cluster()
>>>                                        filemap_lock_folio()
>>
>> Could you explain which function is calling filemap_lock_folio()?
>>
>> I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
>> filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().
>>
> 
> I think you may have missed it in the diagram, and some of the function
> names may have shifted a bit between kernels, but it is relocation.
> 
> On current btrfs/for-next, I think it would be:
> 
> relocate_file_extent_cluster()
>    relocate_one_folio()
>      filemap_lock_folio()

Thanks. Indeed, the filemap_lock_folio() call inside
prealloc_file_extent_cluster() only exists in the v6.16 code base, which
manually invalidates partial folios.

That code is gone now, replaced by a much healthier solution.

> 
>>>                                          folio_try_get() /* inc refcount */
>>>                                          folio_lock() /* wait for lock */
>>
>>
>> Another question: since the folio has already been released in the mm
>> path, it should not have the dirty flag set.
>>
>> That means inside extent_write_cache_pages(), folio_test_dirty() should
>> return false, and we should just unlock the folio without touching it
>> any further.
>>
>> Could you explain why we still continue the writeback of a non-dirty folio?
>>
> 
> I think this question is answered by the above as well: we aren't in
> writeback, we are in relocation.

I see the problem now. Thankfully, commit 4e346baee95f ("btrfs: reloc:
unconditionally invalidate the page cache for each cluster") fixes the
behavior.

And yes, the old code can indeed hit the problem.

Still, commit 4e346baee95f ("btrfs: reloc: unconditionally invalidate
the page cache for each cluster") itself shouldn't be that hard to
backport.
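
To make the old interleaving concrete, here is a minimal single-threaded
userspace sketch of points A and B from the diagram. It is only a model:
struct model_folio, struct model_state and the model_*() helpers are
hypothetical stand-ins, not kernel code, and the refcount accounting is
simplified to one page cache reference plus one per extra holder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_state {
	bool freed;                   /* set where the kernel would kfree() */
};

struct model_folio {
	int refcount;                 /* 1 for the page cache, +1 per extra holder */
	bool locked;
	bool in_cache;
	struct model_state *private;  /* models the folio private pointer */
};

/* Models the old release_folio(): detach and free the state (point A). */
static void model_release_folio(struct model_folio *f)
{
	assert(f->locked);
	f->private->freed = true;
	f->private = NULL;
}

/* Models the refcount freeze (point B): succeeds only with no extra refs. */
static bool model_ref_freeze(struct model_folio *f, int expected)
{
	if (f->refcount != expected)
		return false;
	f->refcount = 0;
	return true;
}

/*
 * Replays the interleaving from the diagram in a single thread.
 * Returns true if relocation ends up holding a folio that is still in
 * the cache but whose private state has already been freed.
 */
bool model_replay_race(void)
{
	struct model_state st = { .freed = false };
	struct model_folio f = {
		.refcount = 1, .in_cache = true, .private = &st,
	};
	bool hit;

	f.locked = true;              /* reclaim: folio_trylock() */
	model_release_folio(&f);      /* reclaim: point A */
	f.refcount++;                 /* relocation: folio_try_get() */
	if (model_ref_freeze(&f, 1))  /* reclaim: point B, fails here */
		f.in_cache = false;
	f.locked = false;             /* reclaim: folio_unlock() */

	f.locked = true;              /* relocation: lock acquired */
	hit = f.in_cache && st.freed; /* state is gone, folio is not */
	f.locked = false;
	f.refcount--;
	return hit;
}
```

In this model the extra reference taken between points A and B makes the
freeze fail, so the folio stays cached while its state has already been
freed, and model_replay_race() reports the problematic outcome.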

Thanks,
Qu

> 
> Thanks,
> Boris
> 
>>>
>>>     __remove_mapping()
>>>       if (!folio_ref_freeze(folio, refcount)) /* point B */
>>>         goto cannot_free /* folio remains in cache */
>>>
>>>     folio_unlock(folio) /* lock released */
>>>
>>>                                      /* lock acquired */
>>>                                      btrfs_subpage_clear_uptodate()
>>
>> Could you provide more context on where the btrfs_subpage_clear_uptodate()
>> call comes from?
>>
>>>                                        bfs = folio->private /* use-after-free */
>>>
>>> This exact race during relocation should not occur in the latest upstream
>>> code, but it's an example of a backport opportunity for this patch.
>>
>> Could you also explain what is missing in the 6.16 kernel that causes the
>> above use-after-free?
>>
>>>
>>> Signed-off-by: JP Kobryn <inwardvessel@...il.com>
>>> ---
>>>    fs/btrfs/extent_io.c |  6 ++++--
>>>    fs/btrfs/inode.c     | 18 ++++++++++++++++++
>>>    2 files changed, 22 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 3df399dc8856..d83d3f9ae3af 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
>>>    		return;
>>>    	fs_info = folio_to_fs_info(folio);
>>> -	if (btrfs_is_subpage(fs_info, folio))
>>> -		return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
>>> +	if (btrfs_is_subpage(fs_info, folio)) {
>>> +		/* freeing of private subpage data is deferred to btrfs_free_folio */
>>> +		return;
>>> +	}
>>
>> Another question is why only two filesystems (nfs for directory inodes, and
>> orangefs) use the free_folio() callback.
>>
>> Iomap does the same as btrfs and only calls ifs_free() in
>> release_folio() and invalidate_folio().
>>
>> Thus it looks like the free_folio() callback is not the recommended way to
>> free the folio->private pointer.
>>
>> Cc'ing the fsdevel list on whether the free_folio() callback should gain new
>> users.
>>
>>>    	folio_detach_private(folio);
>>
>> This means that for regular folios, we still clear the private flag of the
>> folio.
>>
>> It may be fine in most cases, as we will not touch folio->private anyway,
>> but this still looks like inconsistent behavior, especially since the
>> free_folio() callback handles both cases.
>>
>> Thanks,
>> Qu
>>
>>>    }
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index b8abfe7439a3..7a832ee3b591 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -7565,6 +7565,23 @@ static bool btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
>>>    	return __btrfs_release_folio(folio, gfp_flags);
>>>    }
>>> +/* frees subpage private data if present */
>>> +static void btrfs_free_folio(struct folio *folio)
>>> +{
>>> +	struct btrfs_folio_state *bfs;
>>> +
>>> +	if (!folio_test_private(folio))
>>> +		return;
>>> +
>>> +	bfs = folio_detach_private(folio);
>>> +	if (bfs == (void *)EXTENT_FOLIO_PRIVATE) {
>>> +		/* extent map flag is detached in btrfs_release_folio */
>>> +		return;
>>> +	}
>>> +
>>> +	btrfs_free_folio_state(bfs);
>>> +}
>>> +
>>>    #ifdef CONFIG_MIGRATION
>>>    static int btrfs_migrate_folio(struct address_space *mapping,
>>>    			     struct folio *dst, struct folio *src,
>>> @@ -10651,6 +10668,7 @@ static const struct address_space_operations btrfs_aops = {
>>>    	.invalidate_folio = btrfs_invalidate_folio,
>>>    	.launder_folio	= btrfs_launder_folio,
>>>    	.release_folio	= btrfs_release_folio,
>>> +	.free_folio = btrfs_free_folio,
>>>    	.migrate_folio	= btrfs_migrate_folio,
>>>    	.dirty_folio	= filemap_dirty_folio,
>>>    	.error_remove_folio = generic_error_remove_folio,
>>
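
For reference, the ordering the RFC relies on can be sketched in userspace
as below. This is only a model under stated assumptions: the model_*()
names and both structs are hypothetical, and the sketch assumes that the
free step runs only after the folio has been removed from the page cache.
It illustrates the ordering only, and says nothing about whether
free_folio() is the recommended place to free folio->private.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_state {
	bool freed;                   /* set where the kernel would kfree() */
};

struct model_folio {
	int refcount;                 /* 1 for the page cache, +1 per extra holder */
	bool in_cache;
	struct model_state *private;  /* models the folio private pointer */
};

/* Models the patched release_folio(): the state is left attached. */
static bool model_release_folio(struct model_folio *f)
{
	(void)f;
	return true;
}

/* Models free_folio(): assumed to run only after removal from the cache. */
static void model_free_folio(struct model_folio *f)
{
	assert(!f->in_cache);
	if (!f->private)
		return;
	f->private->freed = true;
	f->private = NULL;
}

/* Models the removal step: freeze the refcount, then remove and free. */
static bool model_remove_mapping(struct model_folio *f, int expected)
{
	if (f->refcount != expected)
		return false;         /* cannot_free: folio and state stay */
	f->refcount = 0;
	f->in_cache = false;
	model_free_folio(f);
	return true;
}

/*
 * Returns true if the state is intact whenever the folio is still cached,
 * and freed once the folio has actually left the cache.
 */
bool model_deferred_free_is_safe(bool relocation_takes_ref)
{
	struct model_state st = { .freed = false };
	struct model_folio f = {
		.refcount = 1, .in_cache = true, .private = &st,
	};

	model_release_folio(&f);
	if (relocation_takes_ref)
		f.refcount++;         /* extra holder between release and freeze */
	model_remove_mapping(&f, 1);

	if (f.in_cache)
		return !st.freed && f.private == &st;
	return st.freed;
}
```

With relocation_takes_ref set, the freeze fails and the folio stays cached
with its state intact; without it, the folio leaves the cache and the state
is freed afterwards. The function returns true in both cases.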