linux-kernel - Re: [RFC] DEPT report on around btrfs, unlink, and truncate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a93738a1-57af-4eef-9a32-edfc60c7e7b4@gmx.com>
Date: Mon, 23 Jun 2025 19:28:38 +0930
From: Qu Wenruo <quwenruo.btrfs@....com>
To: Byungchul Park <byungchul@...com>
Cc: linux-kernel@...r.kernel.org, clm@...com, josef@...icpanda.com,
 dsterba@...e.com, linux-btrfs@...r.kernel.org, kernel_team@...ynix.com,
 torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
 yeoreum.yun@....com, yunseong.kim@...csson.com, gwan-gyeong.mun@...el.com,
 harry.yoo@...cle.com, ysk@...lloc.com
Subject: Re: [RFC] DEPT report on around btrfs, unlink, and truncate



在 2025/6/23 19:22, Byungchul Park 写道:
> On Mon, Jun 23, 2025 at 06:22:44PM +0930, Qu Wenruo wrote:
>> 在 2025/6/23 17:49, Byungchul Park 写道:
>>> On Mon, Jun 23, 2025 at 03:20:43PM +0930, Qu Wenruo wrote:
>>>> 在 2025/6/23 12:51, Byungchul Park 写道:
>>>>> Hi folks,
>>>>>
>>>>> Thanks to Yunseong, we got two DEPT reports in btrfs.  It doesn't mean
>>>>> it's obvious deadlocks, but after digging into the reports, I'm
>>>>> wondering if it could happen by any chance.
>>>>>
>>>>> 1) The first scenario that I'm concerning is:
>>>>>
>>>>>      context A            context B
>>>>>
>>>>>                           do_truncate()
>>>>>                             ...
>>>>>                               btrfs_do_readpage() // with folio lock held
>>>>
>>>>                                  This one is for data.
>>>
>>> Do you mean this folio is for data?  Thanks for the confirmation.
>>
>> Yes, only data folios will go through btrfs_do_readpage().
>>
>> For metadata, we never go through btrfs_do_readpage(), but
>> read_extent_buffer_pages_nowait().
>>
>>
>>>
>>>>>      do_unlinkat()
>>>>>        ...
>>>>>          push_leaf_right()
>>>>>         btrfs_tree_lock_nested()
>>>>>           down_write_nested(&eb->lock) // hold
>>>
>>> This is struct extent_buffer's rw_sem.  Right?
>>>
>>>>>                                 btrfs_get_extent()
>>>>>                                   btrfs_lookup_file_extent()
>>>>>                                     btrfs_search_slot()
>>>>>                                       down_read_nested(&eb->lock) // stuck
>>>>
>>>>                                          This one is for metadata.
>>>                                                ^
>>>                                        I don't get this actually.
>>>
>>> This is struct extent_buffer's rw_sem, too.  Cannot this rw_sem be the
>>> same as the rw_sem above in context A?
>>
>> My bad, I thought you're talking about that down_read_nested()
>> conflicting with folio lock.
>>
>> But if you're talking about extent_buffer::lock, then the one in context
>> B will wait for the one in context A, and that's expected.
> 
> Sounds good.
> 
>>>> Data and metadata page cache will never cross into each other.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>           __push_leaf_right()
>>>>>             ...
>>>>>               folio_lock() // stuck
>>>
>>> Did you mean this folio is always for metadata?
>>
>> Can you explain more on where this folio_lock() comes from?
> 
> I should also rely on the following stacktrace in the dept report.  I
> asked Yunseong who reported this issue, for the decoded stacktrace, so
> that I can interpret that better.  I will get back once I figure out
> where the wait on PG_locked comes from.
> 
>     [  304.344198][ T7488] [W] dept_page_wait_on_bit(pg_locked_map:0):
>     [  304.344211][ T7488] [<ffff8000823b1d20>] __push_leaf_right+0x8f0/0xc70

I believe it's from btrfs_clear_buffer_dirty():

As we have a for() loop iterating all the folios of a an extent buffer 
(aka, metadata structure), then clear the dirty flags.

The same applies to btrfs_mark_buffer_dirty() -> set_extent_buffer_dirty().

In that case, the folio is 100% belonging to btree inode thus metadata.

Thus the folio lock can not conflict with a data folio, thus there 
should be no deadlock.

Thanks,
Qu


>     [  304.344232][ T7488] stacktrace:
>     [  304.344241][ T7488]       __push_leaf_right+0x8f0/0xc70
>     [  304.344260][ T7488]       push_leaf_right+0x408/0x628
>     [  304.344278][ T7488]       btrfs_del_items+0x974/0xaec
>     [  304.344297][ T7488]       btrfs_truncate_inode_items+0x1c5c/0x2b00
>     [  304.344314][ T7488]       btrfs_evict_inode+0xa4c/0xd38
>     [  304.344335][ T7488]       evict+0x340/0x7b0
>     [  304.344352][ T7488]       iput+0x4ec/0x840
>     [  304.344369][ T7488]       do_unlinkat+0x444/0x59c
>     [  304.344388][ T7488]       __arm64_sys_unlinkat+0x11c/0x260
>     [  304.344407][ T7488]       invoke_syscall+0x88/0x2e0
>     [  304.344425][ T7488]       el0_svc_common.constprop.0+0xe8/0x2e0
>     [  304.344445][ T7488]       do_el0_svc+0x44/0x60
>     [  304.344463][ T7488]       el0_svc+0x50/0x188
>     [  304.344482][ T7488]       el0t_64_sync_handler+0x10c/0x140
>     [  304.344503][ T7488]       el0t_64_sync+0x198/0x19c
> 
> Thanks.
> 
> 	Byungchul
> 
>> I didn't see any location where __push_leaf_right() is locking a folio
>> nor the original do_unlinkat().
>>
>> So here I can only guess the folio is from __push_leaf_right() context,
>> that means it can only be a metadata folio.
>>
>>>
>>> If no, it could lead a deadlock in my opinion.  If yes, dept should
>>> assign different classes to folios between data data and metadata.
>>
>> So far I believe the folio belongs to metadata.
>>
>> And since btrfs has very different handling of metadata folios, and it's
>> a little confusing that, we also have a btree_inode to handle the
>> metadata page cache, but do not have read_folio() callbacks, it can be a
>> little confusing to some automatic tools.
>>
>> Thanks,
>> Qu