linux-kernel - Re: [RFC] DEPT report on around btrfs, unlink, and truncate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250624014426.GC5820@system.software.com>
Date: Tue, 24 Jun 2025 10:44:26 +0900
From: Byungchul Park <byungchul@...com>
To: Qu Wenruo <quwenruo.btrfs@....com>
Cc: linux-kernel@...r.kernel.org, clm@...com, josef@...icpanda.com,
	dsterba@...e.com, linux-btrfs@...r.kernel.org,
	kernel_team@...ynix.com, torvalds@...ux-foundation.org,
	akpm@...ux-foundation.org, yeoreum.yun@....com,
	yunseong.kim@...csson.com, gwan-gyeong.mun@...el.com,
	harry.yoo@...cle.com, ysk@...lloc.com
Subject: Re: [RFC] DEPT report on around btrfs, unlink, and truncate

On Mon, Jun 23, 2025 at 07:28:38PM +0930, Qu Wenruo wrote:
> 在 2025/6/23 19:22, Byungchul Park 写道:
> > On Mon, Jun 23, 2025 at 06:22:44PM +0930, Qu Wenruo wrote:
> > > 在 2025/6/23 17:49, Byungchul Park 写道:
> > > > On Mon, Jun 23, 2025 at 03:20:43PM +0930, Qu Wenruo wrote:
> > > > > 在 2025/6/23 12:51, Byungchul Park 写道:
> > > > > > Hi folks,
> > > > > > 
> > > > > > Thanks to Yunseong, we got two DEPT reports in btrfs.  It doesn't mean
> > > > > > it's obvious deadlocks, but after digging into the reports, I'm
> > > > > > wondering if it could happen by any chance.
> > > > > > 
> > > > > > 1) The first scenario that I'm concerning is:
> > > > > > 
> > > > > >      context A            context B
> > > > > > 
> > > > > >                           do_truncate()
> > > > > >                             ...
> > > > > >                               btrfs_do_readpage() // with folio lock held
> > > > > 
> > > > >                                  This one is for data.
> > > > 
> > > > Do you mean this folio is for data?  Thanks for the confirmation.
> > > 
> > > Yes, only data folios will go through btrfs_do_readpage().
> > > 
> > > For metadata, we never go through btrfs_do_readpage(), but
> > > read_extent_buffer_pages_nowait().
> > > 
> > > 
> > > > 
> > > > > >      do_unlinkat()
> > > > > >        ...
> > > > > >          push_leaf_right()
> > > > > >         btrfs_tree_lock_nested()
> > > > > >           down_write_nested(&eb->lock) // hold
> > > > 
> > > > This is struct extent_buffer's rw_sem.  Right?
> > > > 
> > > > > >                                 btrfs_get_extent()
> > > > > >                                   btrfs_lookup_file_extent()
> > > > > >                                     btrfs_search_slot()
> > > > > >                                       down_read_nested(&eb->lock) // stuck
> > > > > 
> > > > >                                          This one is for metadata.
> > > >                                                ^
> > > >                                        I don't get this actually.
> > > > 
> > > > This is struct extent_buffer's rw_sem, too.  Cannot this rw_sem be the
> > > > same as the rw_sem above in context A?
> > > 
> > > My bad, I thought you're talking about that down_read_nested()
> > > conflicting with folio lock.
> > > 
> > > But if you're talking about extent_buffer::lock, then the one in context
> > > B will wait for the one in context A, and that's expected.
> > 
> > Sounds good.
> > 
> > > > > Data and metadata page cache will never cross into each other.
> > > > > 
> > > > > Thanks,
> > > > > Qu
> > > > > 
> > > > > >           __push_leaf_right()
> > > > > >             ...
> > > > > >               folio_lock() // stuck
> > > > 
> > > > Did you mean this folio is always for metadata?
> > > 
> > > Can you explain more on where this folio_lock() comes from?
> > 
> > I should also rely on the following stacktrace in the dept report.  I
> > asked Yunseong who reported this issue, for the decoded stacktrace, so
> > that I can interpret that better.  I will get back once I figure out
> > where the wait on PG_locked comes from.
> > 
> >     [  304.344198][ T7488] [W] dept_page_wait_on_bit(pg_locked_map:0):
> >     [  304.344211][ T7488] [<ffff8000823b1d20>] __push_leaf_right+0x8f0/0xc70
> 
> I believe it's from btrfs_clear_buffer_dirty():
> 
> As we have a for() loop iterating all the folios of a an extent buffer
> (aka, metadata structure), then clear the dirty flags.
> 
> The same applies to btrfs_mark_buffer_dirty() -> set_extent_buffer_dirty().

Thanks to Yunseong, I figured out this is the case.

> In that case, the folio is 100% belonging to btree inode thus metadata.

Good to know.

Lastly, is it still good with directly manipulating block devs or
stacked file system using loopback devices, from the confliction of
folios and extent_buffers?

If you confirm it, this issue can be closed :-) Thanks in advance.

	Byungchul

> Thus the folio lock can not conflict with a data folio, thus there
> should be no deadlock.
> 
> Thanks,
> Qu
> 
> 
> >     [  304.344232][ T7488] stacktrace:
> >     [  304.344241][ T7488]       __push_leaf_right+0x8f0/0xc70
> >     [  304.344260][ T7488]       push_leaf_right+0x408/0x628
> >     [  304.344278][ T7488]       btrfs_del_items+0x974/0xaec
> >     [  304.344297][ T7488]       btrfs_truncate_inode_items+0x1c5c/0x2b00
> >     [  304.344314][ T7488]       btrfs_evict_inode+0xa4c/0xd38
> >     [  304.344335][ T7488]       evict+0x340/0x7b0
> >     [  304.344352][ T7488]       iput+0x4ec/0x840
> >     [  304.344369][ T7488]       do_unlinkat+0x444/0x59c
> >     [  304.344388][ T7488]       __arm64_sys_unlinkat+0x11c/0x260
> >     [  304.344407][ T7488]       invoke_syscall+0x88/0x2e0
> >     [  304.344425][ T7488]       el0_svc_common.constprop.0+0xe8/0x2e0
> >     [  304.344445][ T7488]       do_el0_svc+0x44/0x60
> >     [  304.344463][ T7488]       el0_svc+0x50/0x188
> >     [  304.344482][ T7488]       el0t_64_sync_handler+0x10c/0x140
> >     [  304.344503][ T7488]       el0t_64_sync+0x198/0x19c
> > 
> > Thanks.
> > 
> >       Byungchul
> > 
> > > I didn't see any location where __push_leaf_right() is locking a folio
> > > nor the original do_unlinkat().
> > > 
> > > So here I can only guess the folio is from __push_leaf_right() context,
> > > that means it can only be a metadata folio.
> > > 
> > > > 
> > > > If no, it could lead a deadlock in my opinion.  If yes, dept should
> > > > assign different classes to folios between data data and metadata.
> > > 
> > > So far I believe the folio belongs to metadata.
> > > 
> > > And since btrfs has very different handling of metadata folios, and it's
> > > a little confusing that, we also have a btree_inode to handle the
> > > metadata page cache, but do not have read_folio() callbacks, it can be a
> > > little confusing to some automatic tools.
> > > 
> > > Thanks,
> > > Qu
> 
>