linux-kernel - Re: [RFC v3 1/2] ext4: fast_commit: assert i_data

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <e947ad33-afdb-40e6-9f8f-46cc5ea951ee@huaweicloud.com>
Date: Thu, 8 Jan 2026 11:00:38 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Li Chen <me@...ux.beauty>
Cc: linux-ext4 <linux-ext4@...r.kernel.org>, Theodore Ts'o <tytso@....edu>,
 Andreas Dilger <adilger.kernel@...ger.ca>,
 linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before
 sleep

On 1/7/2026 10:30 PM, Li Chen wrote:
> Hi Zhang,
> 
> Thanks a lot for your detailed review!
> 
>  ---- On Wed, 07 Jan 2026 10:00:23 +0800  Zhang Yi <yi.zhang@...weicloud.com> wrote --- 
>  > On 1/6/2026 8:18 PM, Li Chen wrote:
>  > > Hi Zhang Yi,
>  > > 
>  > >  ---- On Mon, 05 Jan 2026 20:18:42 +0800  Zhang Yi <yi.zhang@...weicloud.com> wrote --- 
>  > >  > Hi Li,
>  > >  > 
>  > >  > On 12/24/2025 11:29 AM, Li Chen wrote:
>  > >  > > ext4_fc_track_inode() can return without sleeping when
>  > >  > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
>  > >  > > ei->i_data_sem was done unconditionally before the wait loop, which can
>  > >  > > WARN in call paths that hold i_data_sem even though we never block. Move
>  > >  > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
>  > >  > > right before schedule().
>  > >  > > 
>  > >  > > Signed-off-by: Li Chen <me@...ux.beauty>
>  > >  > 
>  > >  > Thank you for the fix patch! However, the solution does not seem to fix
>  > >  > the issue. IIUC, the root cause of this issue is the following race
>  > >  > condition (show only one case), and it may cause a real ABBA dead lock
>  > >  > issue.
>  > >  > 
>  > >  > ext4_map_blocks()
>  > >  >  hold i_data_sem // <- A
>  > >  >  ext4_mb_new_blocks()
>  > >  >   ext4_dirty_inode()
>  > >  >                                  ext4_fc_commit()
>  > >  >                                   ext4_fc_perform_commit()
>  > >  >                                    set EXT4_STATE_FC_COMMITTING  <-B
>  > >  >                                    ext4_fc_write_inode_data()
>  > >  >                                    ext4_map_blocks()
>  > >  >                                     hold i_data_sem  // <- A
>  > >  >    ext4_fc_track_inode()
>  > >  >     wait EXT4_STATE_FC_COMMITTING  <- B
>  > >  >                                   jbd2_fc_end_commit()
>  > >  >                                    ext4_fc_cleanup()
>  > >  >                                     clear EXT4_STATE_FC_COMMITTING()
>  > >  > 
>  
> I think the ABBA reasoning is plausible: if a caller violates the ordering
> contract and enters ext4_fc_track_inode() while holding i_data_sem, and the
> recheck still finds EXT4_STATE_FC_COMMITTING set (so we actually schedule()),
> then we can get A -> wait(B). If the commit task, while holding the inode
> in COMMITTING, still needs i_data_sem (e.g. via mapping/log writing), that
> gives B -> wait(A), forming a cycle.
> 
>  > >  > Postponing the lockdep assertion to the point where sleeping is actually
>  > >  > necessary does not resolve this deadlock issue, it merely masks the
>  > >  > problem, right?
>  > >  > 
>  > >  > I currently don't quite understand why only ext4_fc_track_inode() needs
>  > >  > to wait for the inode being fast committed to be completed, instead of
>  > >  > adding it to the FC_Q_STAGING list like other tracking operations.
>  > 
>  > It seems that the inode metadata of the tracked inode was not recorded
>  > during the __track_inode(), so the inode metadata committed at commit
>  > time reflects real-time data. However, the current
>  > ext4_fc_perform_commit() lacks concurrency control, allowing other
>  > processes to simultaneously initiate new handles that modify the inode
>  > metadata while the previous metadata is being fast committed. Therefore,
>  > to prevent recording newly changed inode metadata during the old commit
>  > phase, the ext4_fc_track_inode() function must wait for the ongoing
>  > commit process to complete before modifying.
>  > 
>  > >  > So
>  > >  > now I don't have a good idea to fix this problem either.  Perhaps we
>  > >  > need to rethink the necessity of this waiting, or find a way to avoid
>  > >  > acquiring i_data_sem during fast commit.
>  > 
>  > Ha, the solution seems to have already been listed in the TODOs in
>  > fast_commit.c.
>  > 
>  >   Change ext4_fc_commit() to lookup logical to physical mapping using extent
>  >   status tree. This would get rid of the need to call ext4_fc_track_inode()
>  >   before acquiring i_data_sem. To do that we would need to ensure that
>  >   modified extents from the extent status tree are not evicted from memory.
>  > 
>  > Alternatively, recording the mapped range of tracking might also be
>  > feasible.
> 
> Thanks a lot for your insights!
> 
> For the next revesion, I plan to follow the "Alternatively" way firstly:
> record the mapped ranges (and relvant inode metadata) at commit time in a
> snapshot, when journal updates are locked/handles are drained, and then
> consume only the snapshot during log writing. This avoids doing
> logical-to-physical mapping (and thus avoids taking i_data_sem) in the log
> writing phase, and removes the need for ext4_fc_track_inode() to wait for
> EXT4_STATE_FC_COMMITTING.
> 
> I did not pick the extent status tree approach because it would require
> additional work to guarantee the needed mappings are resident and not
> evicted under memory pressure, which seems like a larger correctness
> surface(Please correct me if I'm wrong). If you believe the extent stats tree
> approach is better, please let me know, and I will do my best to implement it.
> 
> Thanks again for the guidance. I'll post an RFC v3 later.

Yes, in my opinion, I also prefer the solution of recording the mapped ranges,
as it should not incur too much overhead. Please give it a try.

Cheers,
Yi.

> 
> Regards,
> Li
>