[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <19b98ddd118.2300666e5265102.1629777029508214951@linux.beauty>
Date: Wed, 07 Jan 2026 22:30:32 +0800
From: Li Chen <me@...ux.beauty>
To: "Zhang Yi" <yi.zhang@...weicloud.com>
Cc: "linux-ext4" <linux-ext4@...r.kernel.org>,
"Theodore Ts'o" <tytso@....edu>,
"Andreas Dilger" <adilger.kernel@...ger.ca>,
"linux-kernel" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before
sleep
Hi Zhang,
Thanks a lot for your detailed review!
---- On Wed, 07 Jan 2026 10:00:23 +0800 Zhang Yi <yi.zhang@...weicloud.com> wrote ---
> On 1/6/2026 8:18 PM, Li Chen wrote:
> > Hi Zhang Yi,
> >
> > ---- On Mon, 05 Jan 2026 20:18:42 +0800 Zhang Yi <yi.zhang@...weicloud.com> wrote ---
> > > Hi Li,
> > >
> > > On 12/24/2025 11:29 AM, Li Chen wrote:
> > > > ext4_fc_track_inode() can return without sleeping when
> > > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
> > > > ei->i_data_sem was done unconditionally before the wait loop, which can
> > > > WARN in call paths that hold i_data_sem even though we never block. Move
> > > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
> > > > right before schedule().
> > > >
> > > > Signed-off-by: Li Chen <me@...ux.beauty>
> > >
> > > Thank you for the fix patch! However, the solution does not seem to fix
> > > the issue. IIUC, the root cause of this issue is the following race
> > > condition (show only one case), and it may cause a real ABBA dead lock
> > > issue.
> > >
> > > ext4_map_blocks()
> > > hold i_data_sem // <- A
> > > ext4_mb_new_blocks()
> > > ext4_dirty_inode()
> > > ext4_fc_commit()
> > > ext4_fc_perform_commit()
> > > set EXT4_STATE_FC_COMMITTING <-B
> > > ext4_fc_write_inode_data()
> > > ext4_map_blocks()
> > > hold i_data_sem // <- A
> > > ext4_fc_track_inode()
> > > wait EXT4_STATE_FC_COMMITTING <- B
> > > jbd2_fc_end_commit()
> > > ext4_fc_cleanup()
> > > clear EXT4_STATE_FC_COMMITTING()
> > >
I think the ABBA reasoning is plausible: if a caller violates the ordering
contract and enters ext4_fc_track_inode() while holding i_data_sem, and the
recheck still finds EXT4_STATE_FC_COMMITTING set (so we actually schedule()),
then we can get A -> wait(B). If the commit task, while holding the inode
in COMMITTING, still needs i_data_sem (e.g. via mapping/log writing), that
gives B -> wait(A), forming a cycle.
> > > Postponing the lockdep assertion to the point where sleeping is actually
> > > necessary does not resolve this deadlock issue, it merely masks the
> > > problem, right?
> > >
> > > I currently don't quite understand why only ext4_fc_track_inode() needs
> > > to wait for the inode being fast committed to be completed, instead of
> > > adding it to the FC_Q_STAGING list like other tracking operations.
>
> It seems that the inode metadata of the tracked inode was not recorded
> during the __track_inode(), so the inode metadata committed at commit
> time reflects real-time data. However, the current
> ext4_fc_perform_commit() lacks concurrency control, allowing other
> processes to simultaneously initiate new handles that modify the inode
> metadata while the previous metadata is being fast committed. Therefore,
> to prevent recording newly changed inode metadata during the old commit
> phase, the ext4_fc_track_inode() function must wait for the ongoing
> commit process to complete before modifying.
>
> > > So
> > > now I don't have a good idea to fix this problem either. Perhaps we
> > > need to rethink the necessity of this waiting, or find a way to avoid
> > > acquiring i_data_sem during fast commit.
>
> Ha, the solution seems to have already been listed in the TODOs in
> fast_commit.c.
>
> Change ext4_fc_commit() to lookup logical to physical mapping using extent
> status tree. This would get rid of the need to call ext4_fc_track_inode()
> before acquiring i_data_sem. To do that we would need to ensure that
> modified extents from the extent status tree are not evicted from memory.
>
> Alternatively, recording the mapped range of tracking might also be
> feasible.
Thanks a lot for your insights!
For the next revesion, I plan to follow the "Alternatively" way firstly:
record the mapped ranges (and relvant inode metadata) at commit time in a
snapshot, when journal updates are locked/handles are drained, and then
consume only the snapshot during log writing. This avoids doing
logical-to-physical mapping (and thus avoids taking i_data_sem) in the log
writing phase, and removes the need for ext4_fc_track_inode() to wait for
EXT4_STATE_FC_COMMITTING.
I did not pick the extent status tree approach because it would require
additional work to guarantee the needed mappings are resident and not
evicted under memory pressure, which seems like a larger correctness
surface(Please correct me if I'm wrong). If you believe the extent stats tree
approach is better, please let me know, and I will do my best to implement it.
Thanks again for the guidance. I'll post an RFC v3 later.
Regards,
Li
Powered by blists - more mailing lists