[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAD+ocbw8Z-mUiNQy2oDEzA3bBM=o-Gboi+r=HB4s2Hm6=-R8iQ@mail.gmail.com>
Date: Mon, 21 Feb 2022 10:54:16 -0800
From: harshad shirwadkar <harshadshirwadkar@...il.com>
To: Ritesh Harjani <riteshh@...ux.ibm.com>
Cc: "Theodore Ts'o" <tytso@....edu>,
linux-ext4 <linux-ext4@...r.kernel.org>, Jan Kara <jack@...e.cz>
Subject: Re: Query regarding fast_commit replay of inode
Hi Ritesh,
Thanks for bringing this up again. I think we should really document
this in the code. Let me first try to explain what we are doing and
then why we are doing it this way.
During fast commit replay (even for "add range" / "delete range"
tags), we always clear the bitmaps of whatever blocks we touch. For
example, even if new blocks are added to an inode, during the replay
phase of add range tag, we will still clear the bitmap to mark these
new blocks as free and at the end of the replay (i.e. after all the
tags have been replayed) we will traverse the "modified inodes list"
and mark all the blocks in these inodes as allocated.
We extend this logic to the replay of the "inode" tag as well. So
here's the high level flow of replay path wrt block bitmap handling:
- For every "add_range" / "del_range" tags in the FC log area:
- Clear block bitmaps for all modified blocks in this tag.
- For every "inode" tag in the FC log area:
- Clear all the blocks in that inode
- Record that the inode is modified
- At the end of the replay phase
- For each inode in the modified inode list:
- Mark all the blocks in this inode as used
Let me try to describe the reason why we take this approach with an
example. Let's say there was an inode A of size 100 with following
extents:
Logical - Physical
[0-49] - unmapped
[50-99] - [1000 - 1049]
Let's say this inode changed to following mappings:
Logical - Physical
[0-49] - [1000 - 1049]
[50-99] - [1050 - 1099]
Let's say a fast commit was performed at this point. So, this would
translate to following logs in the fast commit area:
[ADD_RANGE, A, 0-49, 1000-1049] [ADD_RANGE, A, 50-99, 1050-1099]
After replaying the first tag, the inode's intermediate state would be
as follows:
Logical - Physical
[0-49] - [1000-1049]
[50-99] - [1000-1049]
During replay of the second tag, the logical to physical mapping of
50:99 has changed. So, we need to decide what to do with the old
blocks. If we mark the old physical blocks corresponding to 50-99
range as free on-disk, then that would result in block bitmap
inconsistency, since these blocks are actually being used somewhere
else. We can't simply leave these blocks as allocated either since
these blocks may actually not be used at all. Things get even more
complicated if these blocks are moved to a different file. So to
protect against such cases, what we simply do is that we defer setting
up the block bitmap correctly until the very end and just mark all the
blocks that we encounter during replays of different tags as free.
This was done mainly to protect against such cases in add range /
delete range operations, but I extended it to replay inode as well
just to be on the safer side. generic/311 provides a lot of good test
cases for such manipulations. We can see if removing clear_bb from
replay inode handling introduces any regressions.
Does this make sense?
- Harshad
On Sun, 20 Feb 2022 at 23:59, Ritesh Harjani <riteshh@...ux.ibm.com> wrote:
>
> Hello Ted/Harshad,
>
> I think we did discuss this once before, but I am unable to recollect the exact
> reasoning for this. So question is - why do we have to call ext4_ext_clear_bb()
> from ext4_fc_replay_inode()?
>
> I was just thinking if this is suboptimal and if it can be optimized. But before
> working on that problem, I wanted to again understand the right reasoning behind
> choosing this approach in the first place.
>
> Could you please help with this again?
>
> ext4_fc_replay_inode()
> <..>
> inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
> if (!IS_ERR(inode)) {
> ext4_ext_clear_bb(inode);
> iput(inode);
> }
> <..>
>
> I will document it this time, so that I don't have to keep coming to this
> everytime I look into fc replay code.
>
> Thanks again for your help!!
> -ritesh
Powered by blists - more mailing lists