lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 17 Jan 2023 12:03:35 +0100
From:   Jan Kara <jack@...e.cz>
To:     Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc:     Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org,
        Theodore Ts'o <tytso@....edu>,
        Ritesh Harjani <riteshh@...ux.ibm.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        rookxu <brookxu.cn@...il.com>,
        Ritesh Harjani <ritesh.list@...il.com>
Subject: Re: [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode
 i_prealloc_list

On Tue 17-01-23 16:00:47, Ojaswin Mujoo wrote:
> On Mon, Jan 16, 2023 at 01:23:34PM +0100, Jan Kara wrote:
> > > Since this covers the special case we discussed above, we will always
> > > un-delete the PA when we encounter the special case and we can then
> > > adjust for overlap and traverse the PA rbtree without any issues.
> > > 
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> > > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
> 
> Hi Jan,
> Thanks for the review, sharing some of my thoughts below.
> 
> > 
> > So I find this putting back of already deleted inode PA very fragile. For
> > example in current code I suspect you've missed a case in ext4_mb_put_pa()
> > which can mark inode PA (so it can then be spotted by
> > ext4_mb_pa_adjust_overlap() and marked as in use again) but
> > ext4_mb_put_pa() still goes on and destroys the PA.
> 
> The 2 code paths that clash here are:
> 
> ext4_mb_new_blocks() -> ext4_mb_release_context() -> ext4_mb_put_pa()
> ext4_mb_new_blocks() -> ext4_mb_normalize_request() -> ext4_mb_pa_adjust_overlap()
> 
> Since these are the only code paths from which these 2 functions are
> called, for a given inode, access will always be serialized by the upper
> level ei->i_data_sem, which is always taken when writing data blocks
> using ext4_mb_new_block(). 

Indeed, inode->i_data_sem prevents the race I was afraid of.
 
> From my understanding of the code, I feel only
> ext4_mb_discard_group_preallocations() can race against other functions
> that are modifying the PA rbtree since it does not take any inode locks.
> 
> That being said, I do understand your concerns regarding the solution,
> however I'm willing to work with the community to ensure our
> implementation of this undelete feature is as robust as possible. Along
> with fixing the bug reported here [1], I believe that it is also a good
> optimization to have especially when the disk is near full and we are
> seeing a lot of group discards going on. 
> 
> Also, in case the deleted PA completely lies inside our new range, it is
> much better to just undelete and use it rather than deleting the
> existing PA and reallocating the range again. I think the advantage
> would be even bigger in ext4_mb_use_preallocated() function where we can
> just undelete and use the PA and skip the entire allocation, incase original
> range lies in a deleted PA.

Thanks for explantion. However I think you're optimizing the wrong thing.
We are running out of space (to run ext4_mb_discard_group_preallocations()
at all) and we allocate from an area covered by PA that we've just decided
to discard - if anything relies on performance of the filesystem in ENOSPC
conditions it has serious problems no matter what. Sure, we should deliver
the result (either ENOSPC or some block allocation) in a reasonable time
but the performance does not really matter much because all the scanning
and flushing is going to slow down everything a lot anyway. One additional
scan of the rbtree is really negligible in this case. So what we should
rather optimize for in this case is the code simplicity and maintainability
of this rare corner-case that will also likely get only a small amount of
testing. And in terms of code simplicity the delete & restart solution
seems to be much better (at least as far as I'm imagining it - maybe the
code will prove me wrong ;)).

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ