lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Mon, 10 Oct 2022 16:30:14 +0530
From:   Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To:     Jan Kara <jack@...e.cz>
Cc:     linux-ext4@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>,
        Ritesh Harjani <riteshh@...ux.ibm.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        rookxu <brookxu.cn@...il.com>,
        Ritesh Harjani <ritesh.list@...il.com>
Subject: Re: [PATCH 7/8] ext4: Use rbtrees to manage PAs instead of inode
 i_prealloc_list

On Mon, Oct 10, 2022 at 11:38:26AM +0200, Jan Kara wrote:
> On Fri 07-10-22 02:16:18, Ojaswin Mujoo wrote:
> > Currently, the kernel uses i_prealloc_list to hold all the inode
> > preallocations. This is known to cause degradation in performance in
> > workloads which perform large number of sparse writes on a single file.
> > This is mainly because functions like ext4_mb_normalize_request() and
> > ext4_mb_use_preallocated() iterate over this complete list, resulting in
> > slowdowns when large number of PAs are present.
> > 
> > Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
> > the inode preallocation list and adding logic to continually trim the
> > list if it grows above the threshold, however our testing revealed that
> > a hardcoded value is not suitable for all kinds of workloads.
> > 
> > To optimize this, add an rbtree to the inode and hold the inode
> > preallocations in this rbtree. This will make iterating over inode PAs
> > faster and scale much better than a linked list. Additionally, we also
> > had to remove the LRU logic that was added during trimming of the list
> > (in ext4_mb_release_context()) as it will add extra overhead in rbtree.
> > The discards now happen in the lowest-logical-offset-first order.
> > 
> > ** Locking notes **
> > 
> > With the introduction of rbtree to maintain inode PAs, we can't use RCU
> > to walk the tree for searching since it can result in partial traversals
> > which might miss some nodes(or entire subtrees) while discards happen
> > in parallel (which happens under a lock).  Hence this patch converts the
> > ei->i_prealloc_lock spin_lock to rw_lock.
> > 
> > Almost all the codepaths that read/modify the PA rbtrees are protected
> > by the higher level inode->i_data_sem (except
> > ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the
> > only place we need lock protection is when one thread is reading
> > "searching" the PA rbtree (earlier protected under rcu_read_lock()) and
> > another is "deleting" the PAs in ext4_mb_discard_group_preallocations()
> > function (which iterates all the PAs using the grp->bb_prealloc_list and
> > deletes PAs from the tree without taking any inode lock (i_data_sem)).
> > 
> > So, this patch converts all rcu_read_lock/unlock() paths for inode list
> > PA to use read_lock() and all places where we were using
> > ei->i_prealloc_lock spinlock will now be using write_lock().
> > 
> > Note that this makes the fast path (searching of the right PA e.g.
> > ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use
> > read_lock() instead of rcu_read_lock/unlock().  Ths also will now block
> > due to slow discard path (ext4_mb_discard_group_preallocations()) which
> > uses write_lock().
> > 
> > But this is not as bad as it looks. This is because -
> > 
> > 1. The slow path only occurs when the normal allocation failed and we
> >    can say that we are low on disk space.  One can argue this scenario
> >    won't be much frequent.
> > 
> > 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock
> >    for deleting every individual PA.  This gives enough opportunity for
> >    the fast path to acquire the read_lock for searching the PA inode
> >    list.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
> 
> Looks mostly good to me now. Just three nits below. With those fixes feel
> free to add:
> 
> Reviewed-by: Jan Kara <jack@...e.cz>
> 
> > @@ -4031,19 +4054,27 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> >  	new_end = *end;
> >  
> >  	/* check we don't cross already preallocated blocks */
> > -	rcu_read_lock();
> > -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> > -		if (tmp_pa->pa_deleted)
> > +	read_lock(&ei->i_prealloc_lock);
> > +	for (iter = ei->i_prealloc_node.rb_node; iter;
> > +	     iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
> > +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
> > +				  pa_node.inode_node);
> > +		tmp_pa_start = tmp_pa->pa_lstart;
> > +		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> > +
> > +		/*
> > +		 * If pa is deleted, ignore overlaps and just iterate in rbtree
> > +		 * based on tmp_pa_start
> > +		 */
> > +		if (tmp_pa->pa_deleted) {
> >  			continue;
> > +		}
> 
> Curly braces here are pointless.
> 
> > @@ -4408,17 +4439,21 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> >  		return false;
> >  
> >  	/* first, try per-file preallocation */
> > -	rcu_read_lock();
> > -	list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> > +	read_lock(&ei->i_prealloc_lock);
> > +	for (iter = ei->i_prealloc_node.rb_node; iter;
> > +	     iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical, tmp_pa_start, iter)) {
> > +		tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
> >  
> >  		/* all fields in this condition don't change,
> >  		 * so we can skip locking for them */
> >  		tmp_pa_start = tmp_pa->pa_lstart;
> >  		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> >  
> > +		/* original request start doesn't lie in this PA */
> >  		if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
> > -		    ac->ac_o_ex.fe_logical >= tmp_pa_end)
> > +		    ac->ac_o_ex.fe_logical >= tmp_pa_end) {
> >  			continue;
> > +		}
> 
> Again, curly braces here are pointless.
> 
> > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > +                       int (*cmp)(struct rb_node *, struct rb_node *))
> > +{
> > +       struct rb_node **iter = &root->rb_node, *parent = NULL;
> > +
> > +       while (*iter) {
> > +               parent = *iter;
> > +               if (cmp(new, *iter) < 0)
> > +                       iter = &((*iter)->rb_left);
> > +               else
> > +                       iter = &((*iter)->rb_right);
> > +       }
> > +
> > +       rb_link_node(new, parent, iter);
> > +       rb_insert_color(new, root);
> > +}
> 
> I think I wrote it already last time: ext4_mb_rb_insert() is always called
> with ext4_mb_pa_cmp() as the comparison function. Furthemore
> ext4_mb_pa_cmp() is used nowhere else. So I'd just opencode
> ext4_mb_pa_cmp() in ext4_mb_rb_insert() and get rid of the indirect call.
> Better for speed as well as readability.

Hi Jan,

As mentioned in change notes, I intentionally left it as it is to make
ext4_mb_rb_insert() helper function reusable. However, I agree with your 
point about readability so I'll just merge the 2 functions and send a
next version.

Thanks,
Ojaswin
> 
> 								Honza
> -- 
> Jan Kara <jack@...e.com>
> SUSE Labs, CR

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ