linux-ext4 - Re: [PATCH] ext4: make es shrinker handle nr_to_scan correctly (Re: [RFC PATCH 2/2] ext4: improve ...)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131231105912.GB11920@quack.suse.cz>
Date:	Tue, 31 Dec 2013 11:59:12 +0100
From:	Jan Kara <jack@...e.cz>
To:	Zheng Liu <gnehzuil.liu@...il.com>
Cc:	Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org,
	Theodore Ts'o <tytso@....edu>,
	Andreas Dilger <adilger.kernel@...ger.ca>,
	Zheng Liu <wenqing.lz@...bao.com>
Subject: Re: [PATCH] ext4: make es shrinker handle nr_to_scan correctly (Re:
 [RFC PATCH 2/2] ext4: improve ...)

On Tue 31-12-13 16:20:15, Zheng Liu wrote:
> On Mon, Dec 30, 2013 at 10:09:17PM +0100, Jan Kara wrote:
> > On Wed 25-12-13 11:34:48, Zheng Liu wrote:
> > > On Mon, Dec 23, 2013 at 09:54:19AM +0100, Jan Kara wrote:
> > > > On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@...bao.com>
> > > > > 
> > > > > The extents status tree shrinker will scan all inodes on sbi->s_es_lru
> > > > > under heavy memory pressure, and try to reclaim the entry from extents
> > > > > status tree.  During this process it couldn't reclaim the delayed entry
> > > > > because ext4 needs to use these entries to do delayed allocation space
> > > > > reservation, seek_data/hole, etc....  So if a system has done a huge
> > > > > number of writes and these dirty pages don't be written out.  There will
> > > > > be a lot of delayed entries on extents status tree.  If shrinker tries
> > > > > to reclaim memory from the tree, it will burn some CPU time to iterate
> > > > > on these non-reclaimable entries.  At some circumstances it could cause
> > > > > excessive stall time.
> > > > > 
> > > > > In this commit a new list is used to track reclaimable entries of extent
> > > > > status tree (e.g. written/unwritten/hole entries).  The shrinker will
> > > > > scan reclaimable entry on this list.  So it won't encouter any delayed
> > > > > entry and don't need to take too much time to spin.  But the defect is
> > > > > that we need to cost extra 1/3 memory space for one entry.  Before this
> > > > > commit, 'struct extent_status' occupies 48 bytes on a 64bits platform.
> > > > > After that it will occupy 64 bytes. :(
> > > >   This looks sensible. I was just wondering about one thing: One incorrect
> > > > thing the old extent shrinker does is that it tries to reclaim 'nr_to_scan'
> > > > objects. That is wrong - it should *scan* 'nr_to_scan' objects and reclaim
> > > > objects it can find. Now we shouldn't always start scanning at the end of
> > > > the LRU because if delayed extents accumulate there we would never reclaim
> > > > anything. Rather we should cycle through the list of entries we have. But
> > > > that doesn't play well with the fact we have LRU list and thus want to
> > > > reclaim from the end of the list. In the end what you do might be the best
> > > > we can do but I wanted to mention the above just in case someone has some
> > > > idea.
> > > 
> > > Ah, thanks for pointing it out.  So maybe we can fix this issue before
> > > we are sure that the new improvement is acceptable because it makes us
> > > avoid scanning too many objects.  What do you think?
> >   I'm sorry but I'm not sure I understand.  By 'fix this issue' do you mean
> > using your patch or somehow fixing the problem that we try to reclaim
> > 'nr_to_scan' objects instead of just trying to scan that many objects?
> 
> This patch tries to make es shrinker handle nr_to_scan properly.  After
> applying this patch, es shrinker just scans nr_to_scan objects.  But
> __ext4_es_shrink() couldn't guarantee that it can reclaim objects from
> extent status tree because it doesn't scan all objects and it might only
> scan nr_to_scan delayed objects.  So it could return ENOMEM error from
> ext4_es_insert/remove_extent(), although there are some reclaimable
> objects in the tree.
> 
> Another approach to solve above problem is that we add a new parameter
> called 'nr_to_discard', which tells es shrinker to reclaim nr_to_discard
> objects.  But it makes thing a bit complicated.  I am not sure whether
> it is worthwhile because if we use a list to track all reclaimable
> objects we no longer need nr_to_discard parameter.
> 
> Obviously, this patch is only a band-aid and it might be not very useful
> for us to solve the problem that we faced.  But I attach it below in
> case someone have some idea.
> 
> Regards,
>                                                 - Zheng
> 
> Subject: [PATCH] ext4: make es shrinker handle nr_to_scan correctly
> 
> From: Zheng Liu <wenqing.lz@...bao.com>
> 
> Nr_to_scan means that the shrinker needs to traverse nr_to_scan objects
> rather than reclaim nr_to_scan objects.  This commit makes es shrinker
> handle it correctly.  But after this __ext4_es_shrink() couldn't
> guarantee that it can reclaim objects.  Hence we need to enlarge value
> from 1 to 128 in ext4_es_insert/remove_extent() to avoid returning
> ENOMEM error.  Before we use a list to track all reclaimable objects,
> there is a tradeoff between returning ENOMEM and discarding too many
> objects from extent status tree.
  But the way you've implemented this won't quite work - reclaim never asks
a slab cache to scan large number of objects in one round - the maximum is
128. So if you have 128 unreclaimable objects at the beginning of your
object list you would never reclaim anything - that is what I was speaking
about in my original email. So you have to implement some 'cycling' of head
pointer so that you can continue scanning when asked next time instead of
staring over from the beginning of the list. This logic doesn't play well
with the LRU approach we try to do in the extent cache. So I really see two
viable approaches

1) Trivial: Forget about LRU, just have a list of inodes with extents in
extent cache. Shrinker will remember ino + offset pair it stopped scanning
at and continues scanning the list of inodes from there (restarting the scan
from the beginning if inode got reclaimed in the mean time).

2) Complex: Keep the LRU algorithm and try to make it more CPU efficient.

Approach 1) can possibly reclaim extents which are still very useful. OTOH
these extents will be loaded quickly back and then it should take full
cycle over all other extents to get back to these and reclaim them again.
So it needn't be too bad.

								Honza

> Signed-off-by: Zheng Liu <wenqing.lz@...bao.com>
> ---
>  fs/ext4/extents_status.c |   28 ++++++++++++++++------------
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index eb7ae61..87795d1 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -146,7 +146,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
>  static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>  			      ext4_lblk_t end);
>  static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> -				       int nr_to_scan);
> +				       int *nr_to_scan);
>  static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>  			    struct ext4_inode_info *locked_ei);
>  
> @@ -670,7 +670,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  		goto error;
>  retry:
>  	err = __es_insert_extent(inode, &newes);
> -	if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> +	if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
>  					       EXT4_I(inode)))
>  		goto retry;
>  	if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
> @@ -824,7 +824,7 @@ retry:
>  				es->es_lblk = orig_es.es_lblk;
>  				es->es_len = orig_es.es_len;
>  				if ((err == -ENOMEM) &&
> -				    __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> +				    __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
>  						     EXT4_I(inode)))
>  					goto retry;
>  				goto out;
> @@ -938,8 +938,6 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>  
>  retry:
>  	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
> -		int shrunk;
> -
>  		/*
>  		 * If we have already reclaimed all extents from extent
>  		 * status tree, just stop the loop immediately.
> @@ -966,13 +964,11 @@ retry:
>  			continue;
>  
>  		write_lock(&ei->i_es_lock);
> -		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> +		nr_shrunk += __es_try_to_reclaim_extents(ei, &nr_to_scan);
>  		if (ei->i_es_lru_nr == 0)
>  			list_del_init(&ei->i_es_lru);
>  		write_unlock(&ei->i_es_lock);
>  
> -		nr_shrunk += shrunk;
> -		nr_to_scan -= shrunk;
>  		if (nr_to_scan == 0)
>  			break;
>  	}
> @@ -1003,8 +999,16 @@ retry:
>  
>  	spin_unlock(&sbi->s_es_lru_lock);
>  
> -	if (locked_ei && nr_shrunk == 0)
> -		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
> +	if (locked_ei && nr_shrunk == 0) {
> +		/*
> +		 * We try to reclaim objects from locked inode.  We don't
> +		 * want to discard too many objects from this inode because
> +		 * it might be accessed frequently.
> +		 */
> +		if (!nr_to_scan)
> +			nr_to_scan = 8;
> +		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &nr_to_scan);
> +	}
>  
>  	return nr_shrunk;
>  }
> @@ -1085,7 +1089,7 @@ void ext4_es_lru_del(struct inode *inode)
>  }
>  
>  static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> -				       int nr_to_scan)
> +				       int *nr_to_scan)
>  {
>  	struct inode *inode = &ei->vfs_inode;
>  	struct ext4_es_tree *tree = &ei->i_es_tree;
> @@ -1114,7 +1118,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
>  			rb_erase(&es->rb_node, &tree->root);
>  			ext4_es_free_extent(inode, es);
>  			nr_shrunk++;
> -			if (--nr_to_scan == 0)
> +			if (--(*nr_to_scan) == 0)
>  				break;
>  		}
>  	}
> -- 
> 1.7.9.7
> 
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html