linux-kernel - Re: [Cluster-devel] [PATCH 2/2] GFS2: Add a gfs2-specific prune_icache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57723FBC.6010207@redhat.com>
Date:	Tue, 28 Jun 2016 10:13:32 +0100
From:	Steven Whitehouse <swhiteho@...hat.com>
To:	Dave Chinner <david@...morbit.com>,
	Bob Peterson <rpeterso@...hat.com>
Cc:	cluster-devel@...hat.com, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Al Viro <viro@...iv.linux.org.uk>
Subject: Re: [Cluster-devel] [PATCH 2/2] GFS2: Add a gfs2-specific
 prune_icache_sb

Hi,

On 28/06/16 03:08, Dave Chinner wrote:
> On Fri, Jun 24, 2016 at 02:50:11PM -0500, Bob Peterson wrote:
>> This patch adds a new prune_icache_sb function for the VFS slab
>> shrinker to call. Trying to directly free the inodes from memory
>> might deadlock because it evicts inodes, which calls into DLM
>> to acquire the glock. The DLM, in turn, may block on a pending
>> fence operation, which may already be blocked on memory allocation
>> that caused the slab shrinker to be called in the first place.
>>
>> Signed-off-by: Bob Peterson <rpeterso@...hat.com>
> All sorts of problems with this.
>
>> ---
>>   fs/gfs2/incore.h     |  2 ++
>>   fs/gfs2/ops_fstype.c |  1 +
>>   fs/gfs2/quota.c      | 25 +++++++++++++++++++++++++
>>   fs/gfs2/super.c      | 13 +++++++++++++
>>   4 files changed, 41 insertions(+)
>>
>> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
>> index a6a3389..a367459 100644
>> --- a/fs/gfs2/incore.h
>> +++ b/fs/gfs2/incore.h
>> @@ -757,6 +757,8 @@ struct gfs2_sbd {
>>   
>>   	struct task_struct *sd_logd_process;
>>   	struct task_struct *sd_quotad_process;
>> +	int sd_iprune; /* inodes to prune */
>> +	spinlock_t sd_shrinkspin;
>>   
>>   	/* Quota stuff */
>>   
>> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
>> index 4546360..65a69be 100644
>> --- a/fs/gfs2/ops_fstype.c
>> +++ b/fs/gfs2/ops_fstype.c
>> @@ -95,6 +95,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
>>   	spin_lock_init(&sdp->sd_jindex_spin);
>>   	mutex_init(&sdp->sd_jindex_mutex);
>>   	init_completion(&sdp->sd_journal_ready);
>> +	spin_lock_init(&sdp->sd_shrinkspin);
>>   
>>   	INIT_LIST_HEAD(&sdp->sd_quota_list);
>>   	mutex_init(&sdp->sd_quota_mutex);
>> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
>> index ce7d69a..5810a2c 100644
>> --- a/fs/gfs2/quota.c
>> +++ b/fs/gfs2/quota.c
>> @@ -1528,14 +1528,39 @@ void gfs2_wake_up_statfs(struct gfs2_sbd *sdp) {
>>   int gfs2_quotad(void *data)
>>   {
>>   	struct gfs2_sbd *sdp = data;
>> +	struct super_block *sb = sdp->sd_vfs;
>>   	struct gfs2_tune *tune = &sdp->sd_tune;
>>   	unsigned long statfs_timeo = 0;
>>   	unsigned long quotad_timeo = 0;
>>   	unsigned long t = 0;
>>   	DEFINE_WAIT(wait);
>>   	int empty;
>> +	int rc;
>> +	struct shrink_control sc = {.gfp_mask = GFP_KERNEL, };
>>   
> Has not set PF_MEMALLOC context here, so memory reclaim operations that
> require allocation to complete won't get access to memory reserves and
> hence are likely to deadlock.
>
>>   	while (!kthread_should_stop()) {
>> +		/* TODO: Deal with shrinking of dcache */
> I don't think that's ever going to be allowed.
>
>> +		/* Prune any inode cache intended by the shrinker. */
>> +		spin_lock(&sdp->sd_shrinkspin);
>> +		if (sdp->sd_iprune > 0) {
>> +			sc.nr_to_scan = sdp->sd_iprune;
>> +			if (sc.nr_to_scan > 1024)
>> +				sc.nr_to_scan = 1024;
> Magic number warning. Where's that come from?
>
>> +			sdp->sd_iprune -= sc.nr_to_scan;
>> +			spin_unlock(&sdp->sd_shrinkspin);
>> +			rc = prune_icache_sb(sb, &sc);
>> +			if (rc < 0) {
>> +				spin_lock(&sdp->sd_shrinkspin);
>> +				sdp->sd_iprune = 0;
>> +				spin_unlock(&sdp->sd_shrinkspin);
>> +			}
>> +			if (sdp->sd_iprune) {
>> +				cond_resched();
>> +				continue;
>> +			}
>> +		} else {
>> +			spin_unlock(&sdp->sd_shrinkspin);
>> +		}
> Ok, so this only scans 1024 inodes per loop.  AFAICT, that's  only
> once every quotad/statfs timeout, which appears to be 60/30 jiffies
> by default.
>
> What this means is that memory pressure is not going to be able to
> evict gfs2 inodes at the rate required by memory pressure. If we
> have a million cached inodes, and we can only reclaim 1000 every
> ~50ms, then we've got a severe memory imblance issue. in that same
> time, we could be pulling thousands on inodes into the cache....
>
>>   
>>   		/* Update the master statfs file */
>>   		if (sdp->sd_statfs_force_sync) {
>> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
>> index 9b2ff353..75e8a85 100644
>> --- a/fs/gfs2/super.c
>> +++ b/fs/gfs2/super.c
>> @@ -1667,6 +1667,18 @@ static void gfs2_destroy_inode(struct inode *inode)
>>   	call_rcu(&inode->i_rcu, gfs2_i_callback);
>>   }
>>   
>> +static long gfs2_prune_icache_sb(struct super_block *sb,
>> +				 struct shrink_control *sc)
>> +{
>> +	struct gfs2_sbd *sdp;
>> +
>> +	sdp = sb->s_fs_info;
>> +	spin_lock(&sdp->sd_shrinkspin);
>> +	sdp->sd_iprune = sc->nr_to_scan + 1;
>> +	spin_unlock(&sdp->sd_shrinkspin);
>> +	return 0;
> And so this is even more problematic. sc->nr_to_scan is set to to
> either the shrinker batch size or SHRINK_BATCH (128), and the return
> value of the shrinker is the number of objects freed. When the
> shrinker calculates the number of objects that need to be freed from
> the slab cache, it breaks it up into batch size chunks, and it calls
> iteratively until it's scanned the number of objects it calculated.
> This is how the shrinkers convey memory pressure - the more calls
> made, the higher the memory pressure, the more the cache should be
> scanned for reclaim.
>
> The code above, if called iteratively, continually overwrites the
> prune count, so it is only ever going to have a maximum of the
> shrinker batch size. IOWs, if the kernel wants 100,000 inodes to be
> scaned for reclaim, it will make 100 calls into this function, but
> this will only pass one call on to the background thread, which then
> runs some time in the next 50-500ms to do that reclaim. Hence
> reclaim is simply not going to be effective under heavy inode cache
> memory pressure.
>
> And then there's an even bigger problem - the superblock shrinker
> has set the flags:
>
> 	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
>
> This new shrinker is not aware of either node based reclaim or memcg
> based reclaim, and hence this will break more than system balance -
> it will break containers or anything else that uses memcgs to limit
> memory usage of an application. It will probably also trash the gfs2
> inode cache system wide as prune_icache_sb() is given no directions
> for NUMA or memcg based reclaim.
>
> [ Actually, now that I look at the code and remind myself how the
> NUMA aware reclaim works, it's worse than I thought.  Passing a
> sc->nid = 0 to prune_icache_sb() means it will only attempt to
> shrink the inode cache on node 0. This means the gfs2 inode cache on
> nodes 1..N can never be reclaimed by memory pressure. IOWs, it
> simply does not work properly.... ]
>
> Finally, returning 0 immediately from the shrinker without doing
> anything essentially prevents the superblock shrinker from
> throttling inode allocation to the rate at which inodes can be
> reclaimed. Because reclaim will now skip all attempts to free
> inodes, the memory pressure feedback loop has been broken. e.g. if
> you have a pure inode cache workload (e.g. doing lots of
> concurrent finds) that is generating all the memory pressure,
> there's now nothing to throttle memory allocation to the rate that
> memory can be reclaimed. This will cause premature ENOMEM and OOM
> situations to occur, as memory reclaim will blow through without
> reclaiming any inodes and hence freeing no memory.
>
> Shrinker contexts are complex and the VFS shrinkers are much more
> complex than most people understand. I strongly recommend that
> filesystems leave inode cache reclaim to the superblock shrinker for
> the above reasons - it's far too easy for people to get badly wrong.
>
> If there are specific limitations on how inodes can be freed, then
> move the parts of inode *freeing* that cause problems to a different
> context via the ->evict/destroy callouts and trigger that external
> context processing on demand. That external context can just do bulk
> "if it is on the list then free it" processing, because the reclaim
> policy has already been executed to place that inode on the reclaim
> list.
That causes problems, unfortunately. As soon as the inode is taken out 
of the hash, then a new inode can be created by another process doing a 
lookup (for whatever reason - NFS fh lookup, scan for unlinked but still 
allocated inodes) and we'd land up with two in memory inodes referring 
to the same on disk inode at the same time. We rely on the VFS to 
provide mutual exclusion there. We did look at leaving the inode hashed 
and making the evict async, but that didn't work either - the exact 
issue escapes me, but Bob's memory is probably better than mine on this 
issue.

The issue is that there is a dependency loop involving the glocks 
required for inode deallocation, so we have:

  1. Event occurs which requires recovery somewhere in the cluster, 
which memory is short
  2. Userspace gets blocked waiting for memory (on inode shrinker) due 
to requirement for a glock, to complete eviction
  3. The glock is blocked on the DLM
  4. The DLM is blocked until recovery is complete -> deadlock

So we have various option on where to break this loop, the big question 
being where is the best place to do it, as the "obvious" ones seem to 
have various disadvantages to them,

Steve.


>
> This is essentially what XFS does, but it also uses the
> ->nr_cached_objects/->free_cached_objects() callouts in the
> superblock shrinker to provide the reclaim rate feedback mechanism
> required to throttle incoming memory allocations.
>
> Cheers,
>
> Dave.