Date:	Mon, 16 Mar 2015 11:34:41 -0400
From:	Josef Bacik <jbacik@...com>
To:	Jan Kara <jack@...e.cz>
CC:	<linux-fsdevel@...r.kernel.org>, <david@...morbit.com>,
	<viro@...iv.linux.org.uk>, <linux-kernel@...r.kernel.org>,
	Dave Chinner <dchinner@...hat.com>
Subject: Re: [PATCH 8/9] inode: convert per-sb inode list to a list_lru

On 03/16/2015 08:27 AM, Jan Kara wrote:
>    Hello,
>
> On Tue 10-03-15 15:45:23, Josef Bacik wrote:
>> From: Dave Chinner <dchinner@...hat.com>
>>
>> The per-superblock inode list and its lock are a bottleneck for systems
>> that cycle inodes in and out of cache concurrently. The global lock
>> is a limiting factor.
>>
>> Most of the additions to the sb inode list occur on the CPU that
>> allocated the inode, and most of the removals occur during evict()
>> calls as a result of memory reclaim. Both of these events are local
>> to the node that the inode belongs to, so it maps to the per-node
>> lists that the list_lru uses.
>>
>> There are several places where the inode list is walked. These can
>> be converted easily to use list_lru_walk() to do their work on each
>> inode on the list.
>    Err, after looking at the patch I'm not so sure list_lru is such a great
> fit for the inode sb list. We don't use any of its LRU features, so arguably
> it just makes things less readable; there's some memcg code which I hope
> isn't activated for the inode list, but how do I tell? We just use it as
> general per-numa-node linked lists, so if we go down the path of
> per-numa-node linked lists I'd prefer to have an abstraction just for that.
> Also there are some lifetime issues I'll comment on in the code.
>
> And do we really want per-numa-node lists? Why not per-cpu? Is it the
> memory cost we are concerned about? Or is it the cost of iterating all CPUs
> when wanting to traverse all inodes? The changelog doesn't say. The
> traversal is used on unmount (inode eviction, fsnotify mark eviction),
> quota on/off (adding / removing dquot pointers to / from inodes),
> drop_caches, sync (iterate all block devices). Sync is IMO the only one
> where CPU overhead might be a concern but we could create a better method
> for iterating all block devices. After all creating / removing block
> devices could happily use just ordinary lists. So CPU overhead shouldn't be
> a huge problem. Memory overhead is (list_head + spinlock) per-sb-per-cpu.
> That doesn't sound too terrible either.

Yeah, sorry, I had it in my head that the list was already per-cpu, but 
you're right, it's per-numa.

>
> When speaking about this with Andi Kleen, he was suggesting that maybe we
> want not per-cpu lists but one list per batch of CPUs (say 8; the actual
> number would be tuned during boot based on the number of CPUs actually
> present). That may be an option as well.
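
FWIW, a rough sketch of what such a batched layout might look like (purely 
illustrative -- the names are made up and the batch size would be picked at 
boot as suggested above):

	/* one lock + list per batch of CPUs rather than per CPU */
	struct sb_inode_list_batch {
		spinlock_t		lock;
		struct list_head	list;
	} ____cacheline_aligned_in_smp;

	struct sb_inode_lists {
		unsigned int			batch_shift;	/* e.g. 3 -> 8 CPUs per batch */
		struct sb_inode_list_batch	*batches;	/* ~nr_cpu_ids >> batch_shift entries */
	};

	static struct sb_inode_list_batch *this_cpu_batch(struct sb_inode_lists *l)
	{
		return &l->batches[raw_smp_processor_id() >> l->batch_shift];
	}

Insertions would only take the local batch lock, while a full traversal only 
has to walk the batches, of which there are far fewer than CPUs.
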
>
> I wouldn't like to block this series just for this patch since all the
> others are fine (modulo some minor comments). So could you repost the
> series without this patch so that it can get queued for the coming merge
> window? It should be easy to rebase patch 9/9 to not need this patch.
>

Yeah I can do that.

>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 2eb2436..d23ce6f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>>   }
>>   EXPORT_SYMBOL(__invalidate_device);
>>
>> -void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>> -{
>> -	struct inode *inode, *old_inode = NULL;
>> +struct bdev_iter {
>> +	void (*func)(struct block_device *, void *);
>> +	void *arg;
>> +	struct inode *toput_inode;
>> +};
>>
>> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>> -		struct address_space *mapping = inode->i_mapping;
>> +static enum lru_status
>> +bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
>> +	     spinlock_t *lock, void *cb_arg)
>> +{
>> +	struct bdev_iter *iter = cb_arg;
>> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>>
>> -		spin_lock(&inode->i_lock);
>> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> -		    mapping->nrpages == 0) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> -		__iget(inode);
>> +	spin_lock(&inode->i_lock);
>> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> +	    inode->i_mapping->nrpages == 0) {
>>   		spin_unlock(&inode->i_lock);
>> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> -		/*
>> -		 * We hold a reference to 'inode' so it couldn't have been
>> -		 * removed from s_inodes list while we dropped the
>> -		 * s_inode_list_lock  We cannot iput the inode now as we can
>> -		 * be holding the last reference and we cannot iput it under
>> -		 * s_inode_list_lock. So we keep the reference and iput it
>> -		 * later.
>> -		 */
>> -		iput(old_inode);
>> -		old_inode = inode;
>> +		return LRU_SKIP;
>> +	}
>> +	__iget(inode);
>> +	spin_unlock(&inode->i_lock);
>> +	spin_unlock(lock);
>>
>> -		func(I_BDEV(inode), arg);
>> +	iput(iter->toput_inode);
>> +	iter->toput_inode = inode;
>>
>> -		spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	}
>> -	spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> -	iput(old_inode);
>> +	iter->func(I_BDEV(inode), iter->arg);
>> +
>> +	/*
>> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
>> +	 * we have a reference to the current inode and so its next pointer is
>> +	 * guaranteed to be valid even though we dropped the list lock.
>> +	 */
>> +	spin_lock(lock);
>> +	return LRU_SKIP;
>> +}
>    So I actually think doing the iteration like this is buggy with the
> list_lru code. When we pin an inode by grabbing i_count, we are only sure
> the inode won't go away under us. However, there's nothing that protects
> against list modifications. So this method is only safe when combined with
> list_for_each[_entry](). Specifically, it does *not* work when combined
> with list_for_each_safe(), which __list_lru_walk_one() uses, because the
> cached next pointer may become invalid once we drop the lock. And I'd
> really hate to try to make these two work together - that would seem really
> fragile, since subtle changes to how list_lru iteration works could break
> the inode iteration.
>
> This is true for all the iterations over the inodes that are playing with
> i_count. It's not just about this particular case of blockdev iteration.
>
>> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
>> index a4e1a8f..a0cdc66 100644
>> --- a/fs/notify/inode_mark.c
>> +++ b/fs/notify/inode_mark.c
>> @@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
>>   	return ret;
>>   }
>>
>> -/**
>> - * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
>> - * @sb: superblock being unmounted.
>> - *
>> - * Called during unmount with no locks held, so needs to be safe against
>> - * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
>> - */
>> -void fsnotify_unmount_inodes(struct super_block *sb)
>> -{
>> -	struct inode *inode, *next_i, *need_iput = NULL;
>> -
>> -	spin_lock(&sb->s_inode_list_lock);
>> -	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
>> -		struct inode *need_iput_tmp;
>> -
>> -		/*
>> -		 * We cannot __iget() an inode in state I_FREEING,
>> -		 * I_WILL_FREE, or I_NEW which is fine because by that point
>> -		 * the inode cannot have any associated watches.
>> -		 */
>> -		spin_lock(&inode->i_lock);
>> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> +static enum lru_status
>> +fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
>> +		       spinlock_t *lock, void *cb_arg)
>> + {
>> +	struct inode **toput_inode = cb_arg;
>> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>> +
>> +	/* New or being freed inodes cannot have any associated watches. */
>> +	spin_lock(&inode->i_lock);
>> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> +		spin_unlock(&inode->i_lock);
>> +		return LRU_SKIP;
>> +	}
>>
>> -		/*
>> -		 * If i_count is zero, the inode cannot have any watches and
>> -		 * doing an __iget/iput with MS_ACTIVE clear would actually
>> -		 * evict all inodes with zero i_count from icache which is
>> -		 * unnecessarily violent and may in fact be illegal to do.
>> -		 */
>> -		if (!atomic_read(&inode->i_count)) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> -
>> -		need_iput_tmp = need_iput;
>> -		need_iput = NULL;
>> -
>> -		/* In case fsnotify_inode_delete() drops a reference. */
>> -		if (inode != need_iput_tmp)
>> -			__iget(inode);
>> -		else
>> -			need_iput_tmp = NULL;
>> +	/* If i_count is zero, the inode cannot have any watches */
>> +	if (!atomic_read(&inode->i_count)) {
>>   		spin_unlock(&inode->i_lock);
>> +		return LRU_SKIP;
>> +	}
>>
>> -		/* In case the dropping of a reference would nuke next_i. */
>> -		while (&next_i->i_sb_list != &sb->s_inodes) {
>> -			spin_lock(&next_i->i_lock);
>> -			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>> -						atomic_read(&next_i->i_count)) {
>> -				__iget(next_i);
>> -				need_iput = next_i;
>> -				spin_unlock(&next_i->i_lock);
>> -				break;
>> -			}
>> -			spin_unlock(&next_i->i_lock);
>> -			next_i = list_entry(next_i->i_sb_list.next,
>> -						struct inode, i_sb_list);
>> -		}
>> +	__iget(inode);
>> +	spin_unlock(&inode->i_lock);
>> +	spin_unlock(lock);
>>
>> -		/*
>> -		 * We can safely drop s_inode_list_lock here because either
>> -		 * we actually hold references on both inode and next_i or
>> -		 * end of list.  Also no new inodes will be added since the
>> -		 * umount has begun.
>> -		 */
>> -		spin_unlock(&sb->s_inode_list_lock);
>> +	iput(*toput_inode);
>> +	*toput_inode = inode;
>>
>> -		if (need_iput_tmp)
>> -			iput(need_iput_tmp);
>> +	/* for each watch, send FS_UNMOUNT and then remove it */
>> +	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> +	fsnotify_inode_delete(inode);
>>
>> -		/* for each watch, send FS_UNMOUNT and then remove it */
>> -		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> +	/*
>> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
>> +	 * we have a reference to the current inode and so its next pointer is
>> +	 * guaranteed to be valid even though we dropped the list lock.
>> +	 */
>> +	spin_lock(lock);
>> +	return LRU_SKIP;
>> +}
>    So in the original code of fsnotify_unmount_inodes() you can see the
> hoops you have to jump through when iterating the inode list with
> list_for_each_entry_safe(). Deleting that code without deeper thought
> wasn't a good idea ;).
>
> As a side note, it should be possible to convert this code to use just
> list_for_each() and avoid all that crap, but it will require me to delve into
> fsnotify magic again to verify it's correct. I guess I'll leave that until
> after your patches are queued so that I don't make life harder for you.
>

Well, these two cases are just bugs; we are supposed to return LRU_RETRY 
every time we drop the lru list lock so that the walk restarts from the 
beginning.
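
Untested, but roughly something like this at the tail of bdev_iter_cb(), and 
the same for fsnotify_unmount_inode():

	iter->func(I_BDEV(inode), iter->arg);

	/*
	 * We dropped the lru list lock above, so the walker's cached next
	 * pointer may be stale.  Returning LRU_RETRY makes it restart the
	 * walk instead of continuing from a possibly-freed entry.
	 */
	spin_lock(lock);
	return LRU_RETRY;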

But I agree with your other points.  I'll just drop this one, find/create a 
batched per-cpu list to use instead, and do that separately so we can still 
get the meat of this patchset in.  Thanks,

Josef
