Message-ID: <5506F811.5000004@fb.com>
Date: Mon, 16 Mar 2015 11:34:41 -0400
From: Josef Bacik <jbacik@...com>
To: Jan Kara <jack@...e.cz>
CC: <linux-fsdevel@...r.kernel.org>, <david@...morbit.com>,
<viro@...iv.linux.org.uk>, <linux-kernel@...r.kernel.org>,
Dave Chinner <dchinner@...hat.com>
Subject: Re: [PATCH 8/9] inode: convert per-sb inode list to a list_lru
On 03/16/2015 08:27 AM, Jan Kara wrote:
> Hello,
>
> On Tue 10-03-15 15:45:23, Josef Bacik wrote:
>> From: Dave Chinner <dchinner@...hat.com>
>>
>> The per-superblock inode list and lock is a bottleneck for systems
>> that cycle inodes in and out of cache concurrently. The global lock
>> is a limiting factor.
>>
>> Most of the additions to the sb inode list occur on the CPU that
>> allocated the inode, and most of the removals occur during evict()
>> calls as a result of memory reclaim. Both of these events are local
>> to the node that the inode belongs to, so it maps to the per-node
>> lists that the list_lru uses.
>>
>> There are several places where the inode list is walked. These can
>> be converted easily to use list_lru_walk() to do their work on each
>> inode on the list.
> Err, after looking at the patch I'm not so sure list_lru is such a great
> fit for the inode sb list. We don't use any of its LRU features, so
> arguably that just makes things less readable, and there's some memcg code
> which I hope isn't activated for the inode list - but how do I tell? We
> just use it as general per-numa-node linked lists, so if we go down the
> path of per-numa-node linked lists I'd prefer to have an abstraction just
> for that. Also there are some lifetime issues I'll comment on in the code.
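A dedicated per-node abstraction like you describe would presumably look
something like this (completely hypothetical sketch - none of these names
exist, it's just to show the shape of the thing):

struct node_list {
	spinlock_t	 lock;
	struct list_head head;
} ____cacheline_aligned_in_smp;

struct sb_inode_lists {
	struct node_list *node;	/* nr_node_ids entries, one per NUMA node */
};

static void sb_inode_list_add(struct sb_inode_lists *lists,
			      struct inode *inode)
{
	struct node_list *nl = &lists->node[page_to_nid(virt_to_page(inode))];

	spin_lock(&nl->lock);
	list_add(&inode->i_sb_list, &nl->head);
	spin_unlock(&nl->lock);
}

That would keep the per-node scaling without dragging in any of the
LRU/memcg machinery.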
>
> And do we really want per-numa-node lists? Why not per-cpu? Is it the
> memory cost we are concerned about? Or is it the cost of iterating all CPUs
> when wanting to traverse all inodes? The changelog doesn't say. The
> traversal is used on unmount (inode eviction, fsnotify mark eviction),
> quota on/off (adding / removing dquot pointers to / from inodes),
> drop_caches, and sync (iterate all block devices). Sync is IMO the only one
> where CPU overhead might be a concern, but we could create a better method
> for iterating all block devices. After all, creating / removing block
> devices could happily use just ordinary lists. So CPU overhead shouldn't be
> a huge problem. Memory overhead is (list_head + spinlock) per-sb-per-cpu.
> That doesn't sound too terrible either.
Yeah sorry, I had it in my head that the list was already per-cpu, but
you're right, it's per-numa.
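For reference, a plain per-cpu variant would look something like this
(illustrative only - the type and function names below are made up, this is
not what the patch does):

struct sb_inode_pcpu_list {
	spinlock_t	 lock;
	struct list_head head;
};

static struct sb_inode_pcpu_list __percpu *sb_inode_list_alloc(void)
{
	struct sb_inode_pcpu_list __percpu *lists;
	int cpu;

	lists = alloc_percpu(struct sb_inode_pcpu_list);
	if (!lists)
		return NULL;
	for_each_possible_cpu(cpu) {
		struct sb_inode_pcpu_list *l = per_cpu_ptr(lists, cpu);

		spin_lock_init(&l->lock);
		INIT_LIST_HEAD(&l->head);
	}
	return lists;
}

As you say, the memory cost is one spinlock + list_head per sb per possible
CPU.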
>
> When speaking about this with Andi Kleen, he was suggesting that maybe we
> don't want per-cpu lists but rather a list per batch of CPUs (say 8; the
> actual number would be tuned during boot based on the number of actual
> CPUs). That may be an option as well.
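Something like this, I guess (rough sketch of the batching idea; the names
and the batch size are placeholders, nothing here is an existing interface):

#define INODE_LIST_CPU_BATCH	8

static unsigned int nr_inode_lists __read_mostly;

static void __init inode_list_batch_init(void)
{
	nr_inode_lists = DIV_ROUND_UP(num_possible_cpus(),
				      INODE_LIST_CPU_BATCH);
}

/* which of the batched lists the current CPU should use */
static inline unsigned int inode_list_index(void)
{
	return (raw_smp_processor_id() / INODE_LIST_CPU_BATCH) %
		nr_inode_lists;
}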
>
> I wouldn't like to block this series just for this patch since all the
> others are fine (modulo some minor comments). So could you repost the
> series without this patch so that it can get queued for the coming merge
> window? It should be easy to rebase patch 9/9 to not need this patch.
>
Yeah I can do that.
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 2eb2436..d23ce6f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>> }
>> EXPORT_SYMBOL(__invalidate_device);
>>
>> -void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>> -{
>> - struct inode *inode, *old_inode = NULL;
>> +struct bdev_iter {
>> + void (*func)(struct block_device *, void *);
>> + void *arg;
>> + struct inode *toput_inode;
>> +};
>>
>> - spin_lock(&blockdev_superblock->s_inode_list_lock);
>> - list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>> - struct address_space *mapping = inode->i_mapping;
>> +static enum lru_status
>> +bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
>> + spinlock_t *lock, void *cb_arg)
>> +{
>> + struct bdev_iter *iter = cb_arg;
>> + struct inode *inode = container_of(item, struct inode, i_sb_list);
>>
>> - spin_lock(&inode->i_lock);
>> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> - mapping->nrpages == 0) {
>> - spin_unlock(&inode->i_lock);
>> - continue;
>> - }
>> - __iget(inode);
>> + spin_lock(&inode->i_lock);
>> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> + inode->i_mapping->nrpages == 0) {
>> spin_unlock(&inode->i_lock);
>> - spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> - /*
>> - * We hold a reference to 'inode' so it couldn't have been
>> - * removed from s_inodes list while we dropped the
>> - * s_inode_list_lock We cannot iput the inode now as we can
>> - * be holding the last reference and we cannot iput it under
>> - * s_inode_list_lock. So we keep the reference and iput it
>> - * later.
>> - */
>> - iput(old_inode);
>> - old_inode = inode;
>> + return LRU_SKIP;
>> + }
>> + __iget(inode);
>> + spin_unlock(&inode->i_lock);
>> + spin_unlock(lock);
>>
>> - func(I_BDEV(inode), arg);
>> + iput(iter->toput_inode);
>> + iter->toput_inode = inode;
>>
>> - spin_lock(&blockdev_superblock->s_inode_list_lock);
>> - }
>> - spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> - iput(old_inode);
>> + iter->func(I_BDEV(inode), iter->arg);
>> +
>> + /*
>> + * Even though we have dropped the lock here, we can return LRU_SKIP as
>> + * we have a reference to the current inode and so its next pointer is
>> + * guaranteed to be valid even though we dropped the list lock.
>> + */
>> + spin_lock(lock);
>> + return LRU_SKIP;
>> +}
> So I actually think doing the iteration like this is buggy with the
> list_lru code. When we pin an inode by grabbing i_count, we are only sure
> the inode won't go away under us. However, there's nothing which protects
> against list modifications. So this method is only safe when combined with
> list_for_each[_entry](). Specifically, it does *not* work when combined
> with list_for_each_safe(), which __list_lru_walk_one() uses, because the
> next pointer that is cached may become invalid once we drop the lock. And
> I'd really hate to try to make these two work together - that would seem
> really fragile, since subtle changes to how list_lru iteration works could
> break the inode iteration.
>
> This is true for all the iterations over the inodes that are playing with
> i_count. It's not just about this particular case of blockdev iteration.
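To make the failure mode concrete (simplified illustration with a
hypothetical helper, not code being proposed):

static void walk_with_blocking_body(struct list_head *head, spinlock_t *lock)
{
	struct inode *inode, *next;

	spin_lock(lock);
	list_for_each_entry_safe(inode, next, head, i_sb_list) {
		/* 'next' has already been cached at this point */
		spin_unlock(lock);
		/*
		 * Anything that sleeps or triggers eviction here (iput(),
		 * the fsnotify calls, ...) can free or unlink 'next' while
		 * the lock is dropped.
		 */
		spin_lock(lock);
		/* the loop now advances through the possibly-freed 'next' */
	}
	spin_unlock(lock);
}

With plain list_for_each_entry(), the loop only reads inode->i_sb_list.next
after the lock has been retaken, and the reference we hold keeps the current
inode on the list, so its next pointer stays valid - which is why the old
open-coded loops could get away with dropping the lock.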
>
>> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
>> index a4e1a8f..a0cdc66 100644
>> --- a/fs/notify/inode_mark.c
>> +++ b/fs/notify/inode_mark.c
>> @@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
>> return ret;
>> }
>>
>> -/**
>> - * fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
>> - * @sb: superblock being unmounted.
>> - *
>> - * Called during unmount with no locks held, so needs to be safe against
>> - * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
>> - */
>> -void fsnotify_unmount_inodes(struct super_block *sb)
>> -{
>> - struct inode *inode, *next_i, *need_iput = NULL;
>> -
>> - spin_lock(&sb->s_inode_list_lock);
>> - list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
>> - struct inode *need_iput_tmp;
>> -
>> - /*
>> - * We cannot __iget() an inode in state I_FREEING,
>> - * I_WILL_FREE, or I_NEW which is fine because by that point
>> - * the inode cannot have any associated watches.
>> - */
>> - spin_lock(&inode->i_lock);
>> - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> - spin_unlock(&inode->i_lock);
>> - continue;
>> - }
>> +static enum lru_status
>> +fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
>> + spinlock_t *lock, void *cb_arg)
>> + {
>> + struct inode **toput_inode = cb_arg;
>> + struct inode *inode = container_of(item, struct inode, i_sb_list);
>> +
>> + /* New or being freed inodes cannot have any associated watches. */
>> + spin_lock(&inode->i_lock);
>> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> + spin_unlock(&inode->i_lock);
>> + return LRU_SKIP;
>> + }
>>
>> - /*
>> - * If i_count is zero, the inode cannot have any watches and
>> - * doing an __iget/iput with MS_ACTIVE clear would actually
>> - * evict all inodes with zero i_count from icache which is
>> - * unnecessarily violent and may in fact be illegal to do.
>> - */
>> - if (!atomic_read(&inode->i_count)) {
>> - spin_unlock(&inode->i_lock);
>> - continue;
>> - }
>> -
>> - need_iput_tmp = need_iput;
>> - need_iput = NULL;
>> -
>> - /* In case fsnotify_inode_delete() drops a reference. */
>> - if (inode != need_iput_tmp)
>> - __iget(inode);
>> - else
>> - need_iput_tmp = NULL;
>> + /* If i_count is zero, the inode cannot have any watches */
>> + if (!atomic_read(&inode->i_count)) {
>> spin_unlock(&inode->i_lock);
>> + return LRU_SKIP;
>> + }
>>
>> - /* In case the dropping of a reference would nuke next_i. */
>> - while (&next_i->i_sb_list != &sb->s_inodes) {
>> - spin_lock(&next_i->i_lock);
>> - if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>> - atomic_read(&next_i->i_count)) {
>> - __iget(next_i);
>> - need_iput = next_i;
>> - spin_unlock(&next_i->i_lock);
>> - break;
>> - }
>> - spin_unlock(&next_i->i_lock);
>> - next_i = list_entry(next_i->i_sb_list.next,
>> - struct inode, i_sb_list);
>> - }
>> + __iget(inode);
>> + spin_unlock(&inode->i_lock);
>> + spin_unlock(lock);
>>
>> - /*
>> - * We can safely drop s_inode_list_lock here because either
>> - * we actually hold references on both inode and next_i or
>> - * end of list. Also no new inodes will be added since the
>> - * umount has begun.
>> - */
>> - spin_unlock(&sb->s_inode_list_lock);
>> + iput(*toput_inode);
>> + *toput_inode = inode;
>>
>> - if (need_iput_tmp)
>> - iput(need_iput_tmp);
>> + /* for each watch, send FS_UNMOUNT and then remove it */
>> + fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> + fsnotify_inode_delete(inode);
>>
>> - /* for each watch, send FS_UNMOUNT and then remove it */
>> - fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> + /*
>> + * Even though we have dropped the lock here, we can return LRU_SKIP as
>> + * we have a reference to the current inode and so its next pointer is
>> + * guaranteed to be valid even though we dropped the list lock.
>> + */
>> + spin_lock(lock);
>> + return LRU_SKIP;
>> +}
> So in the original code of fsnotify_unmount_inodes() you can see the
> hoops you have to jump through when having to iterate the inode list with
> list_for_each_entry_safe(). Deleting that code without deeper thought
> wasn't a good idea ;).
>
> As a side note, it should be possible to convert this code to use just
> list_for_each() and avoid all that crap, but it will require me to delve
> into fsnotify magic again to verify it's correct. I guess I'll leave that
> until after your patches are queued so that I don't make life harder for
> you.
>
Well, these two cases are just bugs; we are supposed to return LRU_RETRY
every time we drop the lru list lock so that we loop back to the beginning.
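For the bdev case that would mean something like this (untested sketch,
reusing struct bdev_iter from the patch - same logic, just with the return
value fixed):

static enum lru_status
bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
	     spinlock_t *lock, void *cb_arg)
{
	struct bdev_iter *iter = cb_arg;
	struct inode *inode = container_of(item, struct inode, i_sb_list);

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
	    inode->i_mapping->nrpages == 0) {
		spin_unlock(&inode->i_lock);
		return LRU_SKIP;
	}
	__iget(inode);
	spin_unlock(&inode->i_lock);
	spin_unlock(lock);

	iput(iter->toput_inode);
	iter->toput_inode = inode;

	iter->func(I_BDEV(inode), iter->arg);

	/*
	 * We dropped the list lock above, so the walker's cached next
	 * pointer can no longer be trusted; returning LRU_RETRY makes the
	 * walk restart from the beginning of the list instead.
	 */
	spin_lock(lock);
	return LRU_RETRY;
}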
But I agree with your other points. I'll just drop this one, find/create a
batched per-cpu list to use instead, and do that separately so we can still
get the meat of this patchset in. Thanks,
Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/