[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f942ec06-974b-b9dd-7238-63636e82f3fb@huawei.com>
Date: Tue, 6 Aug 2024 11:31:52 +0800
From: Zhihao Cheng <chengzhihao1@...wei.com>
To: Jan Kara <jack@...e.cz>, Zhihao Cheng <chengzhihao@...weicloud.com>
CC: <viro@...iv.linux.org.uk>, <brauner@...nel.org>, <tahsin@...gle.com>,
<mjguzik@...il.com>, <error27@...il.com>, <tytso@....edu>,
<rydercoding@...mail.com>, <hch@...radead.org>, <andreas.dilger@...el.com>,
<richard@....at>, <linux-fsdevel@...r.kernel.org>,
<linux-ext4@...r.kernel.org>, <linux-mtd@...ts.infradead.org>,
<linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
<yangerkun@...wei.com>, <wangzhaolong1@...wei.com>
Subject: Re: [PATCH] vfs: Don't evict inode under the inode lru traversing
context
在 2024/8/5 23:30, Jan Kara 写道:
> On Mon 05-08-24 09:34:46, Zhihao Cheng wrote:
>> From: Zhihao Cheng <chengzhihao1@...wei.com>
>>
>> The inode reclaiming process(See function prune_icache_sb) collects all
>> reclaimable inodes and mark them with I_FREEING flag at first, at that
>> time, other processes will be stuck if they try getting these inodes
>> (See function find_inode_fast), then the reclaiming process destroy the
>> inodes by function dispose_list(). Some filesystems(eg. ext4 with
>> ea_inode feature, ubifs with xattr) may do inode lookup in the inode
>> evicting callback function, if the inode lookup is operated under the
>> inode lru traversing context, deadlock problems may happen.
>>
>> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>> if ea_inode feature is enabled, the lookup process will be stuck
>> under the evicting context like this:
>>
>> 1. File A has inode i_reg and an ea inode i_ea
>> 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>> 3. Then, following three processes running like this:
>>
>> PA PB
>> echo 2 > /proc/sys/vm/drop_caches
>> shrink_slab
>> prune_dcache_sb
>> // i_reg is added into lru, lru->i_ea->i_reg
>> prune_icache_sb
>> list_lru_walk_one
>> inode_lru_isolate
>> i_ea->i_state |= I_FREEING // set inode state
>> inode_lru_isolate
>> __iget(i_reg)
>> spin_unlock(&i_reg->i_lock)
>> spin_unlock(lru_lock)
>> rm file A
>> i_reg->nlink = 0
>> iput(i_reg) // i_reg->nlink is 0, do evict
>> ext4_evict_inode
>> ext4_xattr_delete_inode
>> ext4_xattr_inode_dec_ref_all
>> ext4_xattr_inode_iget
>> ext4_iget(i_ea->i_ino)
>> iget_locked
>> find_inode_fast
>> __wait_on_freeing_inode(i_ea) ----→ AA deadlock
>> dispose_list // cannot be executed by prune_icache_sb
>> wake_up_bit(&i_ea->i_state)
>>
>> Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
>> deleting process holds BASEHD's wbuf->io_mutex while getting the
>> xattr inode, which could race with inode reclaiming process(The
>> reclaiming process could try locking BASEHD's wbuf->io_mutex in
>> inode evicting function), then an ABBA deadlock problem would
>> happen as following:
>>
>> 1. File A has inode ia and a xattr(with inode ixa), regular file B has
>> inode ib and a xattr.
>> 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
>> 3. Then, following three processes running like this:
>>
>> PA PB PC
>> echo 2 > /proc/sys/vm/drop_caches
>> shrink_slab
>> prune_dcache_sb
>> // ib and ia are added into lru, lru->ixa->ib->ia
>> prune_icache_sb
>> list_lru_walk_one
>> inode_lru_isolate
>> ixa->i_state |= I_FREEING // set inode state
>> inode_lru_isolate
>> __iget(ib)
>> spin_unlock(&ib->i_lock)
>> spin_unlock(lru_lock)
>> rm file B
>> ib->nlink = 0
>> rm file A
>> iput(ia)
>> ubifs_evict_inode(ia)
>> ubifs_jnl_delete_inode(ia)
>> ubifs_jnl_write_inode(ia)
>> make_reservation(BASEHD) // Lock wbuf->io_mutex
>> ubifs_iget(ixa->i_ino)
>> iget_locked
>> find_inode_fast
>> __wait_on_freeing_inode(ixa)
>> | iput(ib) // ib->nlink is 0, do evict
>> | ubifs_evict_inode
>> | ubifs_jnl_delete_inode(ib)
>> ↓ ubifs_jnl_write_inode
>> ABBA deadlock ←-----make_reservation(BASEHD)
>> dispose_list // cannot be executed by prune_icache_sb
>> wake_up_bit(&ixa->i_state)
>>
>> Fix it by forbidding inode evicting under the inode lru traversing
>> context. In details, we import a new inode state flag 'I_LRU_ISOLATING'
>> to pin inode without holding i_count under the inode lru traversing
>> context, the inode evicting process will wait until this flag is
>> cleared from i_state.
>
> Thanks for the patch and sorry for not getting to this myself! Let me
> rephrase the above paragraph a bit for better readability:
>
> Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to
> pin the inode in memory while inode_lru_isolate() reclaims its pages
> instead of using ordinary inode reference. This way inode deletion cannot
> be triggered from inode_lru_isolate() thus avoiding the deadlock. evict()
> is made to wait for I_LRU_ISOLATING to be cleared before proceeding with
> inode cleanup.
>
Looks clearer, thanks for the rephrasing, you're really nice.
>> @@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode)
>> this_cpu_dec(nr_unused);
>> }
>>
>> +static void inode_lru_isolating(struct inode *inode)
>
> Perhaps call this inode_pin_lru_isolating()
Adopt. Will change in v2.
>
>> +{
>> + BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
>> + inode->i_state |= I_LRU_ISOLATING;
>> +}
>> +
>> +static void inode_lru_finish_isolating(struct inode *inode)
>
> And call this inode_unpin_lru_isolating()?
Adopt. Will change in v2.
>
> Otherwise the patch looks good so feel free to add:
>
> Reviewed-by: Jan Kara <jack@...e.cz>
>
> Honza
>
Powered by blists - more mailing lists