linux-kernel - Re: [PATCH] vfs: Don't evict inode under the inode lru traversing context

Message-ID: <f942ec06-974b-b9dd-7238-63636e82f3fb@huawei.com>
Date: Tue, 6 Aug 2024 11:31:52 +0800
From: Zhihao Cheng <chengzhihao1@...wei.com>
To: Jan Kara <jack@...e.cz>, Zhihao Cheng <chengzhihao@...weicloud.com>
CC: <viro@...iv.linux.org.uk>, <brauner@...nel.org>, <tahsin@...gle.com>,
	<mjguzik@...il.com>, <error27@...il.com>, <tytso@....edu>,
	<rydercoding@...mail.com>, <hch@...radead.org>, <andreas.dilger@...el.com>,
	<richard@....at>, <linux-fsdevel@...r.kernel.org>,
	<linux-ext4@...r.kernel.org>, <linux-mtd@...ts.infradead.org>,
	<linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
	<yangerkun@...wei.com>, <wangzhaolong1@...wei.com>
Subject: Re: [PATCH] vfs: Don't evict inode under the inode lru traversing
 context

On 2024/8/5 23:30, Jan Kara wrote:
> On Mon 05-08-24 09:34:46, Zhihao Cheng wrote:
>> From: Zhihao Cheng <chengzhihao1@...wei.com>
>>
>> The inode reclaiming process (see function prune_icache_sb) first collects
>> all reclaimable inodes and marks them with the I_FREEING flag. From that
>> point on, other processes get stuck if they try to get these inodes
>> (see function find_inode_fast). The reclaiming process then destroys the
>> inodes with dispose_list(). Some filesystems (e.g. ext4 with the
>> ea_inode feature, ubifs with xattr) may do an inode lookup in the inode
>> evicting callback. If that lookup happens under the inode lru traversing
>> context, deadlocks may occur.
>>
>> Case 1: In function ext4_evict_inode(), the ea inode lookup can happen
>>          if the ea_inode feature is enabled; the lookup process then gets
>>          stuck under the evicting context like this:
>>
>>   1. File A has inode i_reg and an ea inode i_ea
>>   2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>>   3. Then, the following two processes run like this:
>>
>>      PA                              PB
>>   echo 2 > /proc/sys/vm/drop_caches
>>    shrink_slab
>>     prune_dcache_sb
>>     // i_reg is added into lru, lru->i_ea->i_reg
>>     prune_icache_sb
>>      list_lru_walk_one
>>       inode_lru_isolate
>>        i_ea->i_state |= I_FREEING // set inode state
>>       inode_lru_isolate
>>        __iget(i_reg)
>>        spin_unlock(&i_reg->i_lock)
>>        spin_unlock(lru_lock)
>>                                       rm file A
>>                                        i_reg->nlink = 0
>>        iput(i_reg) // i_reg->nlink is 0, do evict
>>         ext4_evict_inode
>>          ext4_xattr_delete_inode
>>           ext4_xattr_inode_dec_ref_all
>>            ext4_xattr_inode_iget
>>             ext4_iget(i_ea->i_ino)
>>              iget_locked
>>               find_inode_fast
>>                __wait_on_freeing_inode(i_ea) ----→ AA deadlock
>>      dispose_list // cannot be executed by prune_icache_sb
>>       wake_up_bit(&i_ea->i_state)
>>
>> Case 2: In the deleted-inode writing function ubifs_jnl_write_inode(), the
>>          file deleting process holds BASEHD's wbuf->io_mutex while getting
>>          the xattr inode. This can race with the inode reclaiming process
>>          (which may try to lock BASEHD's wbuf->io_mutex in the inode
>>          evicting function), resulting in an ABBA deadlock as follows:
>>
>>   1. File A has inode ia and a xattr(with inode ixa), regular file B has
>>      inode ib and a xattr.
>>   2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
>>   3. Then, the following three processes run like this:
>>
>>          PA                PB                        PC
>>                  echo 2 > /proc/sys/vm/drop_caches
>>                   shrink_slab
>>                    prune_dcache_sb
>>                    // ib and ia are added into lru, lru->ixa->ib->ia
>>                    prune_icache_sb
>>                     list_lru_walk_one
>>                      inode_lru_isolate
>>                       ixa->i_state |= I_FREEING // set inode state
>>                      inode_lru_isolate
>>                       __iget(ib)
>>                       spin_unlock(&ib->i_lock)
>>                       spin_unlock(lru_lock)
>>                                                     rm file B
>>                                                      ib->nlink = 0
>>   rm file A
>>    iput(ia)
>>     ubifs_evict_inode(ia)
>>      ubifs_jnl_delete_inode(ia)
>>       ubifs_jnl_write_inode(ia)
>>        make_reservation(BASEHD) // Lock wbuf->io_mutex
>>        ubifs_iget(ixa->i_ino)
>>         iget_locked
>>          find_inode_fast
>>           __wait_on_freeing_inode(ixa)
>>            |          iput(ib) // ib->nlink is 0, do evict
>>            |           ubifs_evict_inode
>>            |            ubifs_jnl_delete_inode(ib)
>>            ↓             ubifs_jnl_write_inode
>>       ABBA deadlock ←-----make_reservation(BASEHD)
>>                     dispose_list // cannot be executed by prune_icache_sb
>>                      wake_up_bit(&ixa->i_state)
>>
>> Fix it by forbidding inode evicting under the inode lru traversing
>> context. In details, we import a new inode state flag 'I_LRU_ISOLATING'
>> to pin inode without holding i_count under the inode lru traversing
>> context, the inode evicting process will wait until this flag is
>> cleared from i_state.
> 
> Thanks for the patch and sorry for not getting to this myself!  Let me
> rephrase the above paragraph a bit for better readability:
> 
> Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to
> pin the inode in memory while inode_lru_isolate() reclaims its pages
> instead of using ordinary inode reference. This way inode deletion cannot
> be triggered from inode_lru_isolate() thus avoiding the deadlock. evict()
> is made to wait for I_LRU_ISOLATING to be cleared before proceeding with
> inode cleanup.
> 

Looks clearer, thanks for the rephrasing, you're really nice.
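
To make the rephrased description concrete, here is a minimal userspace
sketch of the pin/wait idea. It is not the kernel code: a pthread mutex
stands in for inode->i_lock, a condition variable stands in for the bit
waitqueue, and the flag values are made up. inode_pin_lru_isolating() and
inode_unpin_lru_isolating() use the names suggested in this review;
inode_wait_for_lru_isolating(), evict_thread() and run_demo() are
hypothetical names chosen only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define I_FREEING       (1u << 0)
#define I_WILL_FREE     (1u << 1)
#define I_LRU_ISOLATING (1u << 2)

struct inode {
	pthread_mutex_t i_lock;  /* stands in for inode->i_lock */
	pthread_cond_t  i_wait;  /* stands in for the kernel bit waitqueue */
	unsigned int    i_state;
	bool            evicted;
};

/* Pin the inode without taking a reference. Caller holds i_lock. */
static void inode_pin_lru_isolating(struct inode *inode)
{
	assert(!(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE)));
	inode->i_state |= I_LRU_ISOLATING;
}

/* Drop the pin and wake up anybody waiting to evict the inode. */
static void inode_unpin_lru_isolating(struct inode *inode)
{
	pthread_mutex_lock(&inode->i_lock);
	assert(inode->i_state & I_LRU_ISOLATING);
	inode->i_state &= ~I_LRU_ISOLATING;
	pthread_cond_broadcast(&inode->i_wait);
	pthread_mutex_unlock(&inode->i_lock);
}

/* Hypothetical helper: block until the pin is gone. Caller holds i_lock. */
static void inode_wait_for_lru_isolating(struct inode *inode)
{
	while (inode->i_state & I_LRU_ISOLATING)
		pthread_cond_wait(&inode->i_wait, &inode->i_lock);
}

/* Models evict(): it must not clean up while the LRU walker has a pin. */
static void *evict_thread(void *arg)
{
	struct inode *inode = arg;

	pthread_mutex_lock(&inode->i_lock);
	inode->i_state |= I_FREEING;
	inode_wait_for_lru_isolating(inode);
	inode->evicted = true;  /* the actual cleanup would start here */
	pthread_mutex_unlock(&inode->i_lock);
	return NULL;
}

/* Returns 0 if eviction happened only after the pin was dropped. */
int run_demo(void)
{
	struct inode inode = {
		.i_lock = PTHREAD_MUTEX_INITIALIZER,
		.i_wait = PTHREAD_COND_INITIALIZER,
	};
	pthread_t t;
	bool evicted_while_pinned;

	/* LRU walker: pin under i_lock, then drop the lock to reclaim pages. */
	pthread_mutex_lock(&inode.i_lock);
	inode_pin_lru_isolating(&inode);
	pthread_mutex_unlock(&inode.i_lock);

	if (pthread_create(&t, NULL, evict_thread, &inode) != 0)
		return -1;

	/* Page reclaim would run here; eviction cannot have completed yet. */
	pthread_mutex_lock(&inode.i_lock);
	evicted_while_pinned = inode.evicted;
	pthread_mutex_unlock(&inode.i_lock);

	inode_unpin_lru_isolating(&inode);
	pthread_join(t, NULL);

	return (!evicted_while_pinned && inode.evicted) ? 0 : 1;
}
```

The point of the sketch is that the LRU walker never holds a reference, so
it never ends up calling iput() and triggering eviction itself; the evicting
side simply waits for the pin to go away. Build with -pthread if your libc
needs it.
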
>> @@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode)
>>   		this_cpu_dec(nr_unused);
>>   }
>>   
>> +static void inode_lru_isolating(struct inode *inode)
> 
> Perhaps call this inode_pin_lru_isolating()

Adopted. Will change in v2.
> 
>> +{
>> +	BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
>> +	inode->i_state |= I_LRU_ISOLATING;
>> +}
>> +
>> +static void inode_lru_finish_isolating(struct inode *inode)
> 
> And call this inode_unpin_lru_isolating()?

Adopted. Will change in v2.
> 
> Otherwise the patch looks good so feel free to add:
> 
> Reviewed-by: Jan Kara <jack@...e.cz>
> 
> 								Honza
>