linux-kernel - Re: [PATCH] vfs: Don't evict inode under the inode lru traversing context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240805153018.3sju3nowiqggykvf@quack3>
Date: Mon, 5 Aug 2024 17:30:18 +0200
From: Jan Kara <jack@...e.cz>
To: Zhihao Cheng <chengzhihao@...weicloud.com>
Cc: viro@...iv.linux.org.uk, brauner@...nel.org, tahsin@...gle.com,
	mjguzik@...il.com, error27@...il.com, tytso@....edu,
	rydercoding@...mail.com, jack@...e.cz, hch@...radead.org,
	andreas.dilger@...el.com, richard@....at,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	linux-mtd@...ts.infradead.org, linux-kernel@...r.kernel.org,
	chengzhihao1@...wei.com, yi.zhang@...wei.com, yangerkun@...wei.com,
	wangzhaolong1@...wei.com
Subject: Re: [PATCH] vfs: Don't evict inode under the inode lru traversing
 context

On Mon 05-08-24 09:34:46, Zhihao Cheng wrote:
> From: Zhihao Cheng <chengzhihao1@...wei.com>
> 
> The inode reclaiming process(See function prune_icache_sb) collects all
> reclaimable inodes and mark them with I_FREEING flag at first, at that
> time, other processes will be stuck if they try getting these inodes
> (See function find_inode_fast), then the reclaiming process destroy the
> inodes by function dispose_list(). Some filesystems(eg. ext4 with
> ea_inode feature, ubifs with xattr) may do inode lookup in the inode
> evicting callback function, if the inode lookup is operated under the
> inode lru traversing context, deadlock problems may happen.
> 
> Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
>         if ea_inode feature is enabled, the lookup process will be stuck
> 	under the evicting context like this:
> 
>  1. File A has inode i_reg and an ea inode i_ea
>  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
>  3. Then, following three processes running like this:
> 
>     PA                              PB
>  echo 2 > /proc/sys/vm/drop_caches
>   shrink_slab
>    prune_dcache_sb
>    // i_reg is added into lru, lru->i_ea->i_reg
>    prune_icache_sb
>     list_lru_walk_one
>      inode_lru_isolate
>       i_ea->i_state |= I_FREEING // set inode state
>      inode_lru_isolate
>       __iget(i_reg)
>       spin_unlock(&i_reg->i_lock)
>       spin_unlock(lru_lock)
>                                      rm file A
>                                       i_reg->nlink = 0
>       iput(i_reg) // i_reg->nlink is 0, do evict
>        ext4_evict_inode
>         ext4_xattr_delete_inode
>          ext4_xattr_inode_dec_ref_all
>           ext4_xattr_inode_iget
>            ext4_iget(i_ea->i_ino)
>             iget_locked
>              find_inode_fast
>               __wait_on_freeing_inode(i_ea) ----→ AA deadlock
>     dispose_list // cannot be executed by prune_icache_sb
>      wake_up_bit(&i_ea->i_state)
> 
> Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
>         deleting process holds BASEHD's wbuf->io_mutex while getting the
> 	xattr inode, which could race with inode reclaiming process(The
>         reclaiming process could try locking BASEHD's wbuf->io_mutex in
> 	inode evicting function), then an ABBA deadlock problem would
> 	happen as following:
> 
>  1. File A has inode ia and a xattr(with inode ixa), regular file B has
>     inode ib and a xattr.
>  2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
>  3. Then, following three processes running like this:
> 
>         PA                PB                        PC
>                 echo 2 > /proc/sys/vm/drop_caches
>                  shrink_slab
>                   prune_dcache_sb
>                   // ib and ia are added into lru, lru->ixa->ib->ia
>                   prune_icache_sb
>                    list_lru_walk_one
>                     inode_lru_isolate
>                      ixa->i_state |= I_FREEING // set inode state
>                     inode_lru_isolate
>                      __iget(ib)
>                      spin_unlock(&ib->i_lock)
>                      spin_unlock(lru_lock)
>                                                    rm file B
>                                                     ib->nlink = 0
>  rm file A
>   iput(ia)
>    ubifs_evict_inode(ia)
>     ubifs_jnl_delete_inode(ia)
>      ubifs_jnl_write_inode(ia)
>       make_reservation(BASEHD) // Lock wbuf->io_mutex
>       ubifs_iget(ixa->i_ino)
>        iget_locked
>         find_inode_fast
>          __wait_on_freeing_inode(ixa)
>           |          iput(ib) // ib->nlink is 0, do evict
>           |           ubifs_evict_inode
>           |            ubifs_jnl_delete_inode(ib)
>           ↓             ubifs_jnl_write_inode
>      ABBA deadlock ←-----make_reservation(BASEHD)
>                    dispose_list // cannot be executed by prune_icache_sb
>                     wake_up_bit(&ixa->i_state)
> 
> Fix it by forbidding inode evicting under the inode lru traversing
> context. In details, we import a new inode state flag 'I_LRU_ISOLATING'
> to pin inode without holding i_count under the inode lru traversing
> context, the inode evicting process will wait until this flag is
> cleared from i_state.

Thanks for the patch and sorry for not getting to this myself!  Let me
rephrase the above paragraph a bit for better readability:

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to
pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion cannot
be triggered from inode_lru_isolate() thus avoiding the deadlock. evict()
is made to wait for I_LRU_ISOLATING to be cleared before proceeding with
inode cleanup.

> @@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode)
>  		this_cpu_dec(nr_unused);
>  }
>  
> +static void inode_lru_isolating(struct inode *inode)

Perhaps call this inode_pin_lru_isolating()

> +{
> +	BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
> +	inode->i_state |= I_LRU_ISOLATING;
> +}
> +
> +static void inode_lru_finish_isolating(struct inode *inode)

And call this inode_unpin_lru_isolating()?

Otherwise the patch looks good so feel free to add:

Reviewed-by: Jan Kara <jack@...e.cz>

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR