[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <yakewaqynmapatlh3s45huq6dutkkkcdj26tqpfx6yllsjmyie@rh6xthl5pwkb>
Date: Sat, 20 Jul 2024 12:42:15 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Zhihao Cheng <chengzhihao@...weicloud.com>
Cc: Jan Kara <jack@...e.cz>, Theodore Ts'o <tytso@....edu>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>, linux-ext4@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Christoph Hellwig <hch@...radead.org>, linux-mtd <linux-mtd@...ts.infradead.org>,
Richard Weinberger <richard@....at>, "zhangyi (F)" <yi.zhang@...wei.com>,
yangerkun <yangerkun@...wei.com>, "wangzhaolong (A)" <wangzhaolong1@...wei.com>
Subject: Re: [BUG REPORT] potential deadlock in inode evicting under the
inode lru traversing context on ext4 and ubifs
On Fri, Jul 19, 2024 at 11:21:51AM +0800, Zhihao Cheng wrote:
> 在 2024/7/18 21:40, Jan Kara 写道:
> > I'm pondering about the best way to fix this. Maybe we could handle the
> > need for inode pinning in inode_lru_isolate() in a similar way as in
> > writeback code so that last iput() cannot happen from inode_lru_isolate().
> > In writeback we use I_SYNC flag to pin the inode and evict() waits for this
> > flag to clear. I'll probably sleep to it and if I won't find it too
> > disgusting to live tomorrow, I can code it.
> >
>
> I guess that you may modify like this:
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..5b1a9b23f53f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);
>
> static void __inode_add_lru(struct inode *inode, bool rotate)
> {
> - if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING |
> I_WILL_FREE))
> + if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE
> | I_PINING))
> return;
> if (atomic_read(&inode->i_count))
> return;
> @@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct
> list_head *item,
> * be under pressure before the cache inside the highmem zone.
> */
> if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
> - __iget(inode);
> + inode->i_state |= I_PINING;
> spin_unlock(&inode->i_lock);
> spin_unlock(lru_lock);
> if (remove_inode_buffers(inode)) {
> @@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct
> list_head *item,
> __count_vm_events(PGINODESTEAL, reap);
> mm_account_reclaimed_pages(reap);
> }
> - iput(inode);
> + spin_lock(&inode->i_lock);
> + inode->i_state &= ~I_PINING;
> + wake_up_bit(&inode->i_state, __I_PINING);
> + spin_unlock(&inode->i_lock);
> spin_lock(lru_lock);
> return LRU_RETRY;
> }
> @@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
> return;
> }
>
> + inode_wait_for_pining(inode);
> state = inode->i_state;
> if (!drop) {
> WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..daf094fff5fe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb *kiocb,
> struct kiocb *kiocb_src,
> #define I_DONTCACHE (1 << 16)
> #define I_SYNC_QUEUED (1 << 17)
> #define I_PINNING_NETFS_WB (1 << 18)
> +#define __I_PINING 19
> +#define I_PINING (1 << __I_PINING)
>
> #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
>
> , which means that we will import a new inode state to solve the problem.
>
My non-maintainer $0,03 is as follows:
1. I_PINING is too generic of a name. I_LRU_PINNED or something else
indicating what this is for would be prudent
2. while not specific to this patch, the handling of i_state is too
accidental-breakage friendly. a full blown solution is way out of the
scope here, but something can be done to future-proof this work anyway.
To that end I would suggest:
1. inode_lru_pin() which appart from setting the flag includes:
BUG_ON(inode->i_state & (I_LRU_PINNED | I_FREEING | I_WILL_FREE)
2. inode_lru_unpin() which apart from unsetting the flag + wakeup includes:
BUG_ON(!(inode->i_state & I_LRU_PINNED))
3. inode_lru_wait_for_pinned()
However, a non-cosmetic remark is that at the spot inode_wait_for_pining
gets invoked none of the of the pinning-blocking flags may be set (to my
reading anyway). This is not the end of the world, but it does mean the
waiting routine will have to check stuff in a loop.
Names are not that important, the key is to keep the logic and
dependencies close by code-wise.
Powered by blists - more mailing lists