lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <3a6d4838-7f3b-8790-81dc-ec512930dabf@huawei.com>
Date: Tue, 6 Aug 2024 11:48:47 +0800
From: Zhihao Cheng <chengzhihao1@...wei.com>
To: Mateusz Guzik <mjguzik@...il.com>, Zhihao Cheng
	<chengzhihao@...weicloud.com>
CC: <viro@...iv.linux.org.uk>, <brauner@...nel.org>, <tahsin@...gle.com>,
	<error27@...il.com>, <tytso@....edu>, <rydercoding@...mail.com>,
	<jack@...e.cz>, <hch@...radead.org>, <andreas.dilger@...el.com>,
	<richard@....at>, <linux-fsdevel@...r.kernel.org>,
	<linux-ext4@...r.kernel.org>, <linux-mtd@...ts.infradead.org>,
	<linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
	<yangerkun@...wei.com>, <wangzhaolong1@...wei.com>
Subject: Re: [PATCH] vfs: Don't evict inode under the inode lru traversing
 context

在 2024/8/6 0:10, Mateusz Guzik 写道:
> On Mon, Aug 5, 2024 at 3:24 AM Zhihao Cheng <chengzhihao@...weicloud.com> wrote:
>>
>> From: Zhihao Cheng <chengzhihao1@...wei.com>
>>
>> The inode reclaiming process(See function prune_icache_sb) collects all
>> reclaimable inodes and mark them with I_FREEING flag at first, at that
>> time, other processes will be stuck if they try getting these inodes
>> (See function find_inode_fast), then the reclaiming process destroy the
>> inodes by function dispose_list(). Some filesystems(eg. ext4 with
>> ea_inode feature, ubifs with xattr) may do inode lookup in the inode
>> evicting callback function, if the inode lookup is operated under the
>> inode lru traversing context, deadlock problems may happen.
>>
[...]
> 
> I only have some tidy-ups with stuff I neglected to mention when
> typing up my proposal.
> 
>> ---
>>   fs/inode.c         | 37 +++++++++++++++++++++++++++++++++++--
>>   include/linux/fs.h |  5 +++++
>>   2 files changed, 40 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 86670941884b..f1c6e8072f39 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode)
>>                  this_cpu_dec(nr_unused);
>>   }
>>
>> +static void inode_lru_isolating(struct inode *inode)
>> +{
> 
>          lockdep_assert_held(&inode->i_lock);

Adopt, will change in v2.
> 
>> +       BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
>> +       inode->i_state |= I_LRU_ISOLATING;
>> +}
>> +
>> +static void inode_lru_finish_isolating(struct inode *inode)
>> +{
>> +       spin_lock(&inode->i_lock);
>> +       BUG_ON(!(inode->i_state & I_LRU_ISOLATING));
>> +       inode->i_state &= ~I_LRU_ISOLATING;
>> +       wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
>> +       spin_unlock(&inode->i_lock);
>> +}
>> +
>> +static void inode_wait_for_lru_isolating(struct inode *inode)
>> +{
>> +       DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
>> +       wait_queue_head_t *wqh;
>> +
> 
> Top of evict() asserts on I_FREEING being set, used to decide it's not
> legit to pin above. This dependency can be documented it in the
> routine as well:
> 
> BUG_ON(!(inode->i_state & I_FREEING));

Sorry, I don't understand it, do you mean add the 
'BUG_ON(!(inode->i_state & I_FREEING))' in 
inode_wait_for_lru_isolating()? Just like you talked, evict() already 
has one. So far, the inode_wait_for_lru_isolating() is called only in 
evict(), we can add the BUG_ON if it is called more than one places in 
future.>
>> +       spin_lock(&inode->i_lock);
> 
> This lock acquire is avoidable, which is always nice to do for
> single-threaded perf. Probably can be also done for the writeback code
> below. Maybe I'll massage it myself after the patch lands.

Yes, maybe inode_wait_for_writeback() and inode_wait_for_lru_isolating() 
can be lockless, and a smp_rmb() is needed, I think.
> 
>> +       wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
>> +       while (inode->i_state & I_LRU_ISOLATING) {
>> +               spin_unlock(&inode->i_lock);
>> +               __wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
>> +               spin_lock(&inode->i_lock);
>> +       }
>> +       spin_unlock(&inode->i_lock);
>> +}
> 
> So new arrivals *are* blocked by this point thanks to I_FREEING being
> set on entry to evict(). This also means the flag can show up at most
> once.
> 
> Thus instead of looping this should merely go to sleep once and assert
> the I_LRU_ISOLATING flag is no longer set after waking up.

Good improvement, will change in v2.
> 
>> +
>>   /**
>>    * inode_sb_list_add - add inode to the superblock list of inodes
>>    * @inode: inode to add
>> @@ -657,6 +687,9 @@ static void evict(struct inode *inode)
>>
>>          inode_sb_list_del(inode);
>>
>> +       /* Wait for LRU isolating to finish. */
> 
> I don't think this comment adds anything given the name of the func.

Adopt, will be deleted in v2.
> 
>> +       inode_wait_for_lru_isolating(inode);
>> +
>>          /*
>>           * Wait for flusher thread to be done with the inode so that filesystem
>>           * does not start destroying it while writeback is still running. Since
>> @@ -855,7 +888,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>>           * be under pressure before the cache inside the highmem zone.
>>           */
>>          if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
>> -               __iget(inode);
>> +               inode_lru_isolating(inode);
>>                  spin_unlock(&inode->i_lock);
>>                  spin_unlock(lru_lock);
>>                  if (remove_inode_buffers(inode)) {
>> @@ -867,7 +900,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>>                                  __count_vm_events(PGINODESTEAL, reap);
>>                          mm_account_reclaimed_pages(reap);
>>                  }
>> -               iput(inode);
>> +               inode_lru_finish_isolating(inode);
>>                  spin_lock(lru_lock);
>>                  return LRU_RETRY;
>>          }
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index fd34b5755c0b..fb0426f349fc 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2392,6 +2392,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>>    *
>>    * I_PINNING_FSCACHE_WB        Inode is pinning an fscache object for writeback.
>>    *
>> + * I_LRU_ISOLATING     Inode is pinned being isolated from LRU without holding
>> + *                     i_count.
>> + *
>>    * Q: What is the difference between I_WILL_FREE and I_FREEING?
>>    */
>>   #define I_DIRTY_SYNC           (1 << 0)
>> @@ -2415,6 +2418,8 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
>>   #define I_DONTCACHE            (1 << 16)
>>   #define I_SYNC_QUEUED          (1 << 17)
>>   #define I_PINNING_NETFS_WB     (1 << 18)
>> +#define __I_LRU_ISOLATING      19
>> +#define I_LRU_ISOLATING                (1 << __I_LRU_ISOLATING)
>>
>>   #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>>   #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
>> --
>> 2.39.2
>>
> 
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ