Message-ID: <6329ee04c9272_2a6ded294bf@dwillia2-xfh.jf.intel.com.notmuch>
Date: Tue, 20 Sep 2022 09:44:52 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Dave Chinner <david@...morbit.com>,
Dan Williams <dan.j.williams@...el.com>
CC: <akpm@...ux-foundation.org>, Matthew Wilcox <willy@...radead.org>,
"Jan Kara" <jack@...e.cz>, "Darrick J. Wong" <djwong@...nel.org>,
Jason Gunthorpe <jgg@...dia.com>,
Christoph Hellwig <hch@....de>,
John Hubbard <jhubbard@...dia.com>,
<linux-fsdevel@...r.kernel.org>, <nvdimm@...ts.linux.dev>,
<linux-xfs@...r.kernel.org>, <linux-mm@...ck.org>,
<linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode
eviction path
Dave Chinner wrote:
> On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> > Dave Chinner wrote:
> > > On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> > > > In preparation for moving DAX pages to be 0-based rather than 1-based
> > > > for the idle refcount, the fsdax core wants to have all mappings in a
> > > > "zapped" state before truncate. For typical pages this happens naturally
> > > > via unmap_mapping_range(); for DAX pages some help is needed to record
> > > > this state in the 'struct address_space' of the inode(s) where the page
> > > > is mapped.
> > > >
> > > > That "zapped" state is recorded in DAX entries as a side effect of
> > > > xfs_break_layouts(). Arrange for it to be called before all truncation
> > > > events, which already happens for truncate() and PUNCH_HOLE but not for
> > > > truncate_inode_pages_final(). Arrange for xfs_break_layouts() to be
> > > > called before truncate_inode_pages_final() as well.
> ....
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 9ac59814bbb6..ebb4a6eba3fc 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
> > > >  	return generic_drop_inode(inode);
> > > >  }
> > > >
> > > > +STATIC void
> > > > +xfs_fs_evict_inode(
> > > > +	struct inode		*inode)
> > > > +{
> > > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > > +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > +	long			error;
> > > > +
> > > > +	xfs_ilock(ip, iolock);
> > >
> > > I'm guessing you never ran this through lockdep.
> >
> > I always run with lockdep enabled in my development kernels, but maybe my
> > testing was insufficient? Somewhat moot with your concerns below...
>
> I'm guessing your testing doesn't generate inode cache pressure and
> then have direct memory reclaim reclaiming inodes. e.g. on a directory inode
> this will trigger lockdep immediately because readdir locks with
> XFS_IOLOCK_SHARED and then does GFP_KERNEL memory reclaim. If we try
> to take XFS_IOLOCK_EXCL from memory reclaim of directory inodes,
> lockdep will then shout from the rooftops...
Got it.
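To restate it for my own notes, the inversion is roughly this (illustrative
sketch only; path_a()/path_b() are made-up stand-ins for readdir and the new
evict path, not real call chains):

/* Path A: a readdir-style path that allocates while holding the IOLOCK */
static int path_a(struct xfs_inode *ip)
{
	void *buf;

	xfs_ilock(ip, XFS_IOLOCK_SHARED);
	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);	/* may enter direct reclaim */
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	if (!buf)
		return -ENOMEM;
	kfree(buf);
	return 0;
}

/* Path B: the proposed evict_inode path, reachable from that same reclaim */
static void path_b(struct xfs_inode *ip)
{
	/* reclaim now waits on an IOLOCK that may be held by path A */
	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
}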
>
> > > > +
> > > > +	truncate_inode_pages_final(&inode->i_data);
> > > > +	clear_inode(inode);
> > > > +
> > > > +	xfs_iunlock(ip, iolock);
> > > > +}
> > >
> > > That all said, this really looks like a bit of a band-aid.
> >
> > It definitely is since DAX is in this transitory state between doing
> > some activities page-less and others with page metadata. If DAX was
> > fully committed to behaving like a typical page then
> > unmap_mapping_range() would have already satisfied this reference
> > counting situation.
> >
> > > I can't work out why we would ever have an actual layout lease
> > > here that needs breaking given they are file based and active files
> > > hold a reference to the inode. If we ever break that, then I suspect
> > > this change will cause major problems for anyone using pNFS with XFS
> > > as xfs_break_layouts() can end up waiting for NFS delegation
> > > revocation. This is something we should never be doing in inode
> > > eviction/memory reclaim.
> > >
> > > Hence I have to ask why this lease break is being done
> > > unconditionally for all inodes, instead of only calling
> > > xfs_break_dax_layouts() directly on DAX enabled regular files? I
> > > also wonder what exciting new system deadlocks this will create
> > > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > > dax mappings going away. If that DAX mapping reclaim requires memory
> > > allocations.....
> >
> > There should be no memory allocations in the DAX mapping reclaim path.
> > Also, the page pins it waits for are precluded from being GUP_LONGTERM.
>
> So what if the task that holds the pin needs a memory allocation before
> it can unpin the page to allow direct inode reclaim to make progress?
No, it couldn't, and I realize now that GUP_LONGTERM has nothing to do
with this hang, since any path that takes a DAX page pin and then does a
GFP_KERNEL allocation could run afoul of this need to wait.
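In other words, the hang I am now worried about looks something like this
(purely an illustrative sequence, not code from the series):

/*
 * Task A: pins a DAX page, then allocates:
 *
 *	pin_user_pages(...);			// DAX page refcount elevated
 *	kmalloc(size, GFP_KERNEL);		// enters direct reclaim
 *
 * Direct reclaim (from that same allocation):
 *
 *	evict_inode()
 *	  xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL)
 *	    // waits for the DAX page refcount to drop, which cannot
 *	    // happen until Task A's allocation completes -> deadlock
 */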
So this has me looking at invalidate_inodes() and iput_final(), where I
do not see the same reclaim entanglement, and thinking that DAX has the
unique requirement to ensure that no access to a page outlives the
hosting inode.
Not that I need to tell you, but to get my own thinking straight:
compare this to the typical page cache, where a pinner can keep a pinned
page-cache page for as long as it wants, even after the page has been
truncated. DAX needs to make sure that truncate_inode_pages() ceases all
access to the page synchronously with the truncate. The typical page
cache ensures that the next mapping of the file gets a new page if the
page previously pinned at that offset is still in use; DAX cannot offer
that, because the same page that was previously pinned is always reused.
So I think this means something like this:
diff --git a/fs/inode.c b/fs/inode.c
index 6462276dfdf0..ab16772b9a8d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -784,6 +784,11 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 			continue;
 		}
 
+		if (dax_inode_busy(inode)) {
+			busy = 1;
+			continue;
+		}
+
 		inode->i_state |= I_FREEING;
 		inode_lru_list_del(inode);
 		spin_unlock(&inode->i_lock);
@@ -1733,6 +1738,8 @@ static void iput_final(struct inode *inode)
 		spin_unlock(&inode->i_lock);
 
 		write_inode_now(inode, 1);
+		if (IS_DAX(inode))
+			dax_break_layouts(inode);
 
 		spin_lock(&inode->i_lock);
 		state = inode->i_state;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..e4a74ab310b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3028,8 +3028,20 @@ extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
 extern int inode_needs_sync(struct inode *inode);
 extern int generic_delete_inode(struct inode *inode);
+
+static inline bool dax_inode_busy(struct inode *inode)
+{
+	if (!IS_DAX(inode))
+		return false;
+
+	return dax_zap_pages(inode) != NULL;
+}
+
 static inline int generic_drop_inode(struct inode *inode)
 {
+	if (dax_inode_busy(inode))
+		return 0;
+
 	return !inode->i_nlink || inode_unhashed(inode);
 }
 extern void d_mark_dontcache(struct inode *inode);
...where generic code skips over dax-inodes with pinned pages.
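The dax_break_layouts() above would be roughly the same wait loop that
ext4_break_layouts() / xfs_break_dax_layouts() implement today, just without
any filesystem locks. Something like this sketch, where dax_zap_pages() is the
helper this series adds and the invalidate_lock handling around the sleep is
elided:

static int dax_break_layouts(struct inode *inode)
{
	struct page *page;
	int error;

	do {
		page = dax_zap_pages(inode);
		if (!page)
			return 0;
		/*
		 * Idle today means a refcount of 1; once DAX page
		 * refcounts go 0-based this becomes a wait for 0. The
		 * real version also needs to drop/retake
		 * mapping->invalidate_lock around the sleep, like
		 * ext4_wait_dax_page() does.
		 */
		error = ___wait_var_event(&page->_refcount,
				atomic_read(&page->_refcount) == 1,
				TASK_INTERRUPTIBLE, 0, 0, schedule());
	} while (error == 0);

	return error;
}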
>
> > > /me looks deeper into the dax_layout_busy_page() stuff and realises
> > > that both ext4 and XFS implementations of ext4_break_layouts() and
> > > xfs_break_dax_layouts() are actually identical.
> > >
> > > That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
> > > XFS_MMAPLOCK_EXCL) operate on exactly the same
> > > inode->i_mapping->invalidate_lock. Hence the implementations in ext4
> > > and XFS are both functionally identical.
> >
> > I assume you mean for the purposes of this "final" break, since
> > xfs_file_fallocate() holds XFS_IOLOCK_EXCL over xfs_break_layouts().
>
> No, I'm just looking at the two *dax* functions - we don't care what
> locks xfs_break_layouts() requires - dax mapping manipulation is
> covered by the mapping->invalidate_lock and not the inode->i_rwsem.
> This is explicitly documented in the code by the asserts in both
> ext4_break_layouts() and xfs_break_dax_layouts().
>
> XFS holds the inode->i_rwsem over xfs_break_layouts() because we
> have to break *file layout leases* from there, too. These are
> serialised by the inode->i_rwsem, not the mapping->invalidate_lock.
Got it, will make generic helpers for the scenario where only
dax-break-layouts needs to be performed.
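i.e. something like a single fs/dax.c entry point that only takes
mapping->invalidate_lock, callable from both filesystems and from the generic
eviction paths above (name and shape provisional):

void dax_break_layouts_final(struct inode *inode)
{
	if (!IS_DAX(inode))
		return;

	filemap_invalidate_lock(inode->i_mapping);
	dax_break_layouts(inode);	/* the wait loop sketched above */
	filemap_invalidate_unlock(inode->i_mapping);
}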