linux-kernel - Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support

Message-ID: <20171030020023.GG3666@dastard>
Date:   Mon, 30 Oct 2017 13:00:23 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Dan Williams <dan.j.williams@...il.com>
Cc:     Jan Kara <jack@...e.cz>, Christoph Hellwig <hch@....de>,
        Michal Hocko <mhocko@...e.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        "J. Bruce Fields" <bfields@...ldses.org>,
        linux-mm <linux-mm@...ck.org>, Paul Mackerras <paulus@...ba.org>,
        Sean Hefty <sean.hefty@...el.com>,
        Jeff Layton <jlayton@...chiereds.net>,
        Matthew Wilcox <mawilcox@...rosoft.com>,
        linux-rdma@...r.kernel.org, Michael Ellerman <mpe@...erman.id.au>,
        Jason Gunthorpe <jgunthorpe@...idianresearch.com>,
        Doug Ledford <dledford@...hat.com>,
        Hal Rosenstock <hal.rosenstock@...il.com>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Gerald Schaefer <gerald.schaefer@...ibm.com>,
        "linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-xfs@...r.kernel.org,
        Martin Schwidefsky <schwidefsky@...ibm.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less'
 support

On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote:
> On Thu, Oct 26, 2017 at 3:58 AM, Jan Kara <jack@...e.cz> wrote:
> > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> >> On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> >> > I'd like to brainstorm how we can do something better.
> >> >
> >> > How about:
> >> >
> >> > If we hit a page with an elevated refcount in truncate / hole punch
> >> > etc for a DAX file system we do not free the blocks in the file system,
> >> > but add it to the extent busy list.  We mark the page as delayed
> >> > free (e.g. page flag?) so that when it finally hits refcount zero we
> >> > call back into the file system to remove it from the busy list.
> >>
> >> Brainstorming some more:
> >>
> >> Given that on a DAX file there shouldn't be any long-term page
> >> references after we unmap it from the page table and don't allow
> >> get_user_pages calls why not wait for the references for all
> >> DAX pages to go away first?  E.g. if we find a DAX page in
> >> truncate_inode_pages_range that has an elevated refcount we set
> >> a new flag to prevent new references from showing up, and then
> >> simply wait for it to go away.  Instead of a busy wait we can
> >> do this through a few hashed waitqueues in dev_pagemap.  And in
> >> fact put_zone_device_page already gets called when putting the
> >> last page so we can handle the wakeup from there.
> >>
> >> In fact if we can't find a page flag for the stop new callers
> >> things we could probably come up with a way to do that through
> >> dev_pagemap somehow, but I'm not sure how efficient that would
> >> be.
> >
> > We were talking about this yesterday with Dan so some more brainstorming
> > from us. We can implement the solution with extent busy list in ext4
> > relatively easily - we already have such a list, similar to XFS.
> > There would be some modifications needed but nothing too complex. The
> > biggest downside of this solution I see is that it requires per-filesystem
> > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> > may have problems and ext2 definitely will need some modifications.
> > Invisible used blocks may be surprising to users at times although given
> > page refs should be relatively short term, that should not be a big issue.
> > But are we guaranteed page refs are short term? E.g. if someone creates
> > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs
> > can be rather long-term similarly as in RDMA case. Also freeing of blocks
> > on page reference drop is another async entry point into the filesystem
> > which could unpleasantly surprise us but I guess workqueues would solve
> > that reasonably fine.
> >
> > WRT waiting for page refs to be dropped before proceeding with truncate (or
> > punch hole for that matter - that case is even nastier since we don't have
> > i_size to guard us). What I like about this solution is that it is very
> > visible there's something unusual going on with the file being truncated /
> > punched and so problems are easier to diagnose / fix from the admin side.
> > So far we have guarded hole punching from concurrent faults (and
> > get_user_pages() does fault once you do unmap_mapping_range()) with
> > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> > obvious case Dan came up with is when GUP obtains ref to page A, then hole
> > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> > dropped, and then GUP blocks on trying to fault in another page.
> >
> > I think we cannot easily prevent new page references to be grabbed as you
> > write above since nobody expects stuff like get_page() to fail. But I
> > think that unmapping relevant pages and then preventing them to be faulted
> > in again is workable and stops GUP as well. The problem with that is though
> > what to do with page faults to such pages - you cannot just fail them for
> > hole punch, and you cannot easily allocate new blocks either. So we are
> > back at a situation where we need to detach blocks from the inode and then
> > wait for page refs to be dropped - so some form of busy extents. Am I
> > missing something?
> 
> Coming back to this since Dave has made clear that new locking to
> coordinate get_user_pages() is a no-go.
> 
> We can unmap to force new get_user_pages() attempts to block on the
> per-fs mmap lock, but if punch-hole finds any elevated pages it needs
> to drop the mmap lock and wait. We need this lock dropped to get
> around the problem that the driver will not start to drop page
> references until it has elevated the page references on all the pages
> in the I/O. If we need to drop the mmap lock that makes it impossible
> to coordinate this unlock/retry loop within truncate_inode_pages_range
> which would otherwise be the natural place to land this code.
> 
> Would it be palatable to unmap and drain dma in any path that needs to
> detach blocks from an inode? Something like the following that builds
> on dax_wait_dma() tried to achieve, but does not introduce a new lock
> for the fs to manage:
> 
> retry:
>     per_fs_mmap_lock(inode);
>     /* new page references cannot be established */
>     unmap_mapping_range(mapping, start, end);
>     if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) {
>         /* new page references can happen, so we need to start over */
>         per_fs_mmap_unlock(inode);
>         wait_for_page_idle(dax_page);
>         goto retry;
>     }
>     truncate_inode_pages_range(mapping, start, end);
>     per_fs_mmap_unlock(inode);

These retry loops you keep proposing are just bloody horrible.  They
are basically just a method for blocking an operation until whatever
condition is preventing the invalidation goes away. IMO, that's an
ugly solution no matter how much lipstick you dress it up with.

i.e. the blocking loops mean the user process is going to be blocked
for arbitrary lengths of time. That's not a solution, it's just
passing the buck - now the userspace developers need to work around
truncate/hole punch being randomly blocked for arbitrary lengths of
time.

The whole point of pushing this into the busy extent list is that it
doesn't require blocking operations. i.e. the re-use of the underlying
storage is simply delayed until notification that it is safe to
re-use comes along, but the extent removal operation doesn't get
blocked.

That's how we treat extents that require discard operations after
they have been freed - they remain in the busy list until the
discard IO completion signals "all done" and clears the busy extent.
Here we need to hold off clearing the extent until we get the "all
done" from the dax code.
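
As a rough illustration of "delay the re-use, don't block the free", here
is a user-space toy model. It is not XFS code: every type and function
name in it is made up for the example, and it ignores locking, extent
splitting and sorting entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model only; none of these are real XFS symbols. */
enum busy_reason {
	BUSY_DISCARD = 1 << 0,	/* waiting for discard IO completion */
	BUSY_DAX_DMA = 1 << 1,	/* waiting for DMA page refs to drop */
};

struct busy_extent {
	unsigned long	start;		/* first block of the freed range */
	unsigned long	len;		/* number of blocks */
	unsigned int	reasons;	/* bitmask of enum busy_reason */
};

#define MAX_BUSY 16

struct busy_list {
	struct busy_extent	ext[MAX_BUSY];
	size_t			nr;
};

/* Freeing never waits: the range is recorded as busy and we return. */
static bool busy_insert(struct busy_list *bl, unsigned long start,
			unsigned long len, unsigned int reasons)
{
	assert(len > 0 && reasons != 0);
	if (bl->nr == MAX_BUSY)
		return false;
	bl->ext[bl->nr++] = (struct busy_extent){ start, len, reasons };
	return true;
}

/* Allocator side: a range is reusable only if no busy extent overlaps it. */
static bool can_reuse(const struct busy_list *bl, unsigned long start,
		      unsigned long len)
{
	for (size_t i = 0; i < bl->nr; i++) {
		const struct busy_extent *e = &bl->ext[i];

		if (start < e->start + e->len && e->start < start + len)
			return false;
	}
	return true;
}

/* Completion side: drop one reason, remove the entry once none remain. */
static void busy_complete(struct busy_list *bl, unsigned long start,
			  unsigned int reason)
{
	for (size_t i = 0; i < bl->nr; i++) {
		if (bl->ext[i].start != start)
			continue;
		bl->ext[i].reasons &= ~reason;
		if (bl->ext[i].reasons == 0)
			bl->ext[i] = bl->ext[--bl->nr];	/* swap-remove */
		return;
	}
}
```

The point of the model is that busy_insert() returns immediately; only
can_reuse() is affected until busy_complete() is called for the last
outstanding reason.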

e.g. what needs to happen when trying to do the invalidation is
something like this (assuming invalidate_inode_pages2_range() will
actually fail on pages under DMA):

	flags = 0;
	if (IS_DAX()) {
		error = invalidate_inode_pages2_range();
		if (error == -EBUSY && dax_dma_busy_page())
			flags = EXTENT_BUSY_DAX; /* don't block, delay reuse */
		else
			truncate_pagecache(); /* blocking */
	} else {
		truncate_pagecache();
	}

That EXTENT_BUSY_DAX flag needs to be carried all the way through
xfs_free_extent -> xfs_extent_busy_insert(). That's probably the
most complex part of the patch.

This flag then prevents xfs_extent_busy_reuse() from allowing reuse
of the extent.

And in xfs_extent_busy_clear(), these extents need to be treated sort
of like discarded extents. On transaction commit callback, we need to
check if there are still busy daxdma pages over the extent range, and
if there are we leave it in the busy list, otherwise it can be cleared.
For everything that is left in the busy list, the dax dma code will
need to call back into the filesystem when that page is released, and
when the extent no longer has any dax dma busy pages left over it, it
can be cleared from the list.
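
A minimal sketch of that two-sided clearing rule, again as a user-space
toy with invented names (the real thing would hang off the busy extent
tree and need proper locking):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one busy extent; not a real XFS structure. */
struct dax_busy_extent {
	unsigned int	pinned_pages;	/* pages in range still under DMA */
	bool		committed;	/* freeing transaction has committed */
	bool		on_busy_list;	/* still blocks reuse of the range */
};

/* Clear only when the commit has happened AND no page is still pinned. */
static void maybe_clear(struct dax_busy_extent *e)
{
	if (e->committed && e->pinned_pages == 0)
		e->on_busy_list = false;
}

/* Stand-in for the transaction commit callback. */
static void extent_commit_done(struct dax_busy_extent *e)
{
	e->committed = true;
	maybe_clear(e);
}

/* Stand-in for the callback from the dax code when a page is released. */
static void extent_page_released(struct dax_busy_extent *e)
{
	assert(e->pinned_pages > 0);
	e->pinned_pages--;
	maybe_clear(e);
}
```

Whichever event arrives last (commit or final page release) is the one
that takes the extent off the list; neither side has to wait for the
other.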

Once we have the dax code calling back into the filesystem when the
problematic daxdma pages are released, everything else should be
relatively straightforward...

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com