[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181220090753.GB27845@quack2.suse.cz>
Date: Thu, 20 Dec 2018 10:07:53 +0100
From: Jan Kara <jack@...e.cz>
To: Dave Chinner <david@...morbit.com>
Cc: Jan Kara <jack@...e.cz>, Jason Gunthorpe <jgg@...pe.ca>,
Jerome Glisse <jglisse@...hat.com>,
John Hubbard <jhubbard@...dia.com>,
Matthew Wilcox <willy@...radead.org>,
Dan Williams <dan.j.williams@...el.com>,
John Hubbard <john.hubbard@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux MM <linux-mm@...ck.org>, tom@...pey.com,
Al Viro <viro@...iv.linux.org.uk>, benve@...co.com,
Christoph Hellwig <hch@...radead.org>,
Christopher Lameter <cl@...ux.com>,
"Dalessandro, Dennis" <dennis.dalessandro@...el.com>,
Doug Ledford <dledford@...hat.com>,
Michal Hocko <mhocko@...nel.org>, mike.marciniszyn@...el.com,
rcampbell@...dia.com,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Thu 20-12-18 09:33:12, Dave Chinner wrote:
> On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> > On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > > >
> > > > > Essentially, what we are talking about is how to handle broken
> > > > > hardware. I say we should just brun it with napalm and thermite
> > > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > > the underlying storage doesn't already require it.
> > > >
> > > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > > then just do it.
> > >
> > > O_DIRECT IO *isn't the problem*.
> >
> > That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> > than the problem with RDMA but currently O_DIRECT IO can crash your machine
> > or corrupt data the same way RDMA can.
>
> It's not O_DIRECT - it's a ""transient page pin". Yes, there are
> problems with that right now, but as we've discussed the issues can
> be avoided by:
>
> a) stable pages always blocking in ->page_mkwrite;
> b) blocking in write_cache_pages() on an elevated map count
> when WB_SYNC_ALL is set; and
> c) blocking in truncate_pagecache() on an elevated map
> count.
>
> That prevents:
> a) gup pinning a page that is currently under writeback and
> modifying it while IO is in flight;
> b) a dirty page being written back while it is pinned by
> GUP, thereby turning it clean before the gup reference calls
> set_page_dirty() on DMA completion; and
This is not prevented by what you wrote above as currently GUP does not
increase page->_mapcount. Currently, there's no way to distinguish GUP page
reference from any other page reference - GUP simply does get_page() - and
big part of this thread as I see it is exactly about how to introduce this
distinction and how to convert all GUP users to the new convention safely
(as currently they just pass struct page * pointers around and eventually
do put_page() on them). Increasing page->_mapcount in GUP and trying to
deduce the pin count from that is one option Jerome suggested. At this
point I'm not 100% sure this is going to work but we'll see.
> c) truncate/hole punch for pulling the page out from under
> the gup operation that is ongoing.
>
> This is an adequate solution for a short term transient pins. It
> doesn't break fsync(), it doesn't change how truncate works and it
> fixes the problem where a mapped file is the buffer for an O_DIRECT
> IO rather than the open fd and that buffer file gets truncated.
> IOWs, transient pins (and hence O_DIRECT) is not really the problem
> here.
For now let's assume that the mechanism how to detect page pinned by GUP is
actually somehow solved and we have already page_pinned() implemented. Then
what you suggest can actually create a deadlock AFAICS:
Process 1: Process 2:
fsync("file")
/* Evil memory buffer with page order reversed */
addr1 = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, "file", 4096);
addr2 = mmap(addr1+4096, 4096, PROT_WRITE, MAP_SHARED, "file", 0);
/* Fault in pages */
*addr1 = 0;
*addr2 = 0;
adds page with index 0
to bio
fd = open("file2", O_RDWR | O_DIRECT);
read(fd, addr1, 8192)
-> eventually gets to iov_iter_get_pages() and then to
get_user_pages_fast().
-> pins "file" page with index 1
blocks on pin for
page with index 1
-> blocks in PageWriteback for page with index 0
Possibility of deadlocks like this is why I've decided it will be easier
to just bounce the page for writeback we cannot avoid rather than block the
writeback...
> The problem with this is that blocking on elevated map count does
> not work for long term pins (i.e. gup_longterm()) which are defined
> as:
>
> * "longterm" == userspace controlled elevated page count lifetime.
> * Contrast this to iov_iter_get_pages() usages which are transient.
>
> It's the "userspace controlled" part of the long term gup pin that
> is the problem we need to solve. If we treat them the same as a
> transient pin, then this leads to fsync() and truncate either
> blocking for a long time waiting for userspace to drop it's gup
> reference, or having to be failed with something like EBUSY or
> EAGAIN.
I agree. "userspace controlled" pins are another big problem to solve.
> This is the problem revokable file layout leases solve. The NFS
> server is already using this for revoking delegations from remote
> clients. Userspace holding long term GUP references is essentially
> the same thing - it's a delegation of file ownership to userspace
> that the filesystem must be able to revoke when it needs to run
> internal and/or 3rd-party requested operations on that delegated
> file.
>
> If the hardware supports page faults, then we can further optimise
> the long term pin case to relax stable page requirements and allow
> page cleaning to occur while there are long term pins. In this case,
> the hardware will write-fault the clean pages appropriately before
> DMA is initiated, and hence avoid the need for data integrity
> operations like fsync() to trigger lease revocation. However,
> truncate/hole punch still requires lease revocation to work sanely,
> especially when we consider DAX *must* ensure there are no remaining
> references to the physical pmem page after the space has been freed.
>
> i.e. conflating the transient and long term gup pins as the same
> problem doesn't help anyone. If we fix the short term pin problems,
> then the long term pin problem become tractable by adding a layer
> over the top (i.e. hardware page fault capability and/or file lease
> requirements). Existing apps and hardware will continue to work -
> external operations on the pinned file will simply hang rather than
> causing corruption or kernel crashes. New (or updated) applications
> will play nicely with lease revocation and at that point the "long
> term pin" basically becomes a transient pin where the unpin latency
> is determined by how quickly the app responds to the lease
> revocation. And page fault capable hardware will reduce the
> occurrence of lease revocations due to data writeback/integrity
> operations and behave almost identically to cpu-based mmap accesses
> to file backed pages.
Agreed. I think we are on the same page wrt this. Just at this point I'm
trying to solve the "transient pin" problem...
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists