Message-ID: <20181219223312.GP6311@dastard>
Date: Thu, 20 Dec 2018 09:33:12 +1100
From: Dave Chinner <david@...morbit.com>
To: Jan Kara <jack@...e.cz>
Cc: Jason Gunthorpe <jgg@...pe.ca>, Jerome Glisse <jglisse@...hat.com>,
John Hubbard <jhubbard@...dia.com>,
Matthew Wilcox <willy@...radead.org>,
Dan Williams <dan.j.williams@...el.com>,
John Hubbard <john.hubbard@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux MM <linux-mm@...ck.org>, tom@...pey.com,
Al Viro <viro@...iv.linux.org.uk>, benve@...co.com,
Christoph Hellwig <hch@...radead.org>,
Christopher Lameter <cl@...ux.com>,
"Dalessandro, Dennis" <dennis.dalessandro@...el.com>,
Doug Ledford <dledford@...hat.com>,
Michal Hocko <mhocko@...nel.org>, mike.marciniszyn@...el.com,
rcampbell@...dia.com,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote:
> On Wed 19-12-18 21:28:25, Dave Chinner wrote:
> > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote:
> > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote:
> > >
> > > > Essentially, what we are talking about is how to handle broken
> > > > hardware. I say we should just burn it with napalm and thermite
> > > > (i.e. taint the kernel with "unsupportable hardware") and force
> > > > wait_for_stable_page() to trigger when there are GUP mappings if
> > > > the underlying storage doesn't already require it.
> > >
> > > If you want to ban O_DIRECT/etc from writing to file backed pages,
> > > then just do it.
> >
> > O_DIRECT IO *isn't the problem*.
>
> That is not true. O_DIRECT IO is a problem. In some aspects it is easier
> than the problem with RDMA but currently O_DIRECT IO can crash your machine
> or corrupt data the same way RDMA can.

It's not O_DIRECT - it's a "transient page pin". Yes, there are
problems with that right now, but as we've discussed the issues can
be avoided by:

    a) stable pages always blocking in ->page_mkwrite;
    b) blocking in write_cache_pages() on an elevated map count
       when WB_SYNC_ALL is set; and
    c) blocking in truncate_pagecache() on an elevated map
       count.

That prevents:

    a) gup pinning a page that is currently under writeback and
       modifying it while IO is in flight;
    b) a dirty page being written back while it is pinned by
       GUP, thereby turning it clean before the gup reference calls
       set_page_dirty() on DMA completion; and
    c) truncate/hole punch pulling the page out from under
       the gup operation that is ongoing.

This is an adequate solution for short term transient pins. It
doesn't break fsync(), it doesn't change how truncate works and it
fixes the problem where a mapped file is the buffer for an O_DIRECT
IO rather than the open fd and that buffer file gets truncated.
IOWs, transient pins (and hence O_DIRECT) are not really the problem
here.
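
As a rough illustration of the three blocking rules above, here is a
minimal userspace toy model. Everything in it is hypothetical: struct
toy_page and the toy_*() helpers are not kernel APIs, and the "elevated
map count" is approximated by a plain pin counter. It only encodes
which operation would have to wait in which state.

```c
#include <stdbool.h>

/* Hypothetical toy model of a page cache page; not a kernel structure. */
struct toy_page {
    int  pin_count;        /* transient GUP pins currently held (assumed counter) */
    bool under_writeback;  /* IO is in flight on this page */
};

/* Rule (a): with stable pages, a write fault waits while IO is in flight. */
bool toy_mkwrite_must_wait(const struct toy_page *p)
{
    return p->under_writeback;
}

/*
 * Rule (b): data integrity writeback (the WB_SYNC_ALL case) waits for
 * pins to drop. The text says nothing about background writeback, so
 * this model simply does not block it.
 */
bool toy_writeback_must_wait(const struct toy_page *p, bool sync_all)
{
    return sync_all && p->pin_count > 0;
}

/* Rule (c): truncate waits until nothing pins the page. */
bool toy_truncate_must_wait(const struct toy_page *p)
{
    return p->pin_count > 0;
}
```

With one pin held, sync writeback and truncate both report that they
must wait; once the pin is dropped, neither does. For a transient pin
that wait is short, which is why blocking is acceptable here.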

The problem with this is that blocking on elevated map count does
not work for long term pins (i.e. gup_longterm()) which are defined
as:

    * "longterm" == userspace controlled elevated page count lifetime.
    * Contrast this to iov_iter_get_pages() usages which are transient.

It's the "userspace controlled" part of the long term gup pin that
is the problem we need to solve. If we treat them the same as a
transient pin, then this leads to fsync() and truncate either
blocking for a long time waiting for userspace to drop its gup
reference, or having to be failed with something like EBUSY or
EAGAIN.

This is the problem revocable file layout leases solve. The NFS
server is already using this for revoking delegations from remote
clients. Userspace holding long term GUP references is essentially
the same thing - it's a delegation of file ownership to userspace
that the filesystem must be able to revoke when it needs to run
internal and/or 3rd-party requested operations on that delegated
file.
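
The revocation idea can be sketched as a small userspace toy model.
This is an assumption-laden illustration, not the kernel's lease
implementation: struct toy_lease, toy_release_all() and
toy_break_lease() are made-up names. The holder of long term pins
registers a handler, and the filesystem calls it before running an
operation that conflicts with the pins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lease covering a set of long term pins on one file. */
struct toy_lease {
    int  pinned_pages;                          /* long term pins held under the lease */
    void (*on_break)(struct toy_lease *lease);  /* holder's revocation handler, may be NULL */
};

/* A cooperative holder drops every pin as soon as it is asked to. */
void toy_release_all(struct toy_lease *lease)
{
    lease->pinned_pages = 0;
}

/*
 * Filesystem side: call before truncate, hole punch and similar.
 * Returns true if the operation may proceed now, false if it still
 * has to wait for the holder to let go.
 */
bool toy_break_lease(struct toy_lease *lease)
{
    if (lease->pinned_pages > 0 && lease->on_break != NULL)
        lease->on_break(lease);
    return lease->pinned_pages == 0;
}
```

A holder that installs toy_release_all() lets the operation proceed
immediately. A holder with no handler leaves the pins in place, so the
caller has to keep waiting.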

If the hardware supports page faults, then we can further optimise
the long term pin case to relax stable page requirements and allow
page cleaning to occur while there are long term pins. In this case,
the hardware will write-fault the clean pages appropriately before
DMA is initiated, and hence avoid the need for data integrity
operations like fsync() to trigger lease revocation. However,
truncate/hole punch still requires lease revocation to work sanely,
especially when we consider that DAX *must* ensure there are no
remaining references to the physical pmem page after the space has
been freed.
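
The resulting policy for a file with long term pins can be written
down as a small decision function. This is only a sketch of the rule
described above; the enum and function names are hypothetical and do
not correspond to any kernel interface.

```c
#include <stdbool.h>

/* Hypothetical operation types that can conflict with a long term pin. */
enum toy_op { TOY_OP_FSYNC, TOY_OP_TRUNCATE, TOY_OP_HOLE_PUNCH };

/*
 * Does this operation have to revoke the lease held on a file with
 * long term pins? hw_page_faults says whether the pinning hardware
 * can take page faults on the pinned range.
 */
bool toy_needs_lease_revocation(enum toy_op op, bool hw_page_faults)
{
    switch (op) {
    case TOY_OP_FSYNC:
        /* Fault-capable hardware re-dirties pages itself via write faults. */
        return !hw_page_faults;
    case TOY_OP_TRUNCATE:
    case TOY_OP_HOLE_PUNCH:
        /* Freed space must have no remaining references, faults or not. */
        return true;
    }
    return true; /* unknown operation: take the conservative answer */
}
```

So page fault support only removes the writeback/integrity case;
anything that frees space still goes through revocation.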

i.e. conflating the transient and long term gup pins as the same
problem doesn't help anyone. If we fix the short term pin problems,
then the long term pin problem becomes tractable by adding a layer
over the top (i.e. hardware page fault capability and/or file lease
requirements). Existing apps and hardware will continue to work -
external operations on the pinned file will simply hang rather than
causing corruption or kernel crashes. New (or updated) applications
will play nicely with lease revocation and at that point the "long
term pin" basically becomes a transient pin where the unpin latency
is determined by how quickly the app responds to the lease
revocation. And page fault capable hardware will reduce the
occurrence of lease revocations due to data writeback/integrity
operations and behave almost identically to cpu-based mmap accesses
to file backed pages.

Cheers,

Dave.
--
Dave Chinner
david@...morbit.com