linux-kernel - Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170619132107.GG11993@dastard>
Date:   Mon, 19 Jun 2017 23:21:07 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Dan Williams <dan.j.williams@...el.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        andy.rudoff@...el.com, Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.cz>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>,
        Linux API <linux-api@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Jeff Moyer <jmoyer@...hat.com>,
        Linux FS Devel <linux-fsdevel@...r.kernel.org>,
        Christoph Hellwig <hch@....de>
Subject: Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for
 byte-addressable updates to pmem

On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@...el.com> wrote:
> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@...nel.org> wrote:
> >> My other objection is that the syscall intentionally leaks a reference
> >> to the file.  This means it needs overflow protection and it probably
> >> shouldn't ever be allowed to use it without privilege.
> >
> > We only hold the one reference while S_DAXFILE is set, so I think the
> > protection is there, and per Dave's original proposal this requires
> > CAP_LINUX_IMMUTABLE.
> >
> >> Why can't the underlying issue be easily fixed, though?  Could
> >> .page_mkwrite just make sure that metadata is synced when the FS uses
> >> DAX?
> >
> > Yes, it most definitely could and that idea has been floated.
> >
> >> On a DAX fs, syncing metadata should be extremely fast.

<sigh>

This again....

Persistent memory means the *I/O* is fast. It does not mean that
*complex filesystem operations* are fast.

Don't forget that there's an shitload of CPU that gets burnt to make
sure that the metadata is synced correctly. Do that /synchronously/
on *every* write page fault (which, BTW, modify mtime, so will
always have dirty metadata to sync) and now you have a serious
performance problem with your "fast" DAX access method.

And that's before we even consider all the problems with running
sync operations in page fault context....

> >> This
> >> could be conditioned on an madvise or mmap flag if performance might
> >> be an issue.  As far as I know, this change alone should be
> >> sufficient.
> >
> > The hang up is that it requires per-fs enabling as it needs to be
> > careful to manage mmap_sem vs fs journal locks for example. I know the
> > in-development NOVA [1] filesystem is planning to support this out of
> > the gate. ext4 would be open to implementing it, but I think xfs is
> > cold on the idea. Christoph originally proposed it here [2], before
> > Dave went on to propose immutable semantics.
> 
> Hmm.  Given a choice between a very clean API that works without
> privilege but is awkward to implement on XFS and an awkward-to-use
> API, I'd personally choose the former.

Yup, you have the choice of a clean kernel API that will be
substantially slower than the existing "dirty page" tracking and
having the app run fsync() when necessary, or having to do a little
more work in a library routine that preallocates a file and sets a
flag on it?

The apps will use the library API, not the kernel API, so who really
cares if there's a few steps to setting up the file state
appropriately?

> Dave, even with the lock ordering issue, couldn't XFS implement
> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
> 
> if (metadata is dirty) {
>   up_write(&mmap_sem);
>   sync the metadata;
>   down_write(&mmap_sem);
>   return 0;  /* retry the fault */
> } else {
>   return whatever success code;
> }

How do you know that there is dependent filesystem metadata that
needs syncing at a level that you can safely manipulate the
mmap_sem? And how, exactly, do you do this without races? It'd be
trivial to DOS such retryable DAX faults simply by touching the file
in a tight loop in a separate process...

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com