Date:   Thu, 22 Jun 2017 10:02:35 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Dan Williams <dan.j.williams@...el.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        andy.rudoff@...el.com, Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.cz>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>,
        Linux API <linux-api@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Jeff Moyer <jmoyer@...hat.com>,
        Linux FS Devel <linux-fsdevel@...r.kernel.org>,
        Christoph Hellwig <hch@....de>
Subject: Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for
 byte-addressable updates to pmem

On Tue, Jun 20, 2017 at 10:18:24PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@...morbit.com> wrote:
> >> A per-inode
> >> count of the number of live DAX mappings or of the number of struct
> >> file instances that have requested DAX would work here.
> >
> > What purpose does this serve? The reflink invalidates all the
> > existing mappings, so the next write access causes a fault and then
> > page_mkwrite is called and the shared extent will get COWed....
> 
> The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
> right), except that IMO an API that doesn't involve making a change to
> an inode that sticks around would be nice.  The inode flag has the
> unfortunate property that, if two different programs each try to set
> the flag, mmap, write, and clear the flag, they'll stomp on each other
> and risk data corruption.
> 
> I admit I'm now thoroughly confused as to exactly what XFS does here
> -- does FS_XFLAG_DAX persist across unmount/mount?

Yes, it does.

i.e. DAX on XFS does not rely on a naive fs-wide mount option. You
can have applications on pmem filesystems use either DAX or normal
IO based on directory/inode flags.  If something doesn't work with
DAX, just remove the DAX flags from the directories/inodes, and it
will safely and transparently switch to page-cache based IO.
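
As a rough illustration of the per-inode flag described above, the
sketch below toggles FS_XFLAG_DAX with the generic
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls from <linux/fs.h>.  The
helper names dax_xflags() and set_inode_dax() are made up for this
example, and whether the running kernel allows the flag to change on
a given file is an assumption, not something the code checks.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR */

#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000  /* same value as linux/fs.h; fallback only */
#endif

/* Hypothetical helper: return xflags with the DAX bit set or cleared. */
uint32_t dax_xflags(uint32_t xflags, int enable)
{
    if (enable)
        return xflags | FS_XFLAG_DAX;
    return xflags & ~(uint32_t)FS_XFLAG_DAX;
}

/*
 * Hypothetical helper: read-modify-write the inode's extended flags.
 * Returns 0 on success, -1 with errno set on failure (for example
 * ENOTTY when the filesystem does not implement these ioctls).
 */
int set_inode_dax(int fd, int enable)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    fsx.fsx_xflags = dax_xflags(fsx.fsx_xflags, enable);
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Because the flag lives in the inode, it is shared by everyone who
opens the file, which is the property being debated in the quoted
text.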

<snip>

> Here's the overall point I'm trying to make: unprivileged programs
> that want to write to DAX files with userspace commit mechanisms
> (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
> privilege, and with reasonably clean APIs.  Ideally they could do this
> to any file they have write access to.

The privilege argument is irrelevant now - it was /suggested/
initially as a way of preventing people from shooting themselves in
the foot based on the immutable file model. It's clear that's not
desired, and it's not a show stopper. 
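
For readers unfamiliar with the "userspace commit mechanisms" in the
quoted text, here is a minimal sketch of flushing a range of a
mapping from userspace.  It deliberately swaps in CLFLUSH
(_mm_clflush) for CLFLUSHOPT so that it builds without extra compiler
flags; the 64-byte cache line size and the name commit_range() are
assumptions for illustration only.  On non-x86-64 builds the loop
just counts lines and flushes nothing.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence */
#endif

#define CACHELINE 64    /* assumed cache line size */

/*
 * Hypothetical helper: flush every cache line that overlaps
 * [addr, addr + len) and then issue a store fence.
 * Returns the number of cache lines visited.
 */
size_t commit_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHELINE, lines++) {
#if defined(__x86_64__)
        _mm_clflush((const void *)p);
#endif
    }
#if defined(__x86_64__)
    /* Required after CLFLUSHOPT; harmless after CLFLUSH. */
    _mm_sfence();
#endif
    return lines;
}
```

This only makes the data durable if the file's block mappings are
already stable, which is what the rest of this discussion is about.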

> Programs that want to write to
> mmapped files, DAX or otherwise, without latency spikes due to
> .page_mkwrite should be able to opt in to a heavier weight mechanism.
> But these two issues are somewhat independent, and I think they should
> be solved separately.

You seem to be calling "fdatasync on every page fault" the
"lightweight" option. That's the brute-force-with-big-hammer
solution - it's most definitely not lightweight as every page fault
has extra overhead to call ->fsync(). Sure, the API is simple, but
the runtime overhead is significant.

The lightweight *runtime* option is to set up the file in such a
way that there is never any extra overhead at page fault time.  This
is what immutable extent maps provide.  Indeed, because the mappings
never change, you could use hardware dirty tracking if you wanted,
as there's no need to look up the filesystem to do writeback as
everything needed for writeback was mapped at page fault time.  This
"map first and then just write when you need to" is *exactly how
swap files work*.

Even if you are considering the complexity of the APIs, it's hardly
"heavyweight" when it only requires a single call to fallocate()
before mmap() to set up the immutable extents on the file...
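
To make the "single call to fallocate() before mmap()" sequence
concrete, here is a sketch of the call order only.  The mode bit that
would make the extent map immutable is what this RFC thread is
proposing, so it is not assumed to exist in any header; mode 0 (plain
preallocation) stands in for it, and prealloc_and_map() is a
hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for fallocate() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: allocate blocks for the first len bytes of fd,
 * then map that range shared and writable.
 * Returns the mapping, or NULL on failure.
 */
void *prealloc_and_map(int fd, size_t len)
{
    void *p;

    /*
     * Mode 0 only preallocates.  An "immutable extents" mode, if one
     * existed, would be passed here in place of 0.
     */
    if (fallocate(fd, 0, 0, (off_t)len) < 0)
        return NULL;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

All of the setup cost is paid once, up front; nothing in this
sequence adds work to later page faults.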

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
