linux-ext4 - Re: [PATCH 2/2] dax: move writeback calls into the filesystems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160208201808.GK27429@dastard>
Date:	Tue, 9 Feb 2016 07:18:08 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Dan Williams <dan.j.williams@...el.com>
Cc:	Ross Zwisler <ross.zwisler@...ux.intel.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Theodore Ts'o <tytso@....edu>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Andreas Dilger <adilger.kernel@...ger.ca>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jan Kara <jack@...e.com>,
	Matthew Wilcox <willy@...ux.intel.com>,
	linux-ext4 <linux-ext4@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	Linux MM <linux-mm@...ck.org>,
	"linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
	XFS Developers <xfs@....sgi.com>, jmoyer <jmoyer@...hat.com>
Subject: Re: [PATCH 2/2] dax: move writeback calls into the filesystems

On Mon, Feb 08, 2016 at 12:18:11AM -0800, Dan Williams wrote:
> On Sun, Feb 7, 2016 at 1:50 PM, Dave Chinner <david@...morbit.com> wrote:
> > On Sun, Feb 07, 2016 at 11:13:51AM -0800, Dan Williams wrote:
> >> On Sat, Feb 6, 2016 at 11:19 PM, Ross Zwisler
> >> <ross.zwisler@...ux.intel.com> wrote:
> >> > Previously calls to dax_writeback_mapping_range() for all DAX filesystems
> >> > (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
> >> > dax_writeback_mapping_range() needs a struct block_device, and it used to
> >> > get that from inode->i_sb->s_bdev.  This is correct for normal inodes
> >> > mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
> >> > block devices and for XFS real-time files.
> >> >
> >> > Instead, call dax_writeback_mapping_range() directly from the filesystem or
> >> > raw block device fsync/msync code so that they can supply us with a valid
> >> > block device.
> >> >
> >> > It should be noted that this will reduce the number of calls to
> >> > dax_writeback_mapping_range() because filemap_write_and_wait_range() is
> >> > called in the various filesystems for operations other than just
> >> > fsync/msync.  Both ext4 & XFS call filemap_write_and_wait_range() outside
> >> > of ->fsync for hole punch, truncate, and block relocation
> >> > (xfs_shift_file_space() && ext4_collapse_range()/ext4_insert_range()).
> >> >
> >> > I don't believe that these extra flushes are necessary in the DAX case.  In
> >> > the page cache case when we have dirty data in the page cache, that data
> >> > will be actively lost if we evict a dirty page cache page without flushing
> >> > it to media first.  For DAX, though, the data will remain consistent with
> >> > the physical address to which it was written regardless of whether it's in
> >> > the processor cache or not - really the only reason I see to flush is in
> >> > response to a fsync or msync so that our data is durable on media in case
> >> > of a power loss.  The case where we could throw dirty data out of the page
> >> > cache and essentially lose writes simply doesn't exist.
> >> >
> >> > Signed-off-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
> >> > ---
> >> >  fs/block_dev.c      |  7 +++++++
> >> >  fs/dax.c            |  5 ++---
> >> >  fs/ext2/file.c      | 10 ++++++++++
> >> >  fs/ext4/fsync.c     | 10 +++++++++-
> >> >  fs/xfs/xfs_file.c   | 12 ++++++++++--
> >> >  include/linux/dax.h |  4 ++--
> >> >  mm/filemap.c        |  6 ------
> >> >  7 files changed, 40 insertions(+), 14 deletions(-)
> >>
> >> This sprinkling of dax specific fixups outside of vm_operations_struct
> >> routines still has me thinking that we are going in the wrong
> >> direction for fsync/msync support.
> >>
> >> If an application is both unaware of DAX and doing mmap I/O it is
> >> better served by the page cache where writeback is durable by default.
> >> We expect DAX-aware applications to assume responsibility for cpu
> >> cache management [1].  Making DAX mmap semantics explicit opt-in
> >> solves not only durability support, but also the current problem that
> >> DAX gets silently disabled leaving an app to wonder if it really got a
> >> direct mapping. DAX also silently picks pud, pmd, or pte mappings
> >> which is information an application would really like to know at map
> >> time.
> >>
> >> The proposal: make applications explicitly request DAX semantics with
> >> a new MAP_DAX flag and fail if DAX is unavailable.
> >
> > No.
> >
> > As I've stated before, the entire purpose of enabling DAX through
> > existing filesytsems like XFS and ext4 is so that existing
> > applications work with DAX *without modification*.
> >
> > That is, applications can be entirely unaware of the fact that the
> > filesystem is giving them direct access to the storage because the
> > access and failure semantics of DAX enabled mmap are *identical to
> > the existing mmap semantics*.
> >
> > Given this, the app doesn't need to care whether DAX is enabled or
> > not; all that will be seen is a difference in speed of access.
> > Enabling and disabling DAX is, at this point, purely an
> > administration decision - if the hardware and filesystem supports
> > it, it can be turned on without having to wait years for application
> > developers to add support for it....
> 
> Setting aside the current block zeroing problem you seem to assuming
> that DAX will always be faster and that may not be true at a media
> level.  Waiting years for some applications to determine if DAX makes
> sense for their use case seems completely reasonable.  In the meantime
> the apps that are already making these changes want to know that a DAX
> mapping request has not silently dropped backed to page cache.  They
> also want to know if they successfully jumped through all the hoops to
> get a larger than pte mapping.
> 
> I agree it is useful to be able to force DAX on an unmodified
> application to see what happens, and it follows that if those
> applications want to run in that mode they will need functional
> fsync()...
> 
> I would feel better if we were talking about specific applications and
> performance numbers to know if forcing DAX on application is a debug
> facility or a production level capability.  You seem to have already
> made that determination and I'm curious what I'm missing.

I'm not setting any policy here at all.  This whole argument is
based around the DAX mount option doing "global fs enable or
silently turning it off" and the application not knowing about that.

The whole point of having a persistent per-inode DAX flags is that
it is a policy mechanism, not a policy.  The application can, if it
is DAX aware, directly control whether DAX is used on a file or not.
The application can even query and clear that persistent inode flag
if it is configured not to (or cannot) use DAX.

If the filesystem cannot support DAX, then we can error out attempts
to set the DAX flag and then the app knows DAX is not available.
i.e. the attempt to set policy failed. If the flag is set, then the
inode will *always* use DAX - there is no "fall back to page cache"
when DAX is enabled.

If the applicaiton is not DAX aware, then the admin can control the
DAX policy by manipulating these flags themselves, and hence control
whether DAX is used by the application or not.

If you think I'm dictating policy for DAX users and application,
then you haven't understood anything I've previously said about why
the DAX mount option needs to die before any of this is considered
production ready. DAX is not an opaque "all or nothing" option. XFS
will provide apps and admins with fine-grained, persistent,
discoverable policy flags to allow admins and applications to set
DAX policies however they see fit. This simply cannot be done if the
only knob you have is a mount option that may or may not stick.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html