linux-ext4 - Re: [Q] gfs2: mmap write vs. punch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190906231526.GA16973@dread.disaster.area>
Date:   Sat, 7 Sep 2019 09:15:26 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Andreas Gruenbacher <agruenba@...hat.com>
Cc:     cluster-devel <cluster-devel@...hat.com>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-ext4 <linux-ext4@...r.kernel.org>,
        linux-xfs@...r.kernel.org,
        "Darrick J . Wong" <darrick.wong@...cle.com>,
        Christoph Hellwig <hch@....de>,
        Lukas Czerner <lczerner@...hat.com>
Subject: Re: [Q] gfs2: mmap write vs. punch_hole consistency

On Fri, Sep 06, 2019 at 11:48:31PM +0200, Andreas Gruenbacher wrote:
> On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner <david@...morbit.com> wrote:
> > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote:
> > > Hi,
> > >
> > > I've just fixed a mmap write vs. truncate consistency issue on gfs on
> > > filesystems with a block size smaller that the page size [1].
> > >
> > > It turns out that the same problem exists between mmap write and hole
> > > punching, and since xfstests doesn't seem to cover that,
> >
> > AFAIA, fsx exercises it pretty often. Certainly it's found problems
> > with XFS in the past w.r.t. these operations.
> >
> > > I've written a
> > > new test [2].
> >
> > I suspect that what we really want is a test that runs
> > _test_generic_punch using mmap rather than pwrite...
> >
> > > Ext4 and xfs both pass that test; they both apparently
> > > mark the pages that have a hole punched in them as read-only so that
> > > page_mkwrite is called before those pages can be written to again.
> >
> > XFS invalidates the range being hole punched (see
> > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL, which means any
> > attempt to fault that page back in will block on the MMAPLOCK until
> > the hole punch finishes.
> 
> This isn't about writes during the hole punching, this is about writes
> once the hole is punched.

How do you prevent this:

hole punch process:		other process
  clean page
  invalidate page
				write page fault
				instantiate new page
				page_mkwrite
				page dirty
  do hole punch
  overwrite hole

Firstly, you end up with a dirty page over a hole with no backing
store, and second, you get no page fault on overwrite because the
pages are already dirty.

That's the problem the MMAPLOCK in XFS solves - it's integral to
ensuring the page faults on mmap write are correctly serialised with
both the page cache invalidation and the underlying extent
manipulation that the hole punch operation then does.

i.e. if you want a page fault /after/ a hole punch, you have to
prevent it occurring during the hole punch after the page has
already been marked clean and/or invalidated.

> For example, the test case I've posted
> creates the following file layout with 1k blocksize:
> 
>   DDDD DDDD DDDD
> 
> Then it punches a hole like this:
> 
>   DDHH HHHH HHDD
> 
> Then it fills the hole again with mwrite:
> 
>   DDDD DDDD DDDD
> 
> As far as I can tell, that needs to trigger page faults on all three
> pages.

Yes, but only /after/ the hole has been punched.

> I did get these on ext4; judging from the fact that xfs works,
> the also seem to occur there; but on gfs2, page_mkwrite isn't called
> for the two partially mapped pages, only for the page in the middle
> that's entirely within the hole. And I don't see where those pages are
> marked read-only; it appears like pagecache_isize_extended isn't
> called on ext4 or xfs. So how does this work there?

As I said in my last response, the answer is in
xfs_flush_unmap_range(). That uses truncate_pagecache_range() to do
the necessary page cache manipulation.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com