[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170829154922.GA24592@quack2.suse.cz>
Date: Tue, 29 Aug 2017 17:49:22 +0200
From: Jan Kara <jack@...e.cz>
To: Dave Chinner <david@...morbit.com>
Cc: Christoph Hellwig <hch@...radead.org>,
Theodore Ts'o <tytso@....edu>, linux-ext4@...r.kernel.org,
Lukas Czerner <lczerner@...hat.com>, linux-xfs@...r.kernel.org
Subject: Re: [PATCH] ext4: introduce per-inode DAX flag
On Mon 28-08-17 20:10:14, Dave Chinner wrote:
> On Mon, Aug 28, 2017 at 12:38:54AM -0700, Christoph Hellwig wrote:
> > On Sat, Aug 26, 2017 at 09:33:58AM +1000, Dave Chinner wrote:
> > > > Nah, -o dax works very well. It's just the flag instead of the -o dax
> > > > option or rather switching it on a mapped file will probably be very dangerous.
> > >
> > > In what way is it dangerous, Christoph?
> >
> > When I run the following script as a normal user:
> >
> > FSXDIR=~/xfstests/ltp/
> > FILE=/mnt/foo
> >
> > ${FSXDIR}/fsx $FILE &
> >
> > while true; do
> > xfs_io -c 'chattr +x' $FILE
> > xfs_io -c 'chattr -x' $FILE
> > done
> >
> > I get this nice little crash:
>
> Can you please package that up into an xfstest?
>
> > root@...tvm:~# sh test.sh
> > skipping zero size read
> > skipping insert range behind EOF
> > truncating to largest ever: 0x3a290
> > zero_range to largest ever: 0x3a8d1
> > zero_range to largest ever: 0x3fe3e
> > zero_range to largest ever: 0x40000
> > [ 344.898390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [ 344.899306] IP: iomap_page_mkwrite+0x17/0xf0
> > [ 344.899795] PGD 7db37067
> > [ 344.899796] P4D 7db37067
> > [ 344.900099] PUD 78c61067
> > [ 344.900389] PMD 0
> > [ 344.900665]
> > [ 344.901075] Oops: 0000 [#1] SMP
> > [ 344.901536] Modules linked in:
> > [ 344.901716] CPU: 3 PID: 6052 Comm: fsx Not tainted 4.12.0+ #2199
> > [ 344.901716] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> > [ 344.901716] task: ffff880079a0da00 task.stack: ffffc900068a4000
> > [ 344.901716] RIP: 0010:iomap_page_mkwrite+0x17/0xf0
> > [ 344.901716] RSP: 0000:ffffc900068a7d38 EFLAGS: 00010246
> > [ 344.901716] RAX: ffff8800798dd0d0 RBX: 0000000000000200 RCX: 0000000000000001
> > [ 344.901716] RDX: 0000000070eb898e RSI: ffffffff82109010 RDI: ffffc900068a7df0
> > [ 344.901716] RBP: ffffc900068a7d60 R08: ffffffff82ff9fa8 R09: 0000000000000000
> > [ 344.901716] R10: ffffc900068a7cb0 R11: ffffffff8159b5cc R12: ffffffff82109010
> > [ 344.901716] R13: 0000000000000000 R14: ffffc900068a7df0 R15: ffff88007da89580
> ^^^^^^^^^^^^^^^^
>
> vmf->page is null.
>
> Which means IS_DAX changed half way through a fault, despite us
> holding the MMAPLOCK and protecting all the filesystem side of the
> fault code from races.
>
> Seems to me that even allowing filesystems to switch between
> different mapping tree behaviours based on an inode flag is a
> fundamentally broken model. The fault action that needs to taken by
> the filesystem has already been predetermined by the fault
> processing that has already occurred and placed into the contents of
> the vmf we've been passed.
I don't think the problem is actually within MM in this particular case.
The problem seems to be that xfs_filemap_fault() checks IS_DAX without
holding MMAPLOCK and so it can change after that test and before the test
in xfs_filemap_page_mkwrite().
> Hence I think that if we need to process the fault as a DAX fault
> then the vmf needs to tell us that, not require us to look up an
> inode flag to determine what to do. ANd if the inode flag changes,
> then that needs to be propagated through the mapping and VMAs in a
> sane fashion, not just run an invalidation from the filesystem. I
> don't know enough about the VM code to say anything useful about how
> this needs to be set up, but it's clear that mapping invalidation
> and behaviour swaps can't be completely serialised against page
> faults from the filesystem side.
But there is no difference in vmf setup from generic MM side. In particular
vmf->page is set by the ->fault handler and then it is passed to
->page_mkwrite handler. And changes to mapping behavior between these two
callbacks should be prevented by the page lock / radix entry lock...
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists