linux-kernel - Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

Message-ID: <6356c783c1813_1d2129457@dwillia2-mobl3.amr.corp.intel.com.notmuch>
Date:   Mon, 24 Oct 2022 10:12:35 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Dave Chinner <david@...morbit.com>,
        "Darrick J. Wong" <djwong@...nel.org>
CC:     Yang, Xiao/杨 晓 <yangx.jy@...itsu.com>,
        Gotou, Yasunori/五島 康文 
        <y-goto@...itsu.com>, Brian Foster <bfoster@...hat.com>,
        "hch@...radead.org" <hch@...radead.org>,
        Ruan, Shiyang/阮 世阳 
        <ruansy.fnst@...itsu.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        <zwisler@...nel.org>, Jeff Moyer <jmoyer@...hat.com>,
        <dm-devel@...hat.com>, <toshi.kani@....com>
Subject: Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

Dave Chinner wrote:
> On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > > In addition, I don't like your idea about the test change because it will
> > > make generic/470 become the special test for XFS. Do you know if we can fix
> > > the issue by changing the test in another way? blkdiscard -z can fix the
> > > issue because it does zero-fill rather than discard on the block device.
> > > However, blkdiscard -z will take a lot of time when the block device is
> > > large.
> > 
> > Well we /could/ just do that too, but that will suck if you have 2TB of
> > pmem. ;)
> > 
> > Maybe as an alternative path we could just create a very small
> > filesystem on the pmem and then blkdiscard -z it?
> > 
> > That said -- does persistent memory actually have a future?  Intel
> > scuttled the entire Optane product, cxl.mem sounds like expansion
> > chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> > asserts everywhere) and 6.1 (every time I run fstests now I see massive
> > data corruption).
> 
> Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
> on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
> don't think that has changed at all - I still see lots of kernel
> warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
> argument" errors.
> 
> If I turn off reflink, then instead of data corruption I get kernel
> warnings like this from fsx and fsstress workloads:
> 
> [415478.558426] ------------[ cut here ]------------
> [415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
> [415478.564028] Modules linked in:
> [415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G        W 6.1.0-rc1-dgc+ #1615
> [415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
> [415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
> [415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
> [415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
> [415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
> [415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
> [415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
> [415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
> [415478.596983] FS:  00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
> [415478.599364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
> [415478.602883] Call Trace:
> [415478.603598]  <TASK>
> [415478.604229]  dax_fault_iter+0x240/0x600
> [415478.605410]  dax_iomap_pte_fault+0x19c/0x3d0
> [415478.606706]  __xfs_filemap_fault+0x1dd/0x2b0
> [415478.607744]  __do_fault+0x2e/0x1d0
> [415478.608587]  __handle_mm_fault+0xcec/0x17b0
> [415478.609593]  handle_mm_fault+0xd0/0x2a0
> [415478.610517]  exc_page_fault+0x1d9/0x810
> [415478.611398]  asm_exc_page_fault+0x22/0x30
> [415478.612311] RIP: 0033:0x7fd71a04b9ba
> [415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
> [415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
> [415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
> [415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
> [415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
> [415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
> [415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
> [415478.625600]  </TASK>
> [415478.626087] ---[ end trace 0000000000000000 ]---
> 
> Even generic/247 is generating a warning like this from xfs_io,
> which is a mmap vs DIO racer. Given that DIO doesn't exist for
> fsdax, this test turns into just a normal write() vs mmap() racer.
> 
> Given these are the same fsdax infrastructure failures that I
> reported for 6.0, it is also likely that ext4 is still throwing
> them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
> 6.1 cycle.
> 
> > Frankly at this point I'm tempted just to turn off fsdax support for XFS
> > for the 6.1 LTS because I don't have time to fix it.
> 
> /me shrugs
> 
> Backporting fixes (whenever they come along) is a problem for the
> LTS kernel maintainer to deal with, not the upstream maintainer.
> 
> IMO, the issue right now is that the DAX maintainers seem to have
> little interest in ensuring that the FSDAX infrastructure actually
> works correctly. If anything, they seem to want to make things
> harder for block based filesystems to use pmem devices and hence
> FSDAX. e.g. the direction of the DAX core away from block interfaces
> that filesystems need for their userspace tools to manage the
> storage.
> 
> At what point do we simply say "the experiment failed, FSDAX is
> dead" and remove it from XFS altogether?

A fair question, given the regressions made it all the way into
v6.0-final. In retrospect I made the wrong priority call to focus on dax
page reference counting these past weeks.

When I fired up the dax unit tests on v6.0-rc1 I found basic problems
with the notify failure patches, which made me concerned that they had
never been tested after the final version was merged [1]. Then the rest
of the development cycle was spent fixing dax reference counting [2].
That was a longstanding wishlist item from gup and folio developers,
but, as I said, it seems the wrong priority given the lingering
regressions. I will take a look at the current dax-xfstests regression
backlog. Depending on what is still broken, that may turn up a need to
consider reverting the problematic commits if the fixes are trending
towards being invasive.

[1]: https://lore.kernel.org/all/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/

[2]: https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/