linux-kernel - Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221023220018.GX3600936@dread.disaster.area>
Date:   Mon, 24 Oct 2022 09:00:18 +1100
From:   Dave Chinner <david@...morbit.com>
To:     "Darrick J. Wong" <djwong@...nel.org>
Cc:     Yang, Xiao/杨 晓 <yangx.jy@...itsu.com>,
        Gotou, Yasunori/五島 康文 
        <y-goto@...itsu.com>, Brian Foster <bfoster@...hat.com>,
        "hch@...radead.org" <hch@...radead.org>,
        Ruan, Shiyang/阮 世阳 
        <ruansy.fnst@...itsu.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        zwisler@...nel.org, Jeff Moyer <jmoyer@...hat.com>,
        dm-devel@...hat.com, toshi.kani@....com
Subject: Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

On Fri, Oct 21, 2022 at 07:11:02PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 20, 2022 at 10:17:45PM +0800, Yang, Xiao/杨 晓 wrote:
> > In addition, I don't like your idea about the test change because it will
> > make generic/470 become the special test for XFS. Do you know if we can fix
> > the issue by changing the test in another way? blkdiscard -z can fix the
> > issue because it does zero-fill rather than discard on the block device.
> > However, blkdiscard -z will take a lot of time when the block device is
> > large.
> 
> Well we /could/ just do that too, but that will suck if you have 2TB of
> pmem. ;)
> 
> Maybe as an alternative path we could just create a very small
> filesystem on the pmem and then blkdiscard -z it?
> 
> That said -- does persistent memory actually have a future?  Intel
> scuttled the entire Optane product, cxl.mem sounds like expansion
> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird kernel
> asserts everywhere) and 6.1 (every time I run fstests now I see massive
> data corruption).

Yup, I see the same thing. fsdax was a train wreck in 6.0 - broken
on both ext4 and XFS. Now that I run a quick check on 6.1-rc1, I
don't think that has changed at all - I still see lots of kernel
warnings, data corruption and "XFS_IOC_CLONE_RANGE: Invalid
argument" errors.

If I turn off reflink, then instead of data corruption I get kernel
warnings like this from fsx and fsstress workloads:

[415478.558426] ------------[ cut here ]------------
[415478.560548] WARNING: CPU: 12 PID: 1515260 at fs/dax.c:380 dax_insert_entry+0x2a5/0x320
[415478.564028] Modules linked in:
[415478.565488] CPU: 12 PID: 1515260 Comm: fsx Tainted: G        W 6.1.0-rc1-dgc+ #1615
[415478.569221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[415478.572876] RIP: 0010:dax_insert_entry+0x2a5/0x320
[415478.574980] Code: 08 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 58 20 48 8d 53 01 e9 65 ff ff ff 48 8b 58 20 48 8d 53 01 e9 50 ff ff ff <0f> 0b e9 70 ff ff ff 31 f6 4c 89 e7 e8 da ee a7 00 eb a4 48 81 e6
[415478.582740] RSP: 0000:ffffc90002867b70 EFLAGS: 00010002
[415478.584730] RAX: ffffea000f0d0800 RBX: 0000000000000001 RCX: 0000000000000001
[415478.587487] RDX: ffffea0000000000 RSI: 000000000000003a RDI: ffffea000f0d0840
[415478.590122] RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
[415478.592380] R10: ffff888800dc9c18 R11: 0000000000000001 R12: ffffc90002867c58
[415478.594865] R13: ffff888800dc9c18 R14: ffffc90002867e18 R15: 0000000000000000
[415478.596983] FS:  00007fd719fa2b80(0000) GS:ffff88883ec00000(0000) knlGS:0000000000000000
[415478.599364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[415478.600905] CR2: 00007fd71a1ad640 CR3: 00000005cf241006 CR4: 0000000000060ee0
[415478.602883] Call Trace:
[415478.603598]  <TASK>
[415478.604229]  dax_fault_iter+0x240/0x600
[415478.605410]  dax_iomap_pte_fault+0x19c/0x3d0
[415478.606706]  __xfs_filemap_fault+0x1dd/0x2b0
[415478.607744]  __do_fault+0x2e/0x1d0
[415478.608587]  __handle_mm_fault+0xcec/0x17b0
[415478.609593]  handle_mm_fault+0xd0/0x2a0
[415478.610517]  exc_page_fault+0x1d9/0x810
[415478.611398]  asm_exc_page_fault+0x22/0x30
[415478.612311] RIP: 0033:0x7fd71a04b9ba
[415478.613168] Code: 4d 29 c1 4c 29 c2 48 3b 15 db 95 11 00 0f 87 af 00 00 00 0f 10 01 0f 10 49 f0 0f 10 51 e0 0f 10 59 d0 48 83 e9 40 48 83 ea 40 <41> 0f 29 01 41 0f 29 49 f0 41 0f 29 51 e0 41 0f 29 59 d0 49 83 e9
[415478.617083] RSP: 002b:00007ffcf277be18 EFLAGS: 00010206
[415478.618213] RAX: 00007fd71a1a3fc5 RBX: 0000000000000fc5 RCX: 00007fd719f5a610
[415478.619854] RDX: 000000000000964b RSI: 00007fd719f50fd5 RDI: 00007fd71a1a3fc5
[415478.621286] RBP: 0000000000030fc5 R08: 000000000000000e R09: 00007fd71a1ad640
[415478.622730] R10: 0000000000000001 R11: 00007fd71a1ad64e R12: 0000000000009699
[415478.624164] R13: 000000000000a65e R14: 00007fd71a1a3000 R15: 0000000000000001
[415478.625600]  </TASK>
[415478.626087] ---[ end trace 0000000000000000 ]---

Even generic/247 is generating a warning like this from xfs_io,
which is a mmap vs DIO racer. Given that DIO doesn't exist for
fsdax, this test turns into just a normal write() vs mmap() racer.

Given these are the same fsdax infrastructure failures that I
reported for 6.0, it is also likely that ext4 is still throwing
them. IOWs, whatever got broke in the 6.0 cycle wasn't fixed in the
6.1 cycle.

> Frankly at this point I'm tempted just to turn of fsdax support for XFS
> for the 6.1 LTS because I don't have time to fix it.

/me shrugs

Backporting fixes (whenever they come along) is a problem for the
LTS kernel maintainer to deal with, not the upstream maintainer.

IMO, the issue right now is that the DAX maintainers seem to have
little interest in ensuring that the FSDAX infrastructure actually
works correctly. If anything, they seem to want to make things
harder for block based filesystems to use pmem devices and hence
FSDAX. e.g. the direction of the DAX core away from block interfaces
that filesystems need for their userspace tools to manage the
storage.

At what point do we simply say "the experiment failed, FSDAX is
dead" and remove it from XFS altogether?

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com