[<prev] [next>] [day] [month] [year] [list]
Message-ID: <OS3PR01MB9499BE1E2B3E1823B6F8F1C4832F9@OS3PR01MB9499.jpnprd01.prod.outlook.com>
Date: Sun, 23 Oct 2022 07:04:53 +0000
From: "yangx.jy@...itsu.com" <yangx.jy@...itsu.com>
To: "Darrick J. Wong" <djwong@...nel.org>
CC: Brian Foster <bfoster@...hat.com>,
"hch@...radead.org" <hch@...radead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
"nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"david@...morbit.com" <david@...morbit.com>,
"zwisler@...nel.org" <zwisler@...nel.org>,
Jeff Moyer <jmoyer@...hat.com>,
"dm-devel@...hat.com" <dm-devel@...hat.com>,
"toshi.kani@....com" <toshi.kani@....com>
Subject: RE: [PATCH] xfs: fail dax mount if reflink is enabled on a partition
On 2022/10/22 10:11, Darrick J. Wong wrote:
>> We need to fix the issue by discarding XFS log on the block device.
>> mkfs.xfs will try to discard the blocks including XFS log by calling
>> ioctl(BLKDISCARD) but it will ignore error silently when the block
>> device doesn't support ioctl(BLKDISCARD).
> ...but I think here's where I think your understanding isn't correct.
> It might help to show how the nested logging creates its own problems.
>
> First, let's say there's a block B that contains some stale garbage
> AAAA.
>
> XFS writes a block into the XFS log (call the block L) with the
> instructions "allocate block B and write CCCC to block B".
> dm-logwrites doesn't know or care about the contents of the blocks
> that it is told to write; it only knows that XFS told it to write some
> data (the
> instructions) to block L. So it remembers the fact that some data got
> written to L, but it doesn't know about B at all.
>
> At the point where we create the dm-logwrites preunmap mark, it's only
> tracking L. It is not tracking B. After the mark is taken, the XFS
> AIL writes CCCC to B, and only then does dm-logwrites begin tracking B.
> Hence B is not included in the preunmap mark. The pre-unmount process
> in XFS writes to the XFS log "write DDDD to block B" and the unmount
> process checkpoints the log contents, so now block B contains contains
> DDDD.
>
> Now the test wants to roll to the preunmap mark. Unfortunately,
> dm-logwrites doesn't record former block contents, which means that
> the log replay tools cannot roll backwards from "umount" to "preunmap"
> -- they can only roll forward from the beginning. So there's no way
> to undo writing DDDD or CCCC to B. IOWs, there's no way to revert B's
> state back to AAAA when doing dm-logwrites recovery.
>
> Now XFS log recovery starts. It sees "allocate block B and write CCCC
> to block B". However, it reads block B, sees that it contains DDDD,
> and it skips writing CCCC. Incorrectly. The only way to avoid this
> is to zero B before replaying the dm-logwrites.
>
> So you could solve the problem via BLKDISCARD, or writing zeroes to
> the entire block device, or scanning the metadata and writing zeroes
> to just those blocks, or by adding undo buffer records to dm-logwrites
> and teaching it to do a proper rollback.
Hi Darrick,
Thanks for your patient explanation.
Do you know if XFS log records still use buffer write? In other words, they cannot be written into the block device in DAX mode, right?
In fact, I can reproduce the inconsistent filesystem issue on
generic/482 but cannot reproduce the issue on generic/470.
>
>> Discarding XFS log is what you said "reinitialize the entire block
>> device", right?
> No, I really meant the/entire/ block device.
>
>>> I think the only way to fix this test is (a) revert all of
>>> Christoph's changes so far and scuttle the divorce; or (b) change this test like so:
>> Sorry, I didn't know which Christoph's patches need to be reverted?
>> Could you tell me the URL about Christoph's patches?
> Eh, it's a whole long series of patches scuttling various parts where
> pmem could talk to the block layer. I doubt he'll accept you
> reverting his removal code.
Where can I find the Christoph's patch set you mentioned.
I just want to know the content of Christoph's patch set.
>
>>> 1. Create a large sparse file on $TEST_DIR and losetup that sparse
>>> file. The resulting loop device will not have dax capability.
>>>
>>> 2. Set up the dmthin/dmlogwrites stack on top of this loop device.
>>>
>>> 3. Call mkfs.xfs with the SCRATCH_DEV (which hopefully is a pmem
>>> device) as the realtime device, and set the daxinherit and rtinherit
>>> flags on the root directory. The result is a filesystem with a data
>>> section that the kernel will treat as a regular block device, a
>>> realtime section backed by pmem, and the necessary flags to make
>>> sure that the test file will actually get fsdax mode.
>>>
>>> 4. Acknowledge that we no longer have any way to test MAP_SYNC
>>> functionality on ext4, which means that generic/470 has to move to
>>> tests/xfs/.
>> Sorry, I didn't understand why the above test change can fix the issue.
> XFS supports two-device filesystems -- the "realtime" section, and the
> "data" section. FS metadata and log all live in the "data" section.
>
> So we change the test to set up some regular files, loop-mount the
> files, set up the requisite dm-logwrites stuff atop the loop devices,
> and format the XFS with the data section backed by the dm-logwrites
> device, and make the realtime section backed by the pmem.
>
> This way the log replay program can actually discard the data device
> (because it's a regular file) and replay the log forward to the
> preunmap mark. The pmem device is not involved in the replay at all,
> since changes to file data are never logged. It now becomes
> irrelevant that pmem no longer supports device mapper.
>
>> Could you tell me which step can discard XFS log?
> (None of the steps do that.)
>
>> In addition, I don't like your idea about the test change because it
>> will make generic/470 become the special test for XFS. Do you know if
>> we can fix the issue by changing the test in another way? blkdiscard
>> -z can fix the issue because it does zero-fill rather than discard on the block device.
>> However, blkdiscard -z will take a lot of time when the block device
>> is large.
> Well we/could/ just do that too, but that will suck if you have 2TB
> of
> pmem.;)
>
> Maybe as an alternative path we could just create a very small
> filesystem on the pmem and then blkdiscard -z it?
Good idea, I have sent a patch set to do it.
https://lore.kernel.org/fstests/20221023064810.847110-1-yangx.jy@fujitsu.com/T/#t
>
> That said -- does persistent memory actually have a future? Intel
> scuttled the entire Optane product, cxl.mem sounds like expansion
> chassis full of DRAM, and fsdax is horribly broken in 6.0 (weird
> kernel asserts everywhere) and 6.1 (every time I run fstests now I see
> massive data corruption).
As far as I know, cxl.mem will take use of nvdimm driver and can be used by many existing applications.
Best Regards,
Xiao Yang
Powered by blists - more mailing lists