Message-ID: <3b693647-5e82-4c39-8017-22cada56eb55@leemhuis.info>
Date: Wed, 15 Jan 2025 09:40:04 +0100
From: Thorsten Leemhuis <regressions@...mhuis.info>
To: Bruno Gravato <bgravato@...il.com>, Stefan <linux-kernel@...g.de>
Cc: Keith Busch <kbusch@...nel.org>, bugzilla-daemon@...nel.org,
Adrian Huang <ahuang12@...ovo.com>,
Linux kernel regressions list <regressions@...ts.linux.dev>,
linux-nvme@...ts.infradead.org, Jens Axboe <axboe@...com>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
LKML <linux-kernel@...r.kernel.org>, Christoph Hellwig <hch@....de>
Subject: Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock
X600M-STX + Ryzen 8700G
On 15.01.25 07:37, Bruno Gravato wrote:
> I finally got the chance to run some more tests with some interesting
> and unexpected results...
FWIW, I briefly looked into the issue in the meantime as well and can
reproduce it[1] locally with my Samsung SSD 990 EVO Plus 4TB in the main
M.2 slot of my DeskMini X600, using btrfs on a mainline kernel with a
config from Fedora rawhide.
So what can those of us affected by the problem do to narrow it down?
What does it mean that disabling the NVMe device's write cache often,
but apparently not always, helps? Is it just reducing the chance of the
problem occurring, or accidentally working around it?
hch initially brought up that swiotlb seems to be used. Are there any
BIOS setup settings we should try? I tried a few changes yesterday, but
I still get the "PCI-DMA: Using software bounce buffering for IO
(SWIOTLB)" message in the log and not a single line mentioning DMAR.
Ciao, Thorsten
[1] see start of this thread and/or
https://bugzilla.kernel.org/show_bug.cgi?id=219609 for details
> I put another disk (WD Black SN750) in the main M.2 slot (the
> problematic one), but kept my main disk (Solidigm P44 Pro) in the
> secondary M.2 slot (where it doesn't have any issues).
> I reran my test: step 1) copy a large number of files to the WD disk
> (main slot); step 2) run btrfs scrub on it and expect some checksum
> errors.
> To my surprise there were no errors!
> I tried it twice with different kernels (6.2.6 and 6.11.5), booting
> from either disk (I have Linux installations on both).
> Still no errors.
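
For anyone else who wants to try the same thing, the procedure above
roughly corresponds to the following (a sketch; the data set and mount
point are placeholders, and -B merely runs the scrub in the foreground):

$ cp -a /path/to/testdata /mnt/wd/    # step 1: copy a large tree
$ sync                                # make sure it is written out
$ btrfs scrub start -B /mnt/wd        # step 2: re-read and verify checksums
$ btrfs scrub status /mnt/wd          # summary, including csum errors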
>
> I then removed the Solidigm disk from the secondary and kept the WD
> disk in the main M.2 slot.
> I reran my tests (on kernel 6.11.5) and bang! btrfs scrub now detected
> quite a few checksum errors!
>
> I then tried disabling volatile write cache with "nvme set-feature
> /dev/nvme0 -f 6 -v 0"
> "nvme get-feature /dev/nvme0 -f 6" confirmed it was disabled, but
> /sys/block/nvme0n1/queue/fua still showed 1... Was that supposed to
> turn into 0?
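
If I read the nvme driver correctly, the block layer's write_cache and
fua queue flags are derived from the controller's Identify data when the
namespace is scanned, so toggling feature 6 at runtime would not be
reflected there; that is my understanding, though, not something I have
verified across kernel versions. Both views can be cross-checked with
standard nvme-cli and block-layer interfaces:

$ sudo nvme get-feature /dev/nvme0 -f 6 -H   # decoded volatile write cache state
$ cat /sys/block/nvme0n1/queue/write_cache   # "write back" or "write through"
$ cat /sys/block/nvme0n1/queue/fua           # 1 = FUA writes advertised

Writing "write through" to the write_cache attribute additionally tells
the block layer to stop issuing cache flushes, which might be another
useful data point.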
>
> I re-ran my test, but I still got checksum errors on btrfs scrub. So
> disabling the volatile write cache (assuming I did it correctly) didn't
> make a difference in my case.
>
> I put the Solidigm disk back into the secondary slot, booted, and reran
> the test on the WD disk (main slot) just to be triple sure, and still
> no errors.
>
> So it looks like the corruption only happens if only the main M.2 slot
> is occupied and the secondary M.2 slot is free.
> With two nvme disks (one on each M.2 slot), there were no errors at all.
>
> Stefan, did you ever try running your tests with 2 nvme disks
> installed on both slots? Or did you use only one slot at a time?
$ journalctl -k | grep -i -e DMAR -e IOMMU -e AMD-Vi -e SWIOTLB
AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
pci 0000:00:01.0: Adding to iommu group 0
pci 0000:00:01.3: Adding to iommu group 1
pci 0000:00:02.0: Adding to iommu group 2
pci 0000:00:02.3: Adding to iommu group 3
pci 0000:00:03.0: Adding to iommu group 4
pci 0000:00:04.0: Adding to iommu group 5
pci 0000:00:08.0: Adding to iommu group 6
pci 0000:00:08.1: Adding to iommu group 7
pci 0000:00:08.2: Adding to iommu group 8
pci 0000:00:08.3: Adding to iommu group 9
pci 0000:00:14.0: Adding to iommu group 10
pci 0000:00:14.3: Adding to iommu group 10
pci 0000:00:18.0: Adding to iommu group 11
pci 0000:00:18.1: Adding to iommu group 11
pci 0000:00:18.2: Adding to iommu group 11
pci 0000:00:18.3: Adding to iommu group 11
pci 0000:00:18.4: Adding to iommu group 11
pci 0000:00:18.5: Adding to iommu group 11
pci 0000:00:18.6: Adding to iommu group 11
pci 0000:00:18.7: Adding to iommu group 11
pci 0000:01:00.0: Adding to iommu group 12
pci 0000:02:00.0: Adding to iommu group 13
pci 0000:03:00.0: Adding to iommu group 14
pci 0000:03:00.1: Adding to iommu group 15
pci 0000:03:00.2: Adding to iommu group 16
pci 0000:03:00.3: Adding to iommu group 17
pci 0000:03:00.4: Adding to iommu group 18
pci 0000:03:00.6: Adding to iommu group 19
pci 0000:04:00.0: Adding to iommu group 20
pci 0000:04:00.1: Adding to iommu group 21
pci 0000:05:00.0: Adding to iommu group 22
AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC
AMD-Vi: Interrupt remapping enabled
AMD-Vi: Virtual APIC enabled
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).