Message-ID: <7e5b72e9-7f7f-4e68-a42a-4411cc5778b1@simg.de>
Date: Wed, 15 Jan 2025 17:26:14 +0100
From: Stefan <linux-kernel@...g.de>
To: Bruno Gravato <bgravato@...il.com>, bugzilla-daemon@...nel.org
Cc: bugzilla-daemon@...nel.org, Keith Busch <kbusch@...nel.org>,
 Adrian Huang <ahuang12@...ovo.com>,
 Linux kernel regressions list <regressions@...ts.linux.dev>,
 linux-nvme@...ts.infradead.org, Jens Axboe <axboe@...com>,
 "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
 LKML <linux-kernel@...r.kernel.org>,
 Thorsten Leemhuis <regressions@...mhuis.info>, Christoph Hellwig <hch@....de>
Subject: Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock
 X600M-STX + Ryzen 8700G


Hi,

On 15.01.25 at 14:14, Bruno Gravato wrote:
> If yours behaves like mine, I'd expect that if you swap the disks in
> config 2, that you won't have any errors as well...

Yeah, I would just need to plug something into the 2nd M.2 socket, but
that can't be done remotely. I will do that on the weekend or next week.

BTW, is there a kernel parameter to ignore an NVMe/PCI device? If the
corruptions appear again after disabling the 2nd SSD, it is more likely
that this is a kernel problem, e.g. a driver writing to memory reserved
for some other driver/component. Such a bug may only occur under rare
conditions. AFAIU, the patch "nvme-pci: place descriptor addresses in
iod" from 6.3-rc1 attempts to use some space which is otherwise unused.
Unfortunately I was not able to revert that patch, because later changes
depend on it.
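
Maybe something like this would work (untested on my side; the bus
address and vendor:device ID below are placeholders, and pci-stub
requires CONFIG_PCI_STUB with nvme built as a module, so the stub can
claim the device first):

    # look up the vendor:device ID of the 2nd SSD (placeholder bus address)
    lspci -nn -s 02:00.0
    # boot with the stub driver claiming the device before nvme, i.e. add
    #   pci-stub.ids=144d:a808
    # to the kernel command line, or unbind it from nvme at runtime:
    echo -n 0000:02:00.0 > /sys/bus/pci/drivers/nvme/unbind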

So for now I only tried whether `NVME_MAX_SEGS 127` alone helps (see
the message from Matthias). The answer is no. This only seems to be an
upper limit, because `/sys/class/block/nvme0n1/queue/max_segments`
reports 33 with unmodified kernels >= 6.3.7. With older kernels, or with
the patch "nvme-pci: clamp max_hw_sectors based on DMA optimized
limitation" (introduced in 6.3.7) reverted, this value is 127 and the
corruptions disappear.
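
For reference, both limits can be read from sysfs without rebuilding
anything (nvme0n1 is just the device name on my system):

    # segment limit the block layer reports for the device
    cat /sys/class/block/nvme0n1/queue/max_segments
    # clamped max_hw_sectors, in KiB
    cat /sys/class/block/nvme0n1/queue/max_hw_sectors_kb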

I guess this value somehow has to be 127. In my case it is sufficient
to revert the patch from 6.3.7. In Matthias's case the value then
becomes 128 and additionally has to be limited using `NVME_MAX_SEGS 127`.
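
In case somebody wants to try the same revert: the commit can be located
in a kernel git tree by its subject line (the commit id is tree-specific,
so only a placeholder here):

    # find the clamp patch by its subject
    git log --oneline --grep='clamp max_hw_sectors' -- drivers/nvme/host/pci.c
    # revert it on top of the tree being built
    git revert <commit-id-from-above>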

Regards Stefan


