Message-ID: <20250204061208.GA29300@lst.de>
Date: Tue, 4 Feb 2025 07:12:08 +0100
From: Christoph Hellwig <hch@....de>
To: Bruno Gravato <bgravato@...il.com>
Cc: Stefan <linux-kernel@...g.de>,
	"Dr. David Alan Gilbert" <linux@...blig.org>,
	Christoph Hellwig <hch@....de>,
	Thorsten Leemhuis <linux@...mhuis.info>,
	Mario Limonciello <mario.limonciello@....com>,
	Keith Busch <kbusch@...nel.org>, Adrian Huang <ahuang12@...ovo.com>,
	Linux kernel regressions list <regressions@...ts.linux.dev>,
	linux-nvme@...ts.infradead.org, Jens Axboe <axboe@...com>,
	"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of
 AsRock X600M-STX + Ryzen 8700G

On Sun, Feb 02, 2025 at 08:32:31AM +0000, Bruno Gravato wrote:
> In my tests I was using real data: a backup of my files.
> 
> On one such test I copied over 300K files of varying sizes and types,
> totalling about 60GB. A bit over 20 files got corrupted.
> I tried copying the files over the network (ethernet) using rsync/ssh.
> I also tried restoring the files using restic (over ssh as well). And
> I also tried copying the files locally from a SATA disk. In all cases
> I got similar results with some files being corrupted.
> The destination nvme disk was using btrfs and running btrfs scrub
> after the copy detects quite a few checksum errors.

So you used various different data sources, and the destination was
always the nvme device in the suspect slot.

> I analyzed some of those corrupted files and one of them happened to
> be a text file (linux kernel source code).
> A big portion of the text was replaced with text from another file in
> the same directory (being text made it easy to find where it came
> from).
> So this was a contiguous block of text that was overwritten with a
> contiguous block of text from another file.
> If I remember correctly the other file was not corrupted (so the
> blocks weren't swapped). It looked like a certain block of text was
> written twice: on the correct file and on another file in the same
> directory.

That's a very interesting pattern.
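
To pin the pattern down more precisely, a minimal sketch along these
lines could help, assuming a known-good copy of each corrupted file is
still available on the source side. It reports which contiguous byte
ranges differ and checks whether the bad data also occurs in other files
from the same directory. The function names and the 4096-byte comparison
granularity are illustrative assumptions, not something established in
this thread.

```python
BLOCK = 4096  # assumed comparison granularity (hypothetical, not from the thread)


def mismatched_ranges(good: bytes, bad: bytes, block: int = BLOCK):
    """Return (start, end) byte offsets of contiguous runs of differing blocks."""
    ranges = []
    n = max(len(good), len(bad))
    start = None
    for off in range(0, n, block):
        same = good[off:off + block] == bad[off:off + block]
        if not same and start is None:
            start = off                    # a corrupted run begins here
        elif same and start is not None:
            ranges.append((start, off))    # the run ended at this block
            start = None
    if start is not None:
        ranges.append((start, n))          # run extends to end of file
    return ranges


def find_source(bad: bytes, span, candidates):
    """Return names of candidate files (name -> bytes) containing the bad span."""
    start, end = span
    needle = bad[start:end]
    return [name for name, data in candidates.items()
            if needle and needle in data]
```

Feeding it the good copy, the corrupted copy, and the contents of the
neighbouring files (for example read with pathlib's Path.read_bytes())
would give the size and offset of each overwritten range and the file
the stray data belongs to, if any.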

> I also got some jpeg images corrupted. I was able to open and view
> (partially) those images and it looked like a portion of the image was
> repeated in a different part of it, so blocks of the same file were
> probably duplicated and overwritten within itself.
> 
> The blocks being overwritten seemed to be different sizes on different files.

This does sound like a fairly common pattern caused by SSD FTL issues,
but I still don't want to rule out swiotlb.  Its bucketing could maybe
also lead to this kind of corruption, although I can't really see how.
And the fact that the affected systems seem to be using swiotlb with
no good reason to do so still leaves me puzzled.

