Message-ID: <Z4BlZCWzu3OV3V2U@kbusch-mbp>
Date: Thu, 9 Jan 2025 17:10:12 -0700
From: Keith Busch <kbusch@...nel.org>
To: Christoph Hellwig <hch@....de>
Cc: Thorsten Leemhuis <regressions@...mhuis.info>,
	Adrian Huang <ahuang12@...ovo.com>,
	Linux kernel regressions list <regressions@...ts.linux.dev>,
	linux-nvme@...ts.infradead.org, Jens Axboe <axboe@...com>,
	"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [Regression] File corruptions on SSD in 1st M.2 socket of AsRock
 X600M-STX

On Thu, Jan 09, 2025 at 09:28:49AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 08, 2025 at 08:07:28AM -0700, Keith Busch wrote:
> > It should always be okay to do smaller transfers as long as everything
> > stays aligned to the logical block size. I'm guessing the dma opt change
> > has exposed some other flaw in the nvme controller. For example, two
> > consecutive smaller writes are hitting some controller side caching bug
> > that a single larger trasnfer would have handled correctly. The host
> > could have sent such a sequence even without the patch reverted, but
> > happens to not be doing that in this particular test.
> 
> Yes.  This somehow reminds me of the bug with an Intel SSD that got
> really upset with quickly following writes to different LBAs inside the
> same indirection unit.

Good old https://bugzilla.redhat.com/show_bug.cgi?id=1402533 ...

> But as the new smaller size is nicely aligned
> that seems unlikely.  Maybe the higher number of commands simply overloads
> the buggy firmware?

Maybe the higher size creates different splits that better straddle some
unreported internal boundary we don't know about. This all just points
to some probabilistic scenario that somehow happens more often with
a lower transfer limit.
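
To make the split argument concrete, here is a rough Python sketch of how
a transfer limit changes where a large request gets cut. Everything in it
is an assumption for illustration: the helper names, the 64-block internal
boundary, and the 256 vs 248 block limits are made up and are not taken
from the affected devices or from the kernel's actual splitting code.

```python
def split_request(start_lba, num_blocks, max_blocks):
    """Split one request into (lba, length) commands of at most max_blocks."""
    commands = []
    lba = start_lba
    remaining = num_blocks
    while remaining > 0:
        length = min(remaining, max_blocks)
        commands.append((lba, length))
        lba += length
        remaining -= length
    return commands


def unaligned_split_points(commands, boundary_blocks):
    """Count split points that land inside a (hypothetical) internal unit.

    A split point is the starting LBA of every command after the first.
    If it is not a multiple of boundary_blocks, two consecutive commands
    write into the same internal unit.
    """
    return sum(1 for lba, _ in commands[1:] if lba % boundary_blocks != 0)


# Hypothetical numbers: 1024-block request, 64-block internal boundary.
large_limit = split_request(0, 1024, 256)
small_limit = split_request(0, 1024, 248)

print(len(large_limit), unaligned_split_points(large_limit, 64))
print(len(small_limit), unaligned_split_points(small_limit, 64))
```

With these made-up numbers the slightly lower limit yields one more
command, and every one of its split points falls inside a 64-block unit,
while the larger limit happens to cut only on unit boundaries. Whether
anything like that is going on inside the drive is pure speculation.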

The bugzilla reports that disabling VWC makes the problem go away. That
may be a timing thing or a caching thing, but it suggests a kernel bug is
less likely (yay!?); not easy to tell so far. It's just concerning that
devices from multiple vendors are showing a similar observation, so maybe
these are not even the same root problem.
