Message-ID: <20220219210354.GF59715@dread.disaster.area>
Date: Sun, 20 Feb 2022 08:03:54 +1100
From: Dave Chinner <david@...morbit.com>
To: Kyle Sanderson <kyle.leet@...il.com>
Cc: qat-linux@...el.com, giovanni.cabiddu@...el.com,
Linux-Kernal <linux-kernel@...r.kernel.org>,
linux-xfs@...r.kernel.org, linux-crypto@...r.kernel.org,
dm-devel@...hat.com, Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: Intel QAT on A2SDi-8C-HLN4F causes massive data corruption with
dm-crypt + xfs
On Fri, Feb 18, 2022 at 09:02:28PM -0800, Kyle Sanderson wrote:
> A2SDi-8C-HLN4F has IQAT enabled by default, when this device is
> attempted to be used by xfs (through dm-crypt) the entire kernel
> thread stalls forever. Multiple users have hit this over the years
> (through sporadic reporting) - I ended up trying ZFS and encryption
> wasn't an issue there at all because I guess they don't use this
> device. Returning to sanity (xfs), I was able to provision a dm-crypt
> volume no problem on the disk, however when running mkfs.xfs on the
> volume is what triggers the cascading failure (each request kills a
> kthread).
Can you provide the full stack traces for these errors so we can see
exactly what this cascading failure looks like, please? In reality,
the stall messages some time after this are not interesting - it's
the first errors that cause the stall that need to be investigated.
A good idea would be to provide the full storage stack description
and hardware in use, as per:
https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> Disabling IQAT on the south bridge results in a working
> system, however this is not the default configuration for the
> distribution of choice (Ubuntu 20.04.3 LTS), nor the motherboard. I'm
> convinced this never worked properly based on the lack of popularity
> for kernel encryption (crypto), and the embedded nature that
> SuperMicro has integrated this device in collaboration with intel as
> it looks like the primary usage is through external accelerator cards.
This really sounds like broken hardware, not a kernel problem.
> Kernels tried were from RHEL8 over a year ago, and this impacts the
> entirety of the 5.4 series on Ubuntu.
> Please CC me on replies as I'm not subscribed to all lists. CPU is C3758.
[snip stalled kcryptd worker threads]
This implies a dmcrypt level problem - XFS can't make progress if
dmcrypt is not completing IOs.
Where are the XFS corruption reports that the subject implies are
occurring?
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com