linux-kernel - Re: 6.13/regression/bisected

Message-ID: <Z8m9AbD3tjNpBt6p@kbusch-mbp>
Date: Thu, 6 Mar 2025 08:19:29 -0700
From: Keith Busch <kbusch@...nel.org>
To: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
Cc: Christoph Hellwig <hch@....de>, axboe@...nel.dk,
	linux-nvme@...ts.infradead.org,
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
	Linux regressions mailing list <regressions@...ts.linux.dev>
Subject: Re: 6.13/regression/bisected - new nvme timeout errors

On Wed, Jan 15, 2025 at 02:58:04AM +0500, Mikhail Gavrilov wrote:
> Hi,
> During 6.13 development cycle I spotted strange new nvme errors in the
> log which I never seen before.
> 
> [87774.010474] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:131072

...

> I still haven't found a stable way to reproduce this.
> But I'm pretty sure that if this error don't appearing within two
> days, then we can assume that the kernel isn't affected by the
> problem.
> So I made bisection with above assumption and found this commit:
> 
> beadf0088501d9dcf2454b05d90d5d31ea3ba55f is the first bad commit
> commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f
> Author: Christoph Hellwig <hch@....de>
> Date:   Wed Nov 13 16:20:41 2024 +0100
> 
>     nvme-pci: reverse request order in nvme_queue_rqs

The patch here uses the order received to dispatch commands in
consecutive submission queue entries, which is supposed to be the
desired behavior for any device. I did some testing on mainline, and it
sure looks like the order the driver does this in is optimal, so I'm not
sure what's going on with your observation.
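
For readers following along, here is a toy model of the ordering
question, not the driver code. It assumes the batch of requests reaches
the driver newest-first and that the driver pushes each one onto the
head of a local list, so the submission queue entries end up in the
order the requests were issued. The function name and request numbers
are made up for illustration.

```python
def reverse_onto_head(incoming):
    """Walk `incoming` front to back, pushing each item onto the head of a new list."""
    submit = []
    for rq in incoming:
        # Head insertion means the last item visited ends up first.
        submit.insert(0, rq)
    return submit


# Hypothetical requests, numbered in the order they were issued (1 first).
# Assumption: the batch is handed over newest-first.
plug_list = [3, 2, 1]
sq_entries = reverse_onto_head(plug_list)
print(sq_entries)  # [1, 2, 3]
```

If the assumption about the incoming order holds, consecutive queue
entries then carry the requests in issue order.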

Do you have a scheduler enabled on your device?
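
One way to check is the block device's queue/scheduler file in sysfs,
which lists the available schedulers with the active one in square
brackets. A minimal sketch follows; the helper names are made up, and
the namespace name "nvme1n1" is an assumption (the log only shows the
controller, nvme1).

```python
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed entry from a queue/scheduler line, or None."""
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None


def read_scheduler(disk):
    """Read the active scheduler for a block device such as 'nvme1n1' (assumed name)."""
    path = Path(f"/sys/block/{disk}/queue/scheduler")
    return active_scheduler(path.read_text())


# Example with a typical line; "none" is active here.
print(active_scheduler("[none] mq-deadline kyber bfq"))  # none
```

Calling read_scheduler("nvme1n1") on the affected machine would report
the scheduler in use, provided that is the right namespace name.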

How are you generating IO? Is it a pattern I should be able to replicate
with 'fio'?