Message-ID: <CABXGCsP9_SNBpGmqDvc3t1Qeb7Op5kt7mrfO=2QS_fGVKHtZpg@mail.gmail.com>
Date: Mon, 10 Feb 2025 13:47:59 +0500
From: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To: Christoph Hellwig <hch@....de>, axboe@...nel.dk, linux-nvme@...ts.infradead.org, 
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>, 
	Linux regressions mailing list <regressions@...ts.linux.dev>
Subject: Re: 6.13/regression/bisected - new nvme timeout errors

On Wed, Jan 15, 2025 at 2:58 AM Mikhail Gavrilov
<mikhail.v.gavrilov@...il.com> wrote:
>
> Hi,
> During the 6.13 development cycle I spotted strange new NVMe errors
> in the log which I had never seen before.
>
> [87774.010474] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1
> timeout, aborting req_op:WRITE(1) size:131072
> [87774.010502] nvme nvme1: I/O tag 1 (d001) opcode 0x1 (I/O Cmd) QID 1
> timeout, aborting req_op:WRITE(1) size:131072
> [87774.010507] nvme nvme1: I/O tag 2 (4002) opcode 0x1 (I/O Cmd) QID 1
> timeout, aborting req_op:WRITE(1) size:131072
> [87774.010513] nvme nvme1: I/O tag 3 (e003) opcode 0x1 (I/O Cmd) QID 1
> timeout, aborting req_op:WRITE(1) size:131072
> [87774.010525] nvme nvme1: Abort status: 0x0
> [87774.010535] nvme nvme1: I/O tag 21 (a015) opcode 0x1 (I/O Cmd) QID
> 1 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010535] nvme nvme1: Abort status: 0x0
> [87774.010538] nvme nvme1: I/O tag 22 (3016) opcode 0x1 (I/O Cmd) QID
> 1 timeout, aborting req_op:WRITE(1) size:32768
> [87774.010539] nvme nvme1: Abort status: 0x0
> [87774.010542] nvme nvme1: Abort status: 0x0
> [87774.010545] nvme nvme1: Abort status: 0x0
> [87774.010548] nvme nvme1: I/O tag 384 (7180) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010553] nvme nvme1: I/O tag 385 (b181) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010554] nvme nvme1: Abort status: 0x0
> [87774.010557] nvme nvme1: I/O tag 386 (8182) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010561] nvme nvme1: I/O tag 387 (5183) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010562] nvme nvme1: Abort status: 0x0
> [87774.010564] nvme nvme1: I/O tag 388 (6184) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010570] nvme nvme1: Abort status: 0x0
> [87774.010573] nvme nvme1: I/O tag 393 (5189) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010578] nvme nvme1: Abort status: 0x0
> [87774.010581] nvme nvme1: I/O tag 398 (718e) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010586] nvme nvme1: Abort status: 0x0
> [87774.010589] nvme nvme1: I/O tag 403 (a193) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:61440
> [87774.010593] nvme nvme1: Abort status: 0x0
> [87774.010595] nvme nvme1: I/O tag 406 (0196) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010601] nvme nvme1: Abort status: 0x0
> [87774.010604] nvme nvme1: I/O tag 411 (419b) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:40960
> [87774.010608] nvme nvme1: Abort status: 0x0
> [87774.010612] nvme nvme1: I/O tag 416 (91a0) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010616] nvme nvme1: Abort status: 0x0
> [87774.010619] nvme nvme1: I/O tag 420 (f1a4) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010624] nvme nvme1: Abort status: 0x0
> [87774.010627] nvme nvme1: I/O tag 425 (51a9) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010632] nvme nvme1: Abort status: 0x0
> [87774.010634] nvme nvme1: I/O tag 429 (11ad) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:98304
> [87774.010639] nvme nvme1: Abort status: 0x0
> [87774.010641] nvme nvme1: I/O tag 433 (11b1) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010647] nvme nvme1: Abort status: 0x0
> [87774.010650] nvme nvme1: I/O tag 438 (c1b6) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010654] nvme nvme1: Abort status: 0x0
> [87774.010657] nvme nvme1: I/O tag 442 (91ba) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010662] nvme nvme1: Abort status: 0x0
> [87774.010664] nvme nvme1: I/O tag 446 (d1be) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:32768
> [87774.010670] nvme nvme1: Abort status: 0x0
> [87774.010674] nvme nvme1: I/O tag 452 (f1c4) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010678] nvme nvme1: Abort status: 0x0
> [87774.010681] nvme nvme1: I/O tag 456 (a1c8) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010686] nvme nvme1: Abort status: 0x0
> [87774.010689] nvme nvme1: I/O tag 461 (31cd) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:90112
> [87774.010694] nvme nvme1: Abort status: 0x0
> [87774.010697] nvme nvme1: I/O tag 466 (51d2) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:8192
> [87774.010702] nvme nvme1: Abort status: 0x0
> [87774.010704] nvme nvme1: I/O tag 470 (e1d6) opcode 0x1 (I/O Cmd) QID
> 17 timeout, aborting req_op:WRITE(1) size:131072
> [87774.010709] nvme nvme1: Abort status: 0x0
> [87774.010717] nvme nvme1: Abort status: 0x0
> [87774.010725] nvme nvme1: Abort status: 0x0
> [87774.010732] nvme nvme1: Abort status: 0x0
> [87774.034477] nvme nvme1: I/O tag 1 (a001) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.034487] nvme nvme1: I/O tag 2 (0002) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.034492] nvme nvme1: I/O tag 3 (d003) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.034496] nvme nvme1: I/O tag 4 (b004) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:90112
> [87774.034512] nvme nvme1: Abort status: 0x0
> [87774.034524] nvme nvme1: Abort status: 0x0
> [87774.034533] nvme nvme1: Abort status: 0x0
> [87774.034537] nvme nvme1: Abort status: 0x0
> [87774.051474] nvme nvme1: I/O tag 0 (1000) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:81920
> [87774.051484] nvme nvme1: I/O tag 5 (e005) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051489] nvme nvme1: I/O tag 6 (a006) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051493] nvme nvme1: I/O tag 7 (7007) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051496] nvme nvme1: Abort status: 0x0
> [87774.051514] nvme nvme1: I/O tag 25 (a019) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051518] nvme nvme1: Abort status: 0x0
> [87774.051523] nvme nvme1: Abort status: 0x0
> [87774.051523] nvme nvme1: I/O tag 31 (601f) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051527] nvme nvme1: Abort status: 0x0
> [87774.051528] nvme nvme1: I/O tag 32 (c020) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051532] nvme nvme1: I/O tag 33 (4021) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051535] nvme nvme1: Abort status: 0x0
> [87774.051539] nvme nvme1: I/O tag 37 (4025) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051540] nvme nvme1: Abort status: 0x0
> [87774.051544] nvme nvme1: I/O tag 38 (0026) opcode 0x1 (I/O Cmd) QID
> 29 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051550] nvme nvme1: Abort status: 0x0
> [87774.051559] nvme nvme1: Abort status: 0x0
> [87774.051559] nvme nvme1: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051562] nvme nvme1: Abort status: 0x0
> [87774.051564] nvme nvme1: I/O tag 513 (a201) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051568] nvme nvme1: I/O tag 514 (4202) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051570] nvme nvme1: Abort status: 0x0
> [87774.051573] nvme nvme1: I/O tag 516 (d204) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051576] nvme nvme1: Abort status: 0x0
> [87774.051579] nvme nvme1: I/O tag 520 (b208) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051583] nvme nvme1: Abort status: 0x0
> [87774.051586] nvme nvme1: I/O tag 524 (f20c) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051592] nvme nvme1: Abort status: 0x0
> [87774.051594] nvme nvme1: I/O tag 529 (f211) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051599] nvme nvme1: Abort status: 0x0
> [87774.051601] nvme nvme1: I/O tag 533 (6215) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051607] nvme nvme1: Abort status: 0x0
> [87774.051609] nvme nvme1: I/O tag 538 (121a) opcode 0x1 (I/O Cmd) QID
> 32 timeout, aborting req_op:WRITE(1) size:131072
> [87774.051615] nvme nvme1: Abort status: 0x0
> [87774.051628] nvme nvme1: Abort status: 0x0
> [87774.051632] nvme nvme1: Abort status: 0x0
> [87774.051642] nvme nvme1: Abort status: 0x0
> [87775.490533] nvme nvme1: I/O tag 542 (021e) opcode 0x2 (I/O Cmd) QID
> 32 timeout, aborting req_op:READ(0) size:16384
> [87775.490568] nvme nvme1: Abort status: 0x0
> [87778.435607] nvme nvme1: I/O tag 543 (e21f) opcode 0x9 (I/O Cmd) QID
> 32 timeout, aborting req_op:DISCARD(3) size:45056
> [87778.435643] nvme nvme1: Abort status: 0x0
> [87801.658089] nvme nvme1: I/O tag 23 (7017) opcode 0x2 (I/O Cmd) QID
> 1 timeout, aborting req_op:READ(0) size:16384
> [87801.658159] nvme nvme1: Abort status: 0x0
> [87804.019139] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1
> timeout, reset controller
> [87804.064993] nvme nvme1: 32/0/0 default/read/poll queues
>
> I still haven't found a reliable way to reproduce this.
> But I'm pretty sure that if the error doesn't appear within two
> days, we can assume the kernel isn't affected by the problem.
> So I bisected under that assumption (a git-bisect sketch of the
> workflow follows the quoted message below) and found this commit:
>
> beadf0088501d9dcf2454b05d90d5d31ea3ba55f is the first bad commit
> commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f
> Author: Christoph Hellwig <hch@....de>
> Date:   Wed Nov 13 16:20:41 2024 +0100
>
>     nvme-pci: reverse request order in nvme_queue_rqs
>
>     blk_mq_flush_plug_list submits requests in the reverse order that they
>     were submitted, which leads to a rather suboptimal I/O pattern especially
>     in rotational devices.  Fix this by rewriting nvme_queue_rqs so that it
>     always pops the requests from the passed in request list, and then adds
>     them to the head of a local submit list.  This actually simplifies the
>     code a bit as it removes the complicated list splicing, at the cost of
>     extra updates of the rq_next pointer.  As that should be cache hot
>     anyway it should be an easy price to pay.
>
>     Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
>     Signed-off-by: Christoph Hellwig <hch@....de>
>     Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
>     Signed-off-by: Jens Axboe <axboe@...nel.dk>
>
>  drivers/nvme/host/pci.c | 39 +++++++++++++++++----------------------
>  1 file changed, 17 insertions(+), 22 deletions(-)
>
> Christoph, can you look into it, please?
>
> My machine specs: https://linux-hardware.org/?probe=a4003b140c
> The device behind nvme1 is an INTEL SSDPF2NV307TZ (ACV10340).
>
> # nvme smart-log /dev/nvme1
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning : 0
> temperature : 98 °F (310 K)
> available_spare : 100%
> available_spare_threshold : 10%
> percentage_used : 1%
> endurance group critical warning summary: 0
> Data Units Read : 694928083 (355.80 TB)
> Data Units Written : 861761625 (441.22 TB)
> host_read_commands : 6389380673
> host_write_commands : 11970151442
> controller_busy_time : 147
> power_cycles : 474
> power_on_hours : 17268
> unsafe_shutdowns : 307
> media_errors : 0
> num_err_log_entries : 111
> Warning Temperature Time : 566
> Critical Composite Temperature Time : 0
> Thermal Management T1 Trans Count : 0
> Thermal Management T2 Trans Count : 0
> Thermal Management T1 Total Time : 0
> Thermal Management T2 Total Time : 0
>
> I attached the build config and full kernel log below.
>
> --
> Best Regards,
> Mike Gavrilov.
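
[For reference, the bisection described in the quoted message maps onto
an ordinary git-bisect loop. This is only a sketch: the endpoints below
are assumptions for illustration (they are not stated in the report),
and each step needs the roughly two-day soak test before it can be
classified.]

# Endpoints assumed for illustration only.
git bisect start
git bisect bad  v6.13-rc1     # an assumed kernel that showed the timeouts
git bisect good v6.12         # an assumed kernel that ran clean
# For each kernel git checks out: build, boot, and run a normal
# workload for about two days, then classify it:
git bisect good               # no nvme timeouts within two days
# ...or "git bisect bad" if the timeout errors reappeared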
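
[To illustrate what the quoted commit message describes, here is a
minimal C sketch of the list-reversal pattern; the struct and function
names are simplified stand-ins, not the actual nvme_queue_rqs code from
drivers/nvme/host/pci.c.]

/*
 * blk_mq_flush_plug_list hands over requests in the reverse of their
 * submission order, so popping each request from the passed-in list
 * and pushing it onto the head of a local list restores the original
 * order before the driver submits them.
 */
struct req {
	struct req *next;
};

static struct req *restore_submit_order(struct req *in)
{
	struct req *submit = NULL;	/* local submit list */

	while (in) {
		struct req *rq = in;

		in = rq->next;		/* pop from the passed-in list */
		rq->next = submit;	/* push onto the local head */
		submit = rq;
	}
	return submit;			/* original submission order */
}

[The extra rq->next store per request is the cost the commit message
mentions; since each request was just touched, that pointer should be
cache hot.]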

Hi again.
So no one has looked at this?
Unfortunately, this issue is still happening in 6.14-rc1 :(
I attached a fresh kernel log here.

-- 
Best Regards,
Mike Gavrilov.

Download attachment "dmesg-6.14.0-rc1-8f6629c004b1.zip" of type "application/zip" (55213 bytes)
