lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <CABXGCsMcxu3pCF8jYPeqF_jN34saBwc8Fci+c-Dg2Lh7rqvuFQ@mail.gmail.com>
Date: Wed, 15 Jan 2025 02:58:04 +0500
From: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To: Christoph Hellwig <hch@....de>, axboe@...nel.dk, linux-nvme@...ts.infradead.org, 
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>, 
	Linux regressions mailing list <regressions@...ts.linux.dev>
Subject: 6.13/regression/bisected - new nvme timeout errors

Hi,
During 6.13 development cycle I spotted strange new nvme errors in the
log which I never seen before.

[87774.010474] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1
timeout, aborting req_op:WRITE(1) size:131072
[87774.010502] nvme nvme1: I/O tag 1 (d001) opcode 0x1 (I/O Cmd) QID 1
timeout, aborting req_op:WRITE(1) size:131072
[87774.010507] nvme nvme1: I/O tag 2 (4002) opcode 0x1 (I/O Cmd) QID 1
timeout, aborting req_op:WRITE(1) size:131072
[87774.010513] nvme nvme1: I/O tag 3 (e003) opcode 0x1 (I/O Cmd) QID 1
timeout, aborting req_op:WRITE(1) size:131072
[87774.010525] nvme nvme1: Abort status: 0x0
[87774.010535] nvme nvme1: I/O tag 21 (a015) opcode 0x1 (I/O Cmd) QID
1 timeout, aborting req_op:WRITE(1) size:131072
[87774.010535] nvme nvme1: Abort status: 0x0
[87774.010538] nvme nvme1: I/O tag 22 (3016) opcode 0x1 (I/O Cmd) QID
1 timeout, aborting req_op:WRITE(1) size:32768
[87774.010539] nvme nvme1: Abort status: 0x0
[87774.010542] nvme nvme1: Abort status: 0x0
[87774.010545] nvme nvme1: Abort status: 0x0
[87774.010548] nvme nvme1: I/O tag 384 (7180) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010553] nvme nvme1: I/O tag 385 (b181) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010554] nvme nvme1: Abort status: 0x0
[87774.010557] nvme nvme1: I/O tag 386 (8182) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010561] nvme nvme1: I/O tag 387 (5183) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010562] nvme nvme1: Abort status: 0x0
[87774.010564] nvme nvme1: I/O tag 388 (6184) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010570] nvme nvme1: Abort status: 0x0
[87774.010573] nvme nvme1: I/O tag 393 (5189) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010578] nvme nvme1: Abort status: 0x0
[87774.010581] nvme nvme1: I/O tag 398 (718e) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010586] nvme nvme1: Abort status: 0x0
[87774.010589] nvme nvme1: I/O tag 403 (a193) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:61440
[87774.010593] nvme nvme1: Abort status: 0x0
[87774.010595] nvme nvme1: I/O tag 406 (0196) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010601] nvme nvme1: Abort status: 0x0
[87774.010604] nvme nvme1: I/O tag 411 (419b) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:40960
[87774.010608] nvme nvme1: Abort status: 0x0
[87774.010612] nvme nvme1: I/O tag 416 (91a0) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010616] nvme nvme1: Abort status: 0x0
[87774.010619] nvme nvme1: I/O tag 420 (f1a4) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010624] nvme nvme1: Abort status: 0x0
[87774.010627] nvme nvme1: I/O tag 425 (51a9) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010632] nvme nvme1: Abort status: 0x0
[87774.010634] nvme nvme1: I/O tag 429 (11ad) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:98304
[87774.010639] nvme nvme1: Abort status: 0x0
[87774.010641] nvme nvme1: I/O tag 433 (11b1) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010647] nvme nvme1: Abort status: 0x0
[87774.010650] nvme nvme1: I/O tag 438 (c1b6) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010654] nvme nvme1: Abort status: 0x0
[87774.010657] nvme nvme1: I/O tag 442 (91ba) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010662] nvme nvme1: Abort status: 0x0
[87774.010664] nvme nvme1: I/O tag 446 (d1be) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:32768
[87774.010670] nvme nvme1: Abort status: 0x0
[87774.010674] nvme nvme1: I/O tag 452 (f1c4) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010678] nvme nvme1: Abort status: 0x0
[87774.010681] nvme nvme1: I/O tag 456 (a1c8) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010686] nvme nvme1: Abort status: 0x0
[87774.010689] nvme nvme1: I/O tag 461 (31cd) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:90112
[87774.010694] nvme nvme1: Abort status: 0x0
[87774.010697] nvme nvme1: I/O tag 466 (51d2) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:8192
[87774.010702] nvme nvme1: Abort status: 0x0
[87774.010704] nvme nvme1: I/O tag 470 (e1d6) opcode 0x1 (I/O Cmd) QID
17 timeout, aborting req_op:WRITE(1) size:131072
[87774.010709] nvme nvme1: Abort status: 0x0
[87774.010717] nvme nvme1: Abort status: 0x0
[87774.010725] nvme nvme1: Abort status: 0x0
[87774.010732] nvme nvme1: Abort status: 0x0
[87774.034477] nvme nvme1: I/O tag 1 (a001) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.034487] nvme nvme1: I/O tag 2 (0002) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.034492] nvme nvme1: I/O tag 3 (d003) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.034496] nvme nvme1: I/O tag 4 (b004) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:90112
[87774.034512] nvme nvme1: Abort status: 0x0
[87774.034524] nvme nvme1: Abort status: 0x0
[87774.034533] nvme nvme1: Abort status: 0x0
[87774.034537] nvme nvme1: Abort status: 0x0
[87774.051474] nvme nvme1: I/O tag 0 (1000) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:81920
[87774.051484] nvme nvme1: I/O tag 5 (e005) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051489] nvme nvme1: I/O tag 6 (a006) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051493] nvme nvme1: I/O tag 7 (7007) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051496] nvme nvme1: Abort status: 0x0
[87774.051514] nvme nvme1: I/O tag 25 (a019) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051518] nvme nvme1: Abort status: 0x0
[87774.051523] nvme nvme1: Abort status: 0x0
[87774.051523] nvme nvme1: I/O tag 31 (601f) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051527] nvme nvme1: Abort status: 0x0
[87774.051528] nvme nvme1: I/O tag 32 (c020) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051532] nvme nvme1: I/O tag 33 (4021) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051535] nvme nvme1: Abort status: 0x0
[87774.051539] nvme nvme1: I/O tag 37 (4025) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051540] nvme nvme1: Abort status: 0x0
[87774.051544] nvme nvme1: I/O tag 38 (0026) opcode 0x1 (I/O Cmd) QID
29 timeout, aborting req_op:WRITE(1) size:131072
[87774.051550] nvme nvme1: Abort status: 0x0
[87774.051559] nvme nvme1: Abort status: 0x0
[87774.051559] nvme nvme1: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051562] nvme nvme1: Abort status: 0x0
[87774.051564] nvme nvme1: I/O tag 513 (a201) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051568] nvme nvme1: I/O tag 514 (4202) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051570] nvme nvme1: Abort status: 0x0
[87774.051573] nvme nvme1: I/O tag 516 (d204) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051576] nvme nvme1: Abort status: 0x0
[87774.051579] nvme nvme1: I/O tag 520 (b208) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051583] nvme nvme1: Abort status: 0x0
[87774.051586] nvme nvme1: I/O tag 524 (f20c) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051592] nvme nvme1: Abort status: 0x0
[87774.051594] nvme nvme1: I/O tag 529 (f211) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051599] nvme nvme1: Abort status: 0x0
[87774.051601] nvme nvme1: I/O tag 533 (6215) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051607] nvme nvme1: Abort status: 0x0
[87774.051609] nvme nvme1: I/O tag 538 (121a) opcode 0x1 (I/O Cmd) QID
32 timeout, aborting req_op:WRITE(1) size:131072
[87774.051615] nvme nvme1: Abort status: 0x0
[87774.051628] nvme nvme1: Abort status: 0x0
[87774.051632] nvme nvme1: Abort status: 0x0
[87774.051642] nvme nvme1: Abort status: 0x0
[87775.490533] nvme nvme1: I/O tag 542 (021e) opcode 0x2 (I/O Cmd) QID
32 timeout, aborting req_op:READ(0) size:16384
[87775.490568] nvme nvme1: Abort status: 0x0
[87778.435607] nvme nvme1: I/O tag 543 (e21f) opcode 0x9 (I/O Cmd) QID
32 timeout, aborting req_op:DISCARD(3) size:45056
[87778.435643] nvme nvme1: Abort status: 0x0
[87801.658089] nvme nvme1: I/O tag 23 (7017) opcode 0x2 (I/O Cmd) QID
1 timeout, aborting req_op:READ(0) size:16384
[87801.658159] nvme nvme1: Abort status: 0x0
[87804.019139] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1
timeout, reset controller
[87804.064993] nvme nvme1: 32/0/0 default/read/poll queues

I still haven't found a stable way to reproduce this.
But I'm pretty sure that if this error don't appearing within two
days, then we can assume that the kernel isn't affected by the
problem.
So I made bisection with above assumption and found this commit:

beadf0088501d9dcf2454b05d90d5d31ea3ba55f is the first bad commit
commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f
Author: Christoph Hellwig <hch@....de>
Date:   Wed Nov 13 16:20:41 2024 +0100

    nvme-pci: reverse request order in nvme_queue_rqs

    blk_mq_flush_plug_list submits requests in the reverse order that they
    were submitted, which leads to a rather suboptimal I/O pattern especially
    in rotational devices.  Fix this by rewriting nvme_queue_rqs so that it
    always pops the requests from the passed in request list, and then adds
    them to the head of a local submit list.  This actually simplifies the
    code a bit as it removes the complicated list splicing, at the cost of
    extra updates of the rq_next pointer.  As that should be cache hot
    anyway it should be an easy price to pay.

    Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
    Signed-off-by: Christoph Hellwig <hch@....de>
    Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@...nel.dk>

 drivers/nvme/host/pci.c | 39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

 Christoph can you look into it, please?

My machine specs: https://linux-hardware.org/?probe=a4003b140c
Hiding under nvme1 is INTEL SSDPF2NV307TZ (ACV10340)

# nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning : 0
temperature : 98 °F (310 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
Data Units Read : 694928083 (355.80 TB)
Data Units Written : 861761625 (441.22 TB)
host_read_commands : 6389380673
host_write_commands : 11970151442
controller_busy_time : 147
power_cycles : 474
power_on_hours : 17268
unsafe_shutdowns : 307
media_errors : 0
num_err_log_entries : 111
Warning Temperature Time : 566
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0

I attached below build config and full kernel log.

-- 
Best Regards,
Mike Gavrilov.

Download attachment ".config.zip" of type "application/zip" (67405 bytes)

Download attachment "dmesg-6.13.0-rc6-b62cef9a5c67.zip" of type "application/zip" (59451 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ