linux-kernel - Periodic locking IO is causing server to stop responding


Message-ID: <01046937-1027-c88c-a0de-6cbdb28132ca@iewc.co.za>
Date:   Mon, 29 Mar 2021 08:59:17 +0200
From:   Ian Coetzee <ian@...c.co.za>
To:     linux-kernel@...r.kernel.org
Subject: Periodic locking IO is causing server to stop responding

Hi All,

We have run into a problem on one of our servers, and I am hoping you
can help narrow down the cause.

One of our servers locks up every so often, seemingly because of a disk
IO stall. Symptoms include a high load average (106), with the CPU
spending around 97-99% of its time waiting on the disk. When this
occurs, any new ssh session is met with a connection timeout.
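
For anyone wanting to reproduce the iowait figure, here is a minimal
sketch of how it can be derived from two samples of the aggregate "cpu"
line in /proc/stat. The helper name and the sample values are
hypothetical and are not measurements from this server.

```python
def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat 'cpu' lines."""
    def counters(line: str) -> list[int]:
        # Drop the leading "cpu" label and keep the numeric tick counters.
        return [int(x) for x in line.split()[1:]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # guest time is already counted in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


# Made-up samples taken some interval apart (illustrative values only).
sample_before = "cpu 1000 0 500 2000 10000 0 50 0 0 0"
sample_after = "cpu 1010 0 510 2010 10970 0 50 0 0 0"
print(f"iowait: {iowait_percent(sample_before, sample_after):.1f}%")
```

On a live system the two strings would come from reading the first line
of /proc/stat twice, a few seconds apart.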

Kernel version: 5.10.24-uls #1 SMP Fri Mar 19 11:31:52 SAST 2021 x86_64 
Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz GenuineIntel GNU/Linux

The following log entries appeared in dmesg around the time the most
recent lockup started.

> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), 
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical 
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, 
> assuming scmd(0x0000000029a7ef73) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS 
> scmd(0x0000000029a7ef73)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task 
> abort!scmd(0x00000000de97f273), outstanding for 184620 ms & timeout 
> 180000 ms
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1993 CDB: Write(16) 
> 8a 00 00 00 00 00 34 d5 2c 00 00 00 04 00 00 00
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), 
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical 
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, 
> assuming scmd(0x00000000de97f273) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS 
> scmd(0x00000000de97f273)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task 
> abort!scmd(0x00000000e4cfbc75), outstanding for 184600 ms & timeout 
> 180000 ms
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1987 CDB: Write(16) 
> 8a 00 00 00 00 00 34 d5 96 a8 00 00 01 58 00 00
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), 
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical 
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, 
> assuming scmd(0x00000000e4cfbc75) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS 
> scmd(0x00000000e4cfbc75)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task 
> abort!scmd(0x000000002282f27d), outstanding for 184620 ms & timeout 
> 180000 ms
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1986 CDB: Write(16) 
> 8a 00 00 00 00 00 34 d5 3c 00 00 00 04 00 00 00
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012), 
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical 
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver, 
> assuming scmd(0x000000002282f27d) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS 
> scmd(0x000000002282f27d)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: device_unblock and setting to 
> running, handle(0x0012)
> [Tue Mar 23 06:09:33 2021] sd 0:0:8:0: Power-on or device reset occurred
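
In case it helps with reading the CDBs above, here is a small sketch
that decodes them, assuming the standard WRITE(16) layout (opcode 0x8a,
8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes
10-13). The function name is hypothetical, and the logical block size
of sdi is not shown in the log, so lengths are left in blocks.

```python
import struct


def decode_write16(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, transfer_length_in_blocks) for a WRITE(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    # Bytes 2-9: logical block address; bytes 10-13: transfer length.
    lba, length = struct.unpack(">QI", cdb[2:14])
    return lba, length


# The three timed-out writes from the dmesg excerpt.
cdbs = [
    "8a 00 00 00 00 00 34 d5 2c 00 00 00 04 00 00 00",
    "8a 00 00 00 00 00 34 d5 96 a8 00 00 01 58 00 00",
    "8a 00 00 00 00 00 34 d5 3c 00 00 00 04 00 00 00",
]
for cdb in cdbs:
    lba, length = decode_write16(cdb)
    print(f"LBA {lba:#x}, {length} blocks")
```

All three commands target nearby LBAs on the same device, each
outstanding for a little over the 180000 ms timeout before the abort.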

We are running a bank of drives in software RAID, all attached to the
following controller:

>            *-storage
>                 description: Serial Attached SCSI controller
>                 product: SAS3008 PCI-Express Fusion-MPT SAS-3
>                 vendor: Broadcom / LSI
>                 physical id: 0
>                 bus info: pci@...0:01:00.0
>                 logical name: scsi0
>                 version: 02
>                 width: 64 bits
>                 clock: 33MHz
>                 capabilities: storage pm pciexpress vpd msi msix 
> bus_master cap_list rom
>                 configuration: driver=mpt3sas latency=0
>                 resources: irq:24 ioport:e000(size=256) 
> memory:fb200000-fb20ffff memory:fb100000-fb1fffff

So far we have not seen this on any of our other servers running the
same kernel version.

Please let me know if I can provide any more information. We have since
restarted the server.

Kind regards
Ian Coetzee