Message-ID: <01046937-1027-c88c-a0de-6cbdb28132ca@iewc.co.za>
Date: Mon, 29 Mar 2021 08:59:17 +0200
From: Ian Coetzee <ian@...c.co.za>
To: linux-kernel@...r.kernel.org
Subject: Periodic locking IO is causing server to stop responding
Hi All,
We have run into a problem on one of our servers, and I am hoping you
can help narrow down the cause.
One of our servers locks up every so often, seemingly because of a disk
IO stall. Symptoms include a high load average (106), with the processor
spending around 97-99% of its time waiting on the disk. When this
occurs, any new ssh session is met with a connection timeout.
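For anyone wanting to capture the same figure on a similar box, the wait
percentage can be derived from two samples of the aggregate "cpu" line
in /proc/stat. This is only a minimal sketch: the function names are
hypothetical, and it assumes the usual field order (user, nice, system,
idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # user, nice, system, idle, iowait, irq, softirq, steal
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    # iowait is the fifth counter (index 4) on the line.
    return 100.0 * deltas[4] / total
```

Reading the first line of /proc/stat twice, a few seconds apart, and
passing both strings to iowait_percent() gives the figure for that
interval.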
Kernel version: 5.10.24-uls #1 SMP Fri Mar 19 11:31:52 SAST 2021 x86_64
Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz GenuineIntel GNU/Linux
The following log entries appeared in dmesg around the time the most
recent lockup started.
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012),
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver,
> assuming scmd(0x0000000029a7ef73) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS
> scmd(0x0000000029a7ef73)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task
> abort!scmd(0x00000000de97f273), outstanding for 184620 ms & timeout
> 180000 ms
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1993 CDB: Write(16)
> 8a 00 00 00 00 00 34 d5 2c 00 00 00 04 00 00 00
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012),
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver,
> assuming scmd(0x00000000de97f273) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS
> scmd(0x00000000de97f273)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task
> abort!scmd(0x00000000e4cfbc75), outstanding for 184600 ms & timeout
> 180000 ms
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1987 CDB: Write(16)
> 8a 00 00 00 00 00 34 d5 96 a8 00 00 01 58 00 00
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012),
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver,
> assuming scmd(0x00000000e4cfbc75) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS
> scmd(0x00000000e4cfbc75)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: attempting task
> abort!scmd(0x000000002282f27d), outstanding for 184620 ms & timeout
> 180000 ms
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: [sdi] tag#1986 CDB: Write(16)
> 8a 00 00 00 00 00 34 d5 3c 00 00 00 04 00 00 00
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: handle(0x0012),
> sas_address(0x500304800175f088), phy(8)
> [Tue Mar 23 06:09:32 2021] scsi target0:0:8: enclosure logical
> id(0x500304800175f0bf), slot(8)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: No reference found at driver,
> assuming scmd(0x000000002282f27d) might have completed
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: task abort: SUCCESS
> scmd(0x000000002282f27d)
> [Tue Mar 23 06:09:32 2021] sd 0:0:8:0: device_unblock and setting to
> running, handle(0x0012)
> [Tue Mar 23 06:09:33 2021] sd 0:0:8:0: Power-on or device reset occurred
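In case it helps when reading the log above, the Write(16) CDBs can be
decoded into a starting LBA and a transfer length. This is a minimal
sketch, assuming the standard Write(16) layout (opcode 0x8a, 8-byte
big-endian LBA in bytes 2-9, 4-byte big-endian transfer length in bytes
10-13). The function name decode_write16 is hypothetical.

```python
def decode_write16(cdb_hex):
    """Decode a Write(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks), both as integers.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("a Write(16) CDB is exactly 16 bytes long")
    if cdb[0] != 0x8A:
        raise ValueError("opcode is not Write(16) (0x8a)")
    lba = int.from_bytes(cdb[2:10], "big")      # starting logical block
    blocks = int.from_bytes(cdb[10:14], "big")  # transfer length in blocks
    return lba, blocks


# The three distinct CDBs from the dmesg output above.
CDBS = [
    "8a 00 00 00 00 00 34 d5 2c 00 00 00 04 00 00 00",
    "8a 00 00 00 00 00 34 d5 96 a8 00 00 01 58 00 00",
    "8a 00 00 00 00 00 34 d5 3c 00 00 00 04 00 00 00",
]

for cdb_text in CDBS:
    lba, blocks = decode_write16(cdb_text)
    print(f"LBA {lba:#x}, {blocks} blocks")
```

All three writes land in the same region of sdi (LBAs 0x34d52c00 to
0x34d596a8), which may or may not be significant.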
We are running a bank of drives on software RAID, all attached to this
controller:
> *-storage
> description: Serial Attached SCSI controller
> product: SAS3008 PCI-Express Fusion-MPT SAS-3
> vendor: Broadcom / LSI
> physical id: 0
> bus info: pci@...0:01:00.0
> logical name: scsi0
> version: 02
> width: 64 bits
> clock: 33MHz
> capabilities: storage pm pciexpress vpd msi msix
> bus_master cap_list rom
> configuration: driver=mpt3sas latency=0
> resources: irq:24 ioport:e000(size=256)
> memory:fb200000-fb20ffff memory:fb100000-fb1fffff
So far we have not seen this on any of our other servers running the
same kernel version.
Please let me know if I can provide any more information. We have since
restarted the server.
Kind regards
Ian Coetzee