Message-ID: <49da4d80-5cc3-35c3-ccaa-6def8165eb65@huawei.com>
Date: Mon, 31 Jan 2022 15:58:50 +0000
From: John Garry <john.garry@...wei.com>
To: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
<jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
<artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
<chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC: <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
<linuxarm@...wei.com>, <liuqi115@...wei.com>,
<Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code
On 28/01/2022 09:09, John Garry wrote:
>> I ran some more tests. In particular, I ran libzbc compliance tests on a
>> 20TB SMR drive. All tests pass with 5.17-rc1, but after applying your
>> series, I see command timeouts that take forever to recover from, with
>> the drive revalidation failing after that.
>>
>> [ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>> [ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
>> [ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
>> [ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
>> [ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted
>> [ 411.192602] ata21.00: qc timeout (cmd 0xec)
>> [ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
>> [ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>> [ 431.685544] ata21.00: revalidation failed (errno=-5)
>> [ 441.911948] ata21.00: qc timeout (cmd 0xec)
>> [ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
>> [ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>> [ 462.404992] ata21.00: revalidation failed (errno=-5)
>> [ 492.598769] ata21.00: qc timeout (cmd 0xec)
>> ...
>>
>> So there is a problem. Need to dig into this. I see this issue only with
>> libzbc passthrough tests. fio runs with libaio are fine.
>
> Thanks for the notice. I think that I also saw a hang but, IIRC, it
> happened on mainline for me - and it's hard to know whether I broke
> something if it is already broken in another way. That is why I wanted
> this card working properly...
Hi Damien,

From testing mainline, I can see a hang on my arm64 system for SAS
disks. I think the reason is that we don't properly finish some commands
in EH for pm8001:
- In EH, we attempt to abort the task via sas_scsi_find_task() ->
lldd_abort_task().
- The default return from pm8001_exec_internal_tmf_task() is
-TMF_RESP_FUNC_FAILED, so if the TMF does not execute properly we return
this negated value.
- sas_scsi_find_task() cannot handle -TMF_RESP_FUNC_FAILED and passes it
straight back to sas_eh_handle_sas_errors(), which, again, does not
handle -TMF_RESP_FUNC_FAILED. So we never progress to finish the
command (see the sketch below).
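
To make the mismatch concrete, here is a minimal stand-alone sketch -
not the real libsas code; eh_handles_abort_result() is a made-up
stand-in for the EH dispatch, and the TMF_RESP_FUNC_* values are as I
read them in include/scsi/libsas.h - showing why a consumer that only
knows the positive TMF response codes never acts on the negated value:

/* Illustrative only: a simplified stand-in for the EH code that
 * dispatches on the abort result. Not the real sas_scsi_find_task().
 */
#include <stdbool.h>
#include <stdio.h>

#define TMF_RESP_FUNC_COMPLETE	0x00
#define TMF_RESP_FUNC_FAILED	0x05
#define TMF_RESP_FUNC_SUCC	0x08

/* Returns true if the EH path recognises the code and can finish or
 * escalate the task; false means the value matches nothing and the
 * command is left hanging.
 */
static bool eh_handles_abort_result(int res)
{
	switch (res) {
	case TMF_RESP_FUNC_COMPLETE:	/* abort worked, finish the task */
	case TMF_RESP_FUNC_SUCC:	/* TMF ok, query the task further */
	case TMF_RESP_FUNC_FAILED:	/* recognised failure, escalate EH */
		return true;
	}
	return false;
}

int main(void)
{
	/* What pm8001 hands back today on a TMF timeout: -5, unrecognised */
	printf("%d handled: %d\n", -TMF_RESP_FUNC_FAILED,
	       eh_handles_abort_result(-TMF_RESP_FUNC_FAILED));
	/* What it should hand back: 0x5, which the EH path understands */
	printf("%d handled: %d\n", TMF_RESP_FUNC_FAILED,
	       eh_handles_abort_result(TMF_RESP_FUNC_FAILED));
	return 0;
}
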
This looks like the correct fix for mainline:
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -766,7 +766,7 @@ static int pm8001_exec_internal_tmf_task(struct domain_device *dev,
 				pm8001_dev, DS_OPERATIONAL);
 			wait_for_completion(&completion_setstate);
 		}
-		res = -TMF_RESP_FUNC_FAILED;
+		res = TMF_RESP_FUNC_FAILED;
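
For completeness, and purely as a hypothetical sketch of the convention
rather than pm8001's actual helper (all demo_* names are made up), the
internal-TMF path should report a timeout in the vocabulary libsas
understands, i.e. return a positive TMF_RESP_FUNC_* code on every exit:

/* Hypothetical sketch, not real driver code: every exit path returns a
 * positive TMF_RESP_FUNC_* code for the libsas caller to act on, never
 * an errno-style negation of one.
 */
#include <linux/completion.h>
#include <linux/types.h>
#include <scsi/libsas.h>

struct demo_tmf {
	struct completion done;	/* completed by the interrupt path */
	bool ok;		/* set if the firmware reported success */
};

static int demo_exec_internal_tmf(struct demo_tmf *tmf, unsigned long timeout)
{
	if (!wait_for_completion_timeout(&tmf->done, timeout)) {
		/* TMF timed out, as in the pm8001_exec_internal_task_abort
		 * messages in the log above.
		 */
		return TMF_RESP_FUNC_FAILED;
	}

	return tmf->ok ? TMF_RESP_FUNC_COMPLETE : TMF_RESP_FUNC_FAILED;
}
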
That one-character change is effectively the same as what I have in this
series in sas_execute_tmf().
However, your testing is with a SATA device, which I'll check further.
Thanks,
John