Message-ID: <49da4d80-5cc3-35c3-ccaa-6def8165eb65@huawei.com>
Date: Mon, 31 Jan 2022 15:58:50 +0000
From: John Garry <john.garry@...wei.com>
To: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
<jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
<artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
<chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC: <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
<linuxarm@...wei.com>, <liuqi115@...wei.com>,
<Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code
On 28/01/2022 09:09, John Garry wrote:
>> I ran some more tests. In particular, I ran libzbc compliance tests on a
>> 20TB SMR drive. All tests pass with 5.17-rc1, but after applying your
>> series, I see command timeouts that take forever to recover from, with
>> the drive revalidation failing after that.
>>
>> [ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>> [ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
>> [ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
>> [ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
>> [ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted
>> [ 411.192602] ata21.00: qc timeout (cmd 0xec)
>> [ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
>> [ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>> [ 431.685544] ata21.00: revalidation failed (errno=-5)
>> [ 441.911948] ata21.00: qc timeout (cmd 0xec)
>> [ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
>> [ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>> [ 462.404992] ata21.00: revalidation failed (errno=-5)
>> [ 492.598769] ata21.00: qc timeout (cmd 0xec)
>> ...
>>
>> So there is a problem. Need to dig into this. I see this issue only with
>> libzbc passthrough tests. fio runs with libaio are fine.
>
> Thanks for the notice. I think that I also saw a hang but, IIRC, it
> happened on mainline for me - and it's hard to know whether I broke
> something if it is already broken in another way. That is why I wanted
> this card working properly...
Hi Damien,

From testing mainline, I can see a hang on my arm64 system for SAS
disks. I think the reason is that we don't properly finish some commands
in EH for pm8001:
- In EH, we attempt to abort the task via sas_scsi_find_task() ->
lldd_abort_task().
- The default return from pm8001_exec_internal_tmf_task() is
-TMF_RESP_FUNC_FAILED, so if the TMF does not execute properly we return
this negated value.
- sas_scsi_find_task() cannot handle -TMF_RESP_FUNC_FAILED and passes it
straight back to sas_eh_handle_sas_errors(), which, again, does not
handle -TMF_RESP_FUNC_FAILED. So we never progress to finish the
command (see the sketch below).
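
To make the mismatch concrete, here is a minimal stand-alone sketch -
not the real libsas code; eh_handles_abort_result() is a made-up
stand-in for the EH dispatch, and the TMF_RESP_FUNC_* values are as I
read them in include/scsi/libsas.h - showing why a consumer that only
knows the positive TMF response codes never acts on the negated value:

/* Illustrative only: a simplified stand-in for the EH code that
 * dispatches on the abort result. Not the real sas_scsi_find_task().
 */
#include <stdbool.h>
#include <stdio.h>

#define TMF_RESP_FUNC_COMPLETE	0x00
#define TMF_RESP_FUNC_FAILED	0x05
#define TMF_RESP_FUNC_SUCC	0x08

/* Returns true if the EH path recognises the code and can finish or
 * escalate the task; false means the value matches nothing and the
 * command is left hanging.
 */
static bool eh_handles_abort_result(int res)
{
	switch (res) {
	case TMF_RESP_FUNC_COMPLETE:	/* abort worked, finish the task */
	case TMF_RESP_FUNC_SUCC:	/* TMF ok, query the task further */
	case TMF_RESP_FUNC_FAILED:	/* recognised failure, escalate EH */
		return true;
	}
	return false;
}

int main(void)
{
	/* What pm8001 hands back today on a TMF timeout: -5, unrecognised */
	printf("%d handled: %d\n", -TMF_RESP_FUNC_FAILED,
	       eh_handles_abort_result(-TMF_RESP_FUNC_FAILED));
	/* What it should hand back: 0x5, which the EH path understands */
	printf("%d handled: %d\n", TMF_RESP_FUNC_FAILED,
	       eh_handles_abort_result(TMF_RESP_FUNC_FAILED));
	return 0;
}
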
This looks like the correct fix for mainline:
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -766,7 +766,7 @@ static int pm8001_exec_internal_tmf_task(struct domain_device *dev,
 				pm8001_dev, DS_OPERATIONAL);
 			wait_for_completion(&completion_setstate);
 		}
-		res = -TMF_RESP_FUNC_FAILED;
+		res = TMF_RESP_FUNC_FAILED;
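
For completeness, and purely as a hypothetical sketch of the convention
rather than pm8001's actual helper (all demo_* names are made up), the
internal-TMF path should report a timeout in the vocabulary libsas
understands, i.e. return a positive TMF_RESP_FUNC_* code on every exit:

/* Hypothetical sketch, not real driver code: every exit path returns a
 * positive TMF_RESP_FUNC_* code for the libsas caller to act on, never
 * an errno-style negation of one.
 */
#include <linux/completion.h>
#include <linux/types.h>
#include <scsi/libsas.h>

struct demo_tmf {
	struct completion done;	/* completed by the interrupt path */
	bool ok;		/* set if the firmware reported success */
};

static int demo_exec_internal_tmf(struct demo_tmf *tmf, unsigned long timeout)
{
	if (!wait_for_completion_timeout(&tmf->done, timeout)) {
		/* TMF timed out, as in the pm8001_exec_internal_task_abort
		 * messages in the log above.
		 */
		return TMF_RESP_FUNC_FAILED;
	}

	return tmf->ok ? TMF_RESP_FUNC_COMPLETE : TMF_RESP_FUNC_FAILED;
}
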
That one-character change is effectively the same as what I have in this
series in sas_execute_tmf().
However, your testing is with a SATA device, which I'll check further.
Thanks,
John