[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <098f988e-1f12-c412-3111-60393dfe0f0b@huawei.com>
Date: Thu, 3 Feb 2022 15:55:22 +0000
From: John Garry <john.garry@...wei.com>
To: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
<jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
<artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
<chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC: <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
<linuxarm@...wei.com>, <liuqi115@...wei.com>,
<Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code
On 03/02/2022 09:44, Damien Le Moal wrote:
Hi Damien,
>>>> [ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>>>> [ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
>>>> [ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task
Contrary to mentioning TMF in the log, this is not a TMF but rather an
internal abort timing out. I don't think that this should ever happen.
This command should just abort pending IO commands in the controller and
not send anything to the target. So for this to timeout means a HW fault
or driver bug. And I did not touch this code for pm8001.
>>>> timeout.
>>>> [ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is
>>>> aborted
>>>> [ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is
>>>> aborted
>>>> [ 411.192602] ata21.00: qc timeout (cmd 0xec)
>>>> [ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task
>>>> timeout.
>>>> [ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>>>> [ 431.685544] ata21.00: revalidation failed (errno=-5)
>>>> [ 441.911948] ata21.00: qc timeout (cmd 0xec)
>>>> [ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task
>>>> timeout.
>>>> [ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
>>>> [ 462.404992] ata21.00: revalidation failed (errno=-5)
>>>> [ 492.598769] ata21.00: qc timeout (cmd 0xec)
>>>> ...
>>>>
Do you have a fuller dmesg with my series?
...
>> }
>> - res = -TMF_RESP_FUNC_FAILED;
>> + res = TMF_RESP_FUNC_FAILED;
>>
>> That's effectively the same as what I have in this series in
>> sas_execute_tmf().
>>
>> However your testing is a SATA device, which I'll check further.
> This did not help. Still seeing 100% reproducible hangs.
OK, but I think that we should also have this change as the mainline
codes looks broken to be begin with:
--->8 ---
[PATCH] scsi: libsas: Handle all errors in sas_scsi_find_task()
LLDD TMFs callbacks may return linux or other error codes instead of TMF
codes. This may cause problems in sas_scsi_find_task() ->
.lldd_query_task(), as only TMF codes are handled there. As such, we may
not return a task_disposition type. Function sas_eh_handle_sas_errors()
only handles that type, and may exit error
handling early for unrecognised types.
So use TASK_ABORT_FAILED for non-TMF types returned from
.lldd_query_task(), on the assumption that the command may still be
alive and error handling should be escalated.
Signed-off-by: John Garry <john.garry@...wei.com>
diff --git a/drivers/scsi/libsas/sas_scsi_host.c
b/drivers/scsi/libsas/sas_scsi_host.c
index 53d8b7ede0cd..02274f471308 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -316,8 +316,11 @@ static enum task_disposition
sas_scsi_find_task(struct sas_task *task)
pr_notice("%s: task 0x%p failed to abort\n",
__func__, task);
return TASK_ABORT_FAILED;
+ default:
+ pr_notice("%s: task 0x%p result code %d not handled, assuming
failed\n",
+ __func__, task, res);
+ return TASK_ABORT_FAILED;
}
-
}
}
return res;
---8< ----
>
> I did a lot of testing/digging today,
Thanks for the effort!
> and the hang cause seems to be
> missing task completions.
> At random, a task times out as its completion
That sounds fimilar to my general issue running this driver on an arm64
host...
> does not come, and subsequent abort trial for the task fail, revalidate
> fails
I assume SMP IOs fail if revalidation fails - if this is the case, then
the controller seems to be in bad state.
> and the device is dropped (capacity goes to 0). But at that point,
> doing rmmod/modprobe to reset the device does not work. sync cache
> command issued at rmmod time never completes. I end up needing to power
> cycle the machine every time...
>
> No clue about the root cause yet, but it definitely seem to be related
> to NCQ/high QD operation. If I force my tests to use non-NCQ commands,
> everything is fine and the tests run to completion without any issue.
>
> I wonder if their is a tag management bug somewhere...
Maybe. Not sure.
On a related point, Hannes' change here could avoid it:
https://lore.kernel.org/linux-scsi/20210222132405.91369-32-hare@suse.de/
>
> I did stumble on something very ugly in libsas too: sas_ata_qc_issue()
> drops and retake the ata port lock. No other ATA driver do that since
> the ata completion also take that lock. The ata port lock is taken
> before ata_qc_issue() is called with IRQ disabled (spin_lock_irqsave()).
> So doing a spin_unlock()/spin_lock() in sas_ata_qc_issue() (called from
> ata_qc_issue()) seems like a very bad idea. I removed that and
> everything work the same way (the lld execute does not sleep). But that
> did not solve the hang problem.
I would need to check why this is done again. Before my time...
>
> Of note is this is all with your libsas patches applied. Without the
> patches, I have KASAN screaming at me about use-after-free in completion
> context. With your patches, KASAN is silent.
>
> Another thing: this driver does not allow changing the max qd... Very
> annoying.
>
> echo 1 > /sys/block/sdX/device/queue_depth
>
> has no effect. QD stays at 32 for an ATA drive. Need to look into that too.
I had a look at this. It seems that we fail in
__ata_change_queue_depth() -> ata_scsi_find_dev() returning NULL.
Thanks again for your effort, I will continue to look.
john
Powered by blists - more mailing lists