Message-ID: <a2de1656-b1ec-2fb7-caab-657e27dacb48@huawei.com>
Date: Fri, 28 Jan 2022 09:09:03 +0000
From: John Garry <john.garry@...wei.com>
To: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
<jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
<artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
<chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC: <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
<linuxarm@...wei.com>, <liuqi115@...wei.com>,
<Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code
On 28/01/2022 06:37, Damien Le Moal wrote:
Hi Damien,
>> However using this same adapter type on my arm64 system has error
>> handling kick in almost straight away - and the handling looks sane. A
>> silver lining, I suppose ..
> I ran some more tests. In particular, I ran libzbc compliance tests on a
> 20TB SMR drive. All tests pass with 5.17-rc1, but after applying your
> series, I see command timeouts that take forever to recover from, with
> the drive revalidation failing after that.
>
> [ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> [ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
> [ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
> [ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
> [ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted
> [ 411.192602] ata21.00: qc timeout (cmd 0xec)
> [ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
> [ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [ 431.685544] ata21.00: revalidation failed (errno=-5)
> [ 441.911948] ata21.00: qc timeout (cmd 0xec)
> [ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
> [ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [ 462.404992] ata21.00: revalidation failed (errno=-5)
> [ 492.598769] ata21.00: qc timeout (cmd 0xec)
> ...
>
> So there is a problem. Need to dig into this. I see this issue only with
> libzbc passthrough tests. fio runs with libaio are fine.
Thanks for the notice. I think that I also saw a hang but, IIRC, it
happened on mainline for me. It's hard to know whether I broke something
when it is already broken in another way. That is why I wanted this card
working properly...
Anyway, I will investigate more.
>
>>> And sparse/make C=1 complains about:
>>>
>>> drivers/scsi/libsas/sas_port.c:77:13: warning: context imbalance in
>>> 'sas_form_port' - different lock contexts for basic block
>> I think it's talking about the port->phy_list_lock usage - it prob
>> doesn't like segments where we fall out a loop with the lock held (which
>> was grabbed in the loop). Anyway it looks ok. Maybe we can improve this.
>>
>>> But I have not checked if it is something that your series touches.
>>>
>>> And there is a ton of complaints about __le32 use in the pm80xx code...
>>> I can try to have a look at these if you want, on top of your series.
>> I really need to get make C=1 working for me - it segfaults in any env I
>> have:(
> I now have a 12 patch series that fixes *all* the sparse warnings. Some
> of the fixes were trivial, but most of them are simply hard bugs with
> the handling of le32 struct field values. There is no way that this
> driver is working as-is on big-endian machines. Some calculations are
> actually done using cpu_to_le32() values !
Great, I'll have a look when you send them.
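To illustrate the class of bug described above (calculations done on
cpu_to_le32() values), here is a minimal userspace sketch. It is not taken
from the pm80xx code. The helpers be_cpu_to_le32() and be_le32_to_cpu() are
hypothetical stand-ins that mimic what the kernel helpers would do on a
big-endian CPU (a byte swap), and next_index_buggy()/next_index_fixed() are
made-up names for the two patterns.

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins for cpu_to_le32()/le32_to_cpu() as they would
 * behave on a big-endian CPU: the conversion is a byte swap. On
 * little-endian x86_64 the real helpers are no-ops, which is why this
 * kind of bug stays hidden there.
 */
static uint32_t be_cpu_to_le32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

static uint32_t be_le32_to_cpu(uint32_t v)
{
	return be_cpu_to_le32(v); /* a byte swap is its own inverse */
}

/* Buggy pattern: arithmetic on the already-converted (swapped) value. */
static uint32_t next_index_buggy(uint32_t le_index)
{
	return le_index + 1;
}

/* Correct pattern: convert to CPU order, compute, then convert back. */
static uint32_t next_index_fixed(uint32_t le_index)
{
	return be_cpu_to_le32(be_le32_to_cpu(le_index) + 1);
}
```

With these stand-ins, starting from a stored index of 255, the buggy path
decodes to 16777471 while the fixed path decodes to 256. This is the sort
of mistake sparse flags through the __le32 annotations.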
>
> But even though these fixes should have essentially no effect on
> little-endian x86_64, with my series applied, I see the same command
> timeout problem as with your libsas update, and both series together
> result in the same timeout issue too.
>
> So it looks like "fixing" the code actually is revealing some other bug
> that was previously hidden... This will take some time to debug.
>
> Another problem I noticed: doing "rmmod pm80xx; modprobe pm80xx" results
> in a failure of device scans. I get loops of "link is slow to respond
> ->reset". For the above tests, I had to reboot every time I changed the
> driver module code. Another thing to look at.
Sounds odd. I would expect everything to start afresh on insmod.
Thanks,
John