linux-kernel - Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code


Message-ID: <a2de1656-b1ec-2fb7-caab-657e27dacb48@huawei.com>
Date:   Fri, 28 Jan 2022 09:09:03 +0000
From:   John Garry <john.garry@...wei.com>
To:     Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
        <jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
        <artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
        <chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC:     <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
        <linuxarm@...wei.com>, <liuqi115@...wei.com>,
        <Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

On 28/01/2022 06:37, Damien Le Moal wrote:

Hi Damien,

>> However using this same adapter type on my arm64 system has error
>> handling kick in almost straight away - and the handling looks sane. A
>> silver lining, I suppose ..
> I ran some more tests. In particular, I ran libzbc compliance tests on a
> 20TB SMR drive. All tests pass with 5.17-rc1, but after applying your
> series, I see command timeouts that take forever to recover from, with
> the drive revalidation failing after that.
> 
> [  385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> [  385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
> [  405.561099] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task timeout.
> [  405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
> [  405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted
> [  411.192602] ata21.00: qc timeout (cmd 0xec)
> [  431.672122] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task timeout.
> [  431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [  431.685544] ata21.00: revalidation failed (errno=-5)
> [  441.911948] ata21.00: qc timeout (cmd 0xec)
> [  462.391545] pm80xx0:: pm8001_exec_internal_task_abort  757:TMF task timeout.
> [  462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> [  462.404992] ata21.00: revalidation failed (errno=-5)
> [  492.598769] ata21.00: qc timeout (cmd 0xec)
> ...
> 
> So there is a problem. Need to dig into this. I see this issue only with
> libzbc passthrough tests. fio runs with libaio are fine.

Thanks for the notice. I think that I also saw a hang, but, IIRC, it
happened on mainline for me - and it's hard to know if I broke something
when it is already broken in another way. That is why I wanted this card
working properly...

Anyway, I will investigate more.

> 
>>> And sparse/make C=1 complains about:
>>>
>>> drivers/scsi/libsas/sas_port.c:77:13: warning: context imbalance in
>>> 'sas_form_port' - different lock contexts for basic block
>> I think it's talking about the port->phy_list_lock usage - it prob
>> doesn't like segments where we fall out a loop with the lock held (which
>> was grabbed in the loop). Anyway it looks ok. Maybe we can improve this.
>>
>>> But I have not checked if it is something that your series touches.
>>>
>>> And there is a ton of complaints about __le32 use in the pm80xx code...
>>> I can try to have a look at these if you want, on top of your series.
>> I really need to get make C=1 working for me - it segfaults in any env I
>> have:(
> I now have a 12 patch series that fixes *all* the sparse warnings. Some
> of the fixes were trivial, but most of them are simply hard bugs with
> the handling of le32 struct field values. There is no way that this
> driver is working as-is on big-endian machines. Some calculations are
> actually done using cpu_to_le32() values !

Great, I'll have a look when you send them.
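
For readers following the le32 point above, here is a minimal userspace sketch of why doing arithmetic on an already-converted value breaks on big-endian machines. It is not taken from the pm80xx driver: be_cpu_to_le32(), be_le32_to_cpu(), next_index_buggy() and next_index_fixed() are hypothetical names, and the byte swap only imitates what cpu_to_le32() does on a big-endian CPU (on little-endian it is a no-op, which is why such bugs go unnoticed on x86_64).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_to_le32() as it behaves on a big-endian
 * CPU: a plain 32-bit byte swap. */
static uint32_t be_cpu_to_le32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* A byte swap is its own inverse, so the reverse conversion is the same. */
static uint32_t be_le32_to_cpu(uint32_t v)
{
	return be_cpu_to_le32(v);
}

/* Buggy pattern: arithmetic directly on the little-endian value. */
static uint32_t next_index_buggy(uint32_t le_index)
{
	return le_index + 1;
}

/* Correct pattern: convert to CPU order, compute, convert back. */
static uint32_t next_index_fixed(uint32_t le_index)
{
	return be_cpu_to_le32(be_le32_to_cpu(le_index) + 1);
}

/* Small self-check: incrementing 0xff must give 0x100 in CPU order. */
static void check_example(void)
{
	uint32_t le = be_cpu_to_le32(0xffu);

	assert(be_le32_to_cpu(next_index_fixed(le)) == 0x100u);
	assert(be_le32_to_cpu(next_index_buggy(le)) != 0x100u);
}
```

In kernel code, sparse (make C=1) flags the buggy pattern because __le32 is a distinct annotated type from u32, which is the class of warning discussed above.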

> 
> But even though these fixes should have essentially no effect on
> little-endian x86_64, with my series applied, I see the same command
> timeout problem as with your libsas update, and both series together
> result in the same timeout issue too.
> 
> So it looks like "fixing" the code actually is revealing some other bug
> that was previously hidden... This will take some time to debug.
> 
> Another problem I noticed: doing "rmmod pm80xx; modprobe pm80xx" results
> in a failure of device scans. I get loops of "link is slow to respond
> ->reset". For the above tests, I had to reboot every time I changed the
> driver module code. Another thing to look at.

Sounds odd. I would expect everything to start afresh on insmod.

Thanks,
John