linux-kernel - Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e8af9d55-bf36-faa0-defb-3a9a4931826e@huawei.com>
Date:   Fri, 4 Feb 2022 10:36:41 +0000
From:   John Garry <john.garry@...wei.com>
To:     Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
        <jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
        <artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
        <chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC:     <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
        <linuxarm@...wei.com>, <liuqi115@...wei.com>,
        <Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

On 04/02/2022 03:02, Damien Le Moal wrote:
>> The inconsistency is this line says:
>> [  137.193944] pm80xx0:: pm80xx_chip_sata_req  4581:no data
>> Which seems to be sensical for NCQ_NON_DATA command, but then, this line
>> seems wrong:
>> [  137.228015] pm80xx0:: mpi_sata_completion  2515:FPDMA len = 8
>>
>> I need to go and check the specs what the FIS reply format is for
>> NCQ_NON_DATA.
>>
>>
>> [  137.187184] pm80xx0:: pm8001_queue_command  408:pm8001_task_exec device
>> [  137.193944] pm80xx0:: pm80xx_chip_sata_req  4581:no data
>> [  137.199339] pm80xx0:: pm80xx_chip_sata_req  4682:Sending Normal SATA
>> command 0x63 inb 4
>> [  137.207577] pm80xx0:: pm8001_mpi_msg_consume  1446:: CI=46 PI=47
>> msgHeader=8104200d
>> [  137.215399] pm80xx0:: mpi_sata_completion  2481:IO_SUCCESS
>> [  137.220961] pm80xx0:: mpi_sata_completion  2503:SAS_PROTO_RESPONSE
>> len = 20
>> [  137.228015] pm80xx0:: mpi_sata_completion  2515:FPDMA len = 8
>> [  137.233878] pm80xx0:: pm8001_mpi_msg_free_set  1403: CI=47 PI=47
>> [  137.236696] pm80xx0:: pm8001_queue_command  408:pm8001_task_exec device
>> [  137.247102] pm80xx0:: pm80xx_chip_sata_req  4585:DMA
>> [  137.252186] pm80xx0:: pm80xx_chip_sata_req  4593:FPDMA
>> [  137.257400] pm80xx0:: pm80xx_chip_sata_req  4682:Sending Normal SATA
>> command 0x65 inb f
>> [  167.506280] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>> [  167.512363] sas: sas_scsi_find_task: aborting task 0x00000000aa372627
>> [  167.519049] pm80xx0:: pm8001_chip_abort_task  4607:cmd_tag = 2, abort
>> task tag = 0x1
>> [  187.969173] pm80xx0:: pm8001_exec_internal_task_abort  753:TMF task
>> timeout.

As I mentioned, having this fail is a red flag. If I was pushed to guess 
what has happened, I'd say the FW is faulting due to some erroneous 
driver behaviour.

>> [  187.976450] sas: sas_scsi_find_task: task 0x00000000aa372627 is aborted
>> [  187.983244] sas: sas_eh_handle_sas_errors: task 0x00000000aa372627 is
>> aborted



>>
>> After these messages, the tests exit on failure (drive dropped) and
>> there are no more messages. Doing rmmod or anything else get stuck too.
>> I have to reset the machine to get back to a good state.
>>
>> I am starting to think that NCQ NON DATA command is being very badly
>> handled... Switching the tests to run with all non-NCQ commands is
>> working fine, albeit horribly slow (much slower than with other HBAs,
>> e.g. Broadcom).
>>
>> Digging...
> I missed a KASAN splat during device scan on boot:
> 
>     33.725184]
> ==================================================================
> [   33.746168] BUG: KASAN: use-after-free in __lock_acquire+0x41a5/0x5b50
> [   33.764181] Read of size 8 at addr ffff88818a318660 by task
> kworker/u64:6/583

...

> ==================================================================
> 
> This is the submission path, not completion. The code is:
> 
> (gdb) list *(pm8001_queue_command+0x842)
> 0x3d42 is in pm8001_queue_command (drivers/scsi/pm8001/pm8001_sas.c:491).
> 486				atomic_dec(&pm8001_dev->running_req);
> 487				goto err_out_tag;
> 488			}
> 489			/* TODO: select normal or high priority */
> 490			spin_lock(&t->task_state_lock);
> 491			t->task_state_flags |= SAS_TASK_AT_INITIATOR;
> 492			spin_unlock(&t->task_state_lock);
> 493		} while (0);
> 494		rc = 0;
> 495		goto out_done;
> 
> So the task is already completed when the submission path tries to set
> the state flag ? Debugging...

Yeah, that's how it looks.

I already mentioned this problem here:

https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@huawei.com/

Maybe we should just fix it now to rule it out of possibly causing other 
issues... I was reluctant to fix it as many places seems to need to be 
touched. Let me check it.

Thanks,
John