[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e8af9d55-bf36-faa0-defb-3a9a4931826e@huawei.com>
Date: Fri, 4 Feb 2022 10:36:41 +0000
From: John Garry <john.garry@...wei.com>
To: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
<jejb@...ux.ibm.com>, <martin.petersen@...cle.com>,
<artur.paszkiewicz@...el.com>, <jinpu.wang@...ud.ionos.com>,
<chenxiang66@...ilicon.com>, <Ajish.Koshy@...rochip.com>
CC: <yanaijie@...wei.com>, <linux-doc@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-scsi@...r.kernel.org>,
<linuxarm@...wei.com>, <liuqi115@...wei.com>,
<Viswas.G@...rochip.com>
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code
On 04/02/2022 03:02, Damien Le Moal wrote:
>> The inconsistency is this line says:
>> [ 137.193944] pm80xx0:: pm80xx_chip_sata_req 4581:no data
>> Which seems to be sensical for NCQ_NON_DATA command, but then, this line
>> seems wrong:
>> [ 137.228015] pm80xx0:: mpi_sata_completion 2515:FPDMA len = 8
>>
>> I need to go and check the specs what the FIS reply format is for
>> NCQ_NON_DATA.
>>
>>
>> [ 137.187184] pm80xx0:: pm8001_queue_command 408:pm8001_task_exec device
>> [ 137.193944] pm80xx0:: pm80xx_chip_sata_req 4581:no data
>> [ 137.199339] pm80xx0:: pm80xx_chip_sata_req 4682:Sending Normal SATA
>> command 0x63 inb 4
>> [ 137.207577] pm80xx0:: pm8001_mpi_msg_consume 1446:: CI=46 PI=47
>> msgHeader=8104200d
>> [ 137.215399] pm80xx0:: mpi_sata_completion 2481:IO_SUCCESS
>> [ 137.220961] pm80xx0:: mpi_sata_completion 2503:SAS_PROTO_RESPONSE
>> len = 20
>> [ 137.228015] pm80xx0:: mpi_sata_completion 2515:FPDMA len = 8
>> [ 137.233878] pm80xx0:: pm8001_mpi_msg_free_set 1403: CI=47 PI=47
>> [ 137.236696] pm80xx0:: pm8001_queue_command 408:pm8001_task_exec device
>> [ 137.247102] pm80xx0:: pm80xx_chip_sata_req 4585:DMA
>> [ 137.252186] pm80xx0:: pm80xx_chip_sata_req 4593:FPDMA
>> [ 137.257400] pm80xx0:: pm80xx_chip_sata_req 4682:Sending Normal SATA
>> command 0x65 inb f
>> [ 167.506280] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>> [ 167.512363] sas: sas_scsi_find_task: aborting task 0x00000000aa372627
>> [ 167.519049] pm80xx0:: pm8001_chip_abort_task 4607:cmd_tag = 2, abort
>> task tag = 0x1
>> [ 187.969173] pm80xx0:: pm8001_exec_internal_task_abort 753:TMF task
>> timeout.
As I mentioned, having this fail is a red flag. If I was pushed to guess
what has happened, I'd say the FW is faulting due to some erroneous
driver behaviour.
>> [ 187.976450] sas: sas_scsi_find_task: task 0x00000000aa372627 is aborted
>> [ 187.983244] sas: sas_eh_handle_sas_errors: task 0x00000000aa372627 is
>> aborted
>>
>> After these messages, the tests exit on failure (drive dropped) and
>> there are no more messages. Doing rmmod or anything else get stuck too.
>> I have to reset the machine to get back to a good state.
>>
>> I am starting to think that NCQ NON DATA command is being very badly
>> handled... Switching the tests to run with all non-NCQ commands is
>> working fine, albeit horribly slow (much slower than with other HBAs,
>> e.g. Broadcom).
>>
>> Digging...
> I missed a KASAN splat during device scan on boot:
>
> 33.725184]
> ==================================================================
> [ 33.746168] BUG: KASAN: use-after-free in __lock_acquire+0x41a5/0x5b50
> [ 33.764181] Read of size 8 at addr ffff88818a318660 by task
> kworker/u64:6/583
...
> ==================================================================
>
> This is the submission path, not completion. The code is:
>
> (gdb) list *(pm8001_queue_command+0x842)
> 0x3d42 is in pm8001_queue_command (drivers/scsi/pm8001/pm8001_sas.c:491).
> 486 atomic_dec(&pm8001_dev->running_req);
> 487 goto err_out_tag;
> 488 }
> 489 /* TODO: select normal or high priority */
> 490 spin_lock(&t->task_state_lock);
> 491 t->task_state_flags |= SAS_TASK_AT_INITIATOR;
> 492 spin_unlock(&t->task_state_lock);
> 493 } while (0);
> 494 rc = 0;
> 495 goto out_done;
>
> So the task is already completed when the submission path tries to set
> the state flag ? Debugging...
Yeah, that's how it looks.
I already mentioned this problem here:
https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@huawei.com/
Maybe we should just fix it now to rule it out of possibly causing other
issues... I was reluctant to fix it as many places seems to need to be
touched. Let me check it.
Thanks,
John
Powered by blists - more mailing lists