lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2bd15f44-0b56-b6e0-8ef8-c33ba9ee7caa@opensource.wdc.com>
Date:   Fri, 4 Feb 2022 20:27:45 +0900
From:   Damien Le Moal <damien.lemoal@...nsource.wdc.com>
To:     John Garry <john.garry@...wei.com>, jejb@...ux.ibm.com,
        martin.petersen@...cle.com, artur.paszkiewicz@...el.com,
        jinpu.wang@...ud.ionos.com, chenxiang66@...ilicon.com,
        Ajish.Koshy@...rochip.com
Cc:     yanaijie@...wei.com, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-scsi@...r.kernel.org,
        linuxarm@...wei.com, liuqi115@...wei.com, Viswas.G@...rochip.com
Subject: Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

On 2/4/22 19:36, John Garry wrote:
> On 04/02/2022 03:02, Damien Le Moal wrote:
>>> The inconsistency is this line says:
>>> [  137.193944] pm80xx0:: pm80xx_chip_sata_req  4581:no data
>>> Which seems to be sensical for NCQ_NON_DATA command, but then, this line
>>> seems wrong:
>>> [  137.228015] pm80xx0:: mpi_sata_completion  2515:FPDMA len = 8
>>>
>>> I need to go and check the specs what the FIS reply format is for
>>> NCQ_NON_DATA.
>>>
>>>
>>> [  137.187184] pm80xx0:: pm8001_queue_command  408:pm8001_task_exec device
>>> [  137.193944] pm80xx0:: pm80xx_chip_sata_req  4581:no data
>>> [  137.199339] pm80xx0:: pm80xx_chip_sata_req  4682:Sending Normal SATA
>>> command 0x63 inb 4
>>> [  137.207577] pm80xx0:: pm8001_mpi_msg_consume  1446:: CI=46 PI=47
>>> msgHeader=8104200d
>>> [  137.215399] pm80xx0:: mpi_sata_completion  2481:IO_SUCCESS
>>> [  137.220961] pm80xx0:: mpi_sata_completion  2503:SAS_PROTO_RESPONSE
>>> len = 20
>>> [  137.228015] pm80xx0:: mpi_sata_completion  2515:FPDMA len = 8
>>> [  137.233878] pm80xx0:: pm8001_mpi_msg_free_set  1403: CI=47 PI=47
>>> [  137.236696] pm80xx0:: pm8001_queue_command  408:pm8001_task_exec device
>>> [  137.247102] pm80xx0:: pm80xx_chip_sata_req  4585:DMA
>>> [  137.252186] pm80xx0:: pm80xx_chip_sata_req  4593:FPDMA
>>> [  137.257400] pm80xx0:: pm80xx_chip_sata_req  4682:Sending Normal SATA
>>> command 0x65 inb f
>>> [  167.506280] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
>>> [  167.512363] sas: sas_scsi_find_task: aborting task 0x00000000aa372627
>>> [  167.519049] pm80xx0:: pm8001_chip_abort_task  4607:cmd_tag = 2, abort
>>> task tag = 0x1
>>> [  187.969173] pm80xx0:: pm8001_exec_internal_task_abort  753:TMF task
>>> timeout.
> 
> As I mentioned, having this fail is a red flag. If I was pushed to guess 
> what has happened, I'd say the FW is faulting due to some erroneous 
> driver behaviour.

I am still thinking that there is something wrong with the handling of
NCQ NON DATA command. There are several places where the code determines
non-data vs pio vs dma vs fpdma (ncq), and NCQ NON DATA always falls in
the fpdma bucket, which is wrong.

>> This is the submission path, not completion. The code is:
>>
>> (gdb) list *(pm8001_queue_command+0x842)
>> 0x3d42 is in pm8001_queue_command (drivers/scsi/pm8001/pm8001_sas.c:491).
>> 486				atomic_dec(&pm8001_dev->running_req);
>> 487				goto err_out_tag;
>> 488			}
>> 489			/* TODO: select normal or high priority */
>> 490			spin_lock(&t->task_state_lock);
>> 491			t->task_state_flags |= SAS_TASK_AT_INITIATOR;
>> 492			spin_unlock(&t->task_state_lock);
>> 493		} while (0);
>> 494		rc = 0;
>> 495		goto out_done;
>>
>> So the task is already completed when the submission path tries to set
>> the state flag ? Debugging...
> 
> Yeah, that's how it looks.
> 
> I already mentioned this problem here:
> 
> https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@huawei.com/
> 
> Maybe we should just fix it now to rule it out of possibly causing other 
> issues... I was reluctant to fix it as many places seems to need to be 
> touched. Let me check it.

Here is my current fix:

diff --git a/drivers/scsi/pm8001/pm8001_sas.c
b/drivers/scsi/pm8001/pm8001_sas.c
index 1b95c73d12d1..16c101577dd3 100644
--- a/drivers/scsi/pm8001/pm8001_sas.c
+++ b/drivers/scsi/pm8001/pm8001_sas.c
@@ -453,13 +453,18 @@ int pm8001_queue_command(struct sas_task *task,
gfp_t gfp_flags)
                ccb->ccb_tag = tag;
                ccb->task = t;
                ccb->device = pm8001_dev;
+
+               /* TODO: select normal or high priority */
+               atomic_inc(&pm8001_dev->running_req);
+               spin_lock(&t->task_state_lock);
+               t->task_state_flags |= SAS_TASK_AT_INITIATOR;
+               spin_unlock(&t->task_state_lock);
+
                switch (task_proto) {
                case SAS_PROTOCOL_SMP:
-                       atomic_inc(&pm8001_dev->running_req);
                        rc = pm8001_task_prep_smp(pm8001_ha, ccb);
                        break;
                case SAS_PROTOCOL_SSP:
-                       atomic_inc(&pm8001_dev->running_req);
                        if (is_tmf)
                                rc = pm8001_task_prep_ssp_tm(pm8001_ha,
                                        ccb, tmf);
@@ -468,7 +473,6 @@ int pm8001_queue_command(struct sas_task *task,
gfp_t gfp_flags)
                        break;
                case SAS_PROTOCOL_SATA:
                case SAS_PROTOCOL_STP:
-                       atomic_inc(&pm8001_dev->running_req);
                        rc = pm8001_task_prep_ata(pm8001_ha, ccb);
                        break;
                default:
@@ -480,13 +484,12 @@ int pm8001_queue_command(struct sas_task *task,
gfp_t gfp_flags)

                if (rc) {
                        pm8001_dbg(pm8001_ha, IO, "rc is %x\n", rc);
+                       spin_lock(&t->task_state_lock);
+                       t->task_state_flags &= ~SAS_TASK_AT_INITIATOR;
+                       spin_unlock(&t->task_state_lock);
                        atomic_dec(&pm8001_dev->running_req);
                        goto err_out_tag;
                }
-               /* TODO: select normal or high priority */
-               spin_lock(&t->task_state_lock);
-               t->task_state_flags |= SAS_TASK_AT_INITIATOR;
-               spin_unlock(&t->task_state_lock);
        } while (0);
        rc = 0;
        goto out_done;

With this, No KASAN complaint. I will send a proper patch ASAP.

Of note is that I cannot see what the flag SAS_TASK_AT_INITIATOR is for.
It is set and unset only, never tested anywhere in libsas nor pm8001
driver. This flag seems totally useless to me, unless this is something
that the HW can see ?

-- 
Damien Le Moal
Western Digital Research

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ