lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cfa52b91-db81-a179-76c2-8a61266c099d@huawei.com>
Date:   Tue, 4 Oct 2022 15:04:51 +0100
From:   John Garry <john.garry@...wei.com>
To:     Niklas Cassel <Niklas.Cassel@....com>
CC:     "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
        "martin.petersen@...cle.com" <martin.petersen@...cle.com>,
        "jinpu.wang@...ud.ionos.com" <jinpu.wang@...ud.ionos.com>,
        "damien.lemoal@...nsource.wdc.com" <damien.lemoal@...nsource.wdc.com>,
        "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linuxarm@...wei.com" <linuxarm@...wei.com>,
        "yangxingui@...wei.com" <yangxingui@...wei.com>,
        "yanaijie@...wei.com" <yanaijie@...wei.com>
Subject: Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

On 04/10/2022 14:05, Niklas Cassel wrote:
>> 2.35.3
>>
> For pm80xx (pm8001 changes untested):
> Tested-by: Niklas Cassel<niklas.cassel@....com>
> 
> 

Thanks!

> 
> Notes unrelated to this patch:
> 
> Both before and after this series, this driver prints:
> [  215.845053] ata21.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x6
> [  215.852308] ata21.00: failed command: WRITE FPDMA QUEUED
> [  215.857801] ata21.00: cmd 61/00:00:00:3a:d3/01:00:b3:04:00/40 tag 18 ncq dma 131072 out
>                          res 43/04:00:ff:3a:d3/00:00:b3:04:00/40 Emask 0x400 (NCQ error) <F>
> [  215.874396] ata21.00: status: { DRDY SENSE ERR }
> [  215.879192] ata21.00: error: { ABRT }
> [  215.882997] ata21.00: failed command: WRITE FPDMA QUEUED
> [  215.888479] ata21.00: cmd 61/00:00:00:3b:d3/01:00:b3:04:00/40 tag 19 ncq dma 131072 out
>                          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> [  215.904814] ata21.00: failed command: WRITE FPDMA QUEUED
> [  215.910311] ata21.00: cmd 61/00:00:00:3c:d3/01:00:b3:04:00/40 tag 20 ncq dma 131072 out
>                          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> [  215.932679] ata21.00: failed command: WRITE FPDMA QUEUED
> [  215.941203] ata21.00: cmd 61/00:00:00:3d:d3/01:00:b3:04:00/40 tag 21 ncq dma 131072 out
>                          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> [  215.963616] ata21.00: failed command: WRITE FPDMA QUEUED
> [  215.972150] ata21.00: cmd 61/00:00:00:3e:d3/01:00:b3:04:00/40 tag 22 ncq dma 131072 out
>                          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> [  215.994532] ata21.00: failed command: WRITE FPDMA QUEUED
> [  216.003124] ata21.00: cmd 61/00:00:00:3f:d3/01:00:b3:04:00/40 tag 23 ncq dma 131072 out
>                          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> 
> HSM (Host State Machine) violation errors.
> 
> For the same SATA drive connected via AHCI this will instead give:
> 
> [ 3796.944923] ata14.00: exception Emask 0x0 SAct 0x80800003 SErr 0xc0000 action 0x0
> [ 3796.959375] ata14.00: irq_stat 0x40000008
> [ 3796.970140] ata14: SError: { CommWake 10B8B }
> [ 3796.981231] ata14.00: failed command: WRITE FPDMA QUEUED
> [ 3796.993237] ata14.00: cmd 61/00:08:00:7e:73/02:00:8e:08:00/40 tag 1 ncq dma 262144 out
>                          res 43/04:01:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error)
> [ 3797.017984] ata14.00: status: { DRDY SENSE ERR }
> [ 3797.026833] ata14.00: error: { ABRT }
> [ 3797.034664] ata14.00: failed command: WRITE FPDMA QUEUED
> [ 3797.043015] ata14.00: cmd 61/00:b8:00:60:73/0a:00:8e:08:00/40 tag 23 ncq dma 1310720 out
>                          res 43/04:00:df:67:73/00:00:8e:08:00/40 Emask 0x400 (NCQ error) <F>
> [ 3797.065224] ata14.00: status: { DRDY SENSE ERR }
> [ 3797.072914] ata14.00: error: { ABRT }
> [ 3797.079598] ata14.00: failed command: WRITE FPDMA QUEUED
> [ 3797.087920] ata14.00: cmd 61/00:f8:00:6a:73/0a:00:8e:08:00/40 tag 31 ncq dma 1310720 out
>                          res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
> [ 3797.109800] ata14.00: status: { DRDY SENSE ERR }
> [ 3797.117451] ata14.00: error: { ABRT }
> 
> device error errors.
> 
> 
> Except for the I/O that caused the NCQ error, the remaining outstanding I/Os,
> regardless if they were aborted by the drive, as a side-effect of reading the
> NCQ error log (see 13.7.4 Queued Error Log (10h) in SATA 3.5a spec),
> or if they were aborted by the host (by sas_ata_device_link_abort()),
> I don't think it is correct to report these as HSM violation errors.
> 
> HSM violation errors are e.g. when you try to issue a command to a drive
> that has ATA_BUSY bit set.

We had a similar issue for hisi_sas and solved in patch 4/7: don't set 
ATA_ERR in the fis for those IO which complete with error, but abort the 
IO via sas_abort_task().

For pm80xx the IO is either rejected (actually completes with rejection) 
or is aborted via internal abort command. Maybe we can do similar for 
pm8001 as we allow the IO to complete in both cases with error. I'll check.

Thanks,
John

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ