lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 6 Oct 2022 07:42:30 +0900
From:   Damien Le Moal <damien.lemoal@...nsource.wdc.com>
To:     Niklas Cassel <Niklas.Cassel@....com>
Cc:     John Garry <john.garry@...wei.com>,
        "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
        "martin.petersen@...cle.com" <martin.petersen@...cle.com>,
        "jinpu.wang@...ud.ionos.com" <jinpu.wang@...ud.ionos.com>,
        "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Linuxarm <linuxarm@...wei.com>,
        yangxingui <yangxingui@...wei.com>,
        yanaijie <yanaijie@...wei.com>
Subject: Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

On 10/6/22 07:11, Niklas Cassel wrote:
> On Thu, Oct 06, 2022 at 06:36:05AM +0900, Damien Le Moal wrote:
>> On 10/6/22 06:28, Niklas Cassel wrote:
>>> On Wed, Oct 05, 2022 at 09:53:52AM +0100, John Garry wrote:
>>>> On 04/10/2022 15:04, John Garry wrote:
>>>>
>>>> Hi Niklas,
>>>>
>>>> Could you try a change like this on top:
>>>>
>>>> void sas_ata_device_link_abort(struct domain_device *device, bool
>>>> force_reset)
>>>> {
>>>> 	struct ata_port *ap = device->sata_dev.ap;
>>>> 	struct ata_link *link = &ap->link;
>>>>
>>>> +	device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY;
>>>> +	device->sata_dev.fis[3] = 0x04;
>>>>
>>>> 	link->eh_info.err_mask |= AC_ERR_DEV;
>>>> 	if (force_reset)
>>>> 		link->eh_info.action |= ATA_EH_RESET;
>>>> 	ata_link_abort(link);
>>>> }
>>>> EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
>>>>
>>>> I tried it myself and it looked to work ok, except I have a problem with my
>>>> arm64 system in that the read log ext times-out and all TF show "device
>>>> error", like:
>>>
>>> Do you know why it fails to read the log?
>>> Can you read the NCQ Command Error log using ATA16 passthrough commands?
>>>
>>> sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc
>>>
>>> The first byte is the last NCQ tag (in hex) that failed.
>>
>> libata issues read log as a non-ncq command under EH. So the NCQ error log
>> will not help.
> 
> Hello Damien,
> 
> John explained that he got a timeout from EH when reading the log:
> [  350.281581] ata1: failed to read log page 10h (errno=-5)
> [  350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
> 
> ata_eh_read_log_10h() uses ata_read_log_page(), which will first try to read
> the log using READ LOG DMA EXT. If that fails, it will retry using READ LOG EXT.
> 
> Therefore, to see if this is a driver specific bug, I suggested to try to read
> the NCQ Command Error log using ATA16 passthrough commands:
> 
> $ sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc
> will read the log using READ LOG DMA EXT.
> 
> $ sudo sg_sat_read_gplog --log=0x10 /dev/sdc
> will read the log using READ LOG EXT.
> 
> Neither of these two suggested commands are NCQ commands.
> (Neither command is encapsulated in a RECEIVE FPDMA QUEUED,
> so I'm not sure what you mean.)
> 
> 
> Garry, I now see that:
> [  350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
> Your port is frozen.
> 
> ata_read_log_page() calls ata_exec_internal() which calls ata_exec_internal_sg(),
> which will simply return an error without sending down the command to the drive,
> if the port is frozen.
> 
> Not sure why your port is frozen, mine is obviously not.
> 
> ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989
> 
> So I'm not sure why your port is frozen.
> (The fast drain timer does freeze the port, but it shouldn't be enabled.)
> It might be worthwhile to see who freezes the port in your case.

Might come from the command timeout. John has had many problems with the
pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
but a hw one... No-one seems to be able to recreate the same problem.

We need to try the HBA on our Arm board to see what happens.

> 
> 
> Kind regards,
> Niklas

-- 
Damien Le Moal
Western Digital Research

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ