linux-kernel - Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

Message-ID: <f190f19e-34b2-611c-1cf4-f8f34d12fe74@huawei.com>
Date:   Thu, 6 Oct 2022 09:33:23 +0100
From:   John Garry <john.garry@...wei.com>
To:     Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
        Niklas Cassel <Niklas.Cassel@....com>
CC:     "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
        "martin.petersen@...cle.com" <martin.petersen@...cle.com>,
        "jinpu.wang@...ud.ionos.com" <jinpu.wang@...ud.ionos.com>,
        "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Linuxarm <linuxarm@...wei.com>,
        yangxingui <yangxingui@...wei.com>,
        yanaijie <yanaijie@...wei.com>
Subject: Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

On 05/10/2022 23:42, Damien Le Moal wrote:
>> Hello Damien,
>>
>> John explained that he got a timeout from EH when reading the log:
>> [  350.281581] ata1: failed to read log page 10h (errno=-5)
>> [  350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
>>
>> ata_eh_read_log_10h() uses ata_read_log_page(), which will first try to read
>> the log using READ LOG DMA EXT. If that fails, it will retry using READ LOG EXT.
>>
>> Therefore, to see if this is a driver specific bug, I suggested to try to read
>> the NCQ Command Error log using ATA16 passthrough commands:
>>
>> $ sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc
>> will read the log using READ LOG DMA EXT.
>>
>> $ sudo sg_sat_read_gplog --log=0x10 /dev/sdc
>> will read the log using READ LOG EXT.

Note that I can't get a distro to boot on this system from the HDD for 
the same timeout problem (so no tools easily available).

>>
>> Neither of these two suggested commands is an NCQ command.
>> (Neither command is encapsulated in a RECEIVE FPDMA QUEUED,
>> so I'm not sure what you mean.)
>>
>>
>> Garry, I now see that:
>> [  350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
>> Your port is frozen.
>>
>> ata_read_log_page() calls ata_exec_internal() which calls ata_exec_internal_sg(),
>> which will simply return an error without sending down the command to the drive,
>> if the port is frozen.
>>
>> Not sure why your port is frozen, mine is obviously not.

I think that it gets frozen when the internal command for read log ext 
times out. More below about that timeout.
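
To make the sequence discussed above concrete, here is a toy model in C of how a timeout on the first attempt could leave the port frozen, so that the READ LOG EXT fallback fails without ever being issued. It is only a sketch of the hypothesis, not libata code: every name in it (toy_port, toy_exec_internal(), toy_read_log_page(), the error values) is made up for illustration, and the real logic lives in ata_read_log_page() and ata_exec_internal_sg().

```c
#include <stdbool.h>

/* Hypothetical names throughout: a toy model, not libata code. */
enum toy_cmd { TOY_READ_LOG_DMA_EXT, TOY_READ_LOG_EXT };
enum toy_err { TOY_OK = 0, TOY_ERR_TIMEOUT, TOY_ERR_FROZEN };

struct toy_port {
	bool frozen;        /* set once an internal command times out */
	bool dma_times_out; /* simulate READ LOG DMA EXT never completing */
	bool pio_times_out; /* simulate READ LOG EXT never completing */
	int issued;         /* commands that actually reached the device */
};

/* Issue one internal command, unless the port is already frozen. */
static enum toy_err toy_exec_internal(struct toy_port *p, enum toy_cmd cmd)
{
	bool times_out;

	if (p->frozen)
		return TOY_ERR_FROZEN; /* fail without sending the command */

	p->issued++;
	times_out = (cmd == TOY_READ_LOG_DMA_EXT) ? p->dma_times_out
						  : p->pio_times_out;
	if (times_out) {
		p->frozen = true; /* assumption: a timeout freezes the port */
		return TOY_ERR_TIMEOUT;
	}
	return TOY_OK;
}

/* Try the DMA variant first, then fall back to the PIO variant. */
static enum toy_err toy_read_log_page(struct toy_port *p)
{
	enum toy_err err = toy_exec_internal(p, TOY_READ_LOG_DMA_EXT);

	if (err != TOY_OK)
		err = toy_exec_internal(p, TOY_READ_LOG_EXT);
	return err;
}
```

With dma_times_out set, toy_read_log_page() issues exactly one command and then reports the frozen-port error from the fallback. That would be consistent with the log read failing while the port shows as frozen, assuming the timeout really is what freezes it.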

>>
>> ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989
>>
>> So I'm not sure why your port is frozen.
>> (The fast drain timer does freeze the port, but it shouldn't be enabled.)
>> It might be worthwhile to see who freezes the port in your case.
> Might come from the command timeout. John has had many problems with the
> pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
> but a hw one... No-one seems to be able to recreate the same problem.
> 
> We need to try the HBA on our Arm board to see what happens.
> 

Yeah, it just looks to be the longstanding issue of using this card on 
my arm64 machine - that is that I get IO timeouts quite regularly. I 
should have mentioned that yesterday. This just seems to be a driver issue.

Interestingly this read log ext always seems to time out, so maybe I 
could see if there is anything specific about this command which could 
give a clue to the underlying issue. But I have spent much time trying 
to debug this issue, so not too motivated any more if I’m completely 
honest ...

Thanks,
John