Message-ID: <f190f19e-34b2-611c-1cf4-f8f34d12fe74@huawei.com>
Date: Thu, 6 Oct 2022 09:33:23 +0100
From: John Garry <john.garry@...wei.com>
To: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
Niklas Cassel <Niklas.Cassel@....com>
CC: "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
"martin.petersen@...cle.com" <martin.petersen@...cle.com>,
"jinpu.wang@...ud.ionos.com" <jinpu.wang@...ud.ionos.com>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Linuxarm <linuxarm@...wei.com>,
yangxingui <yangxingui@...wei.com>,
yanaijie <yanaijie@...wei.com>
Subject: Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling
On 05/10/2022 23:42, Damien Le Moal wrote:
>> Hello Damien,
>>
>> John explained that he got a timeout from EH when reading the log:
>> [ 350.281581] ata1: failed to read log page 10h (errno=-5)
>> [ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
>>
>> ata_eh_read_log_10h() uses ata_read_log_page(), which will first try to read
>> the log using READ LOG DMA EXT. If that fails, it will retry using READ LOG EXT.
>>
>> Therefore, to see if this is a driver specific bug, I suggested trying to read
>> the NCQ Command Error log using ATA16 passthrough commands:
>>
>> $ sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc
>> will read the log using READ LOG DMA EXT.
>>
>> $ sudo sg_sat_read_gplog --log=0x10 /dev/sdc
>> will read the log using READ LOG EXT.
Note that I can't get a distro to boot from the HDD on this system because
of the same timeout problem (so no tools are easily available).
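
For reference, the fallback described above (READ LOG DMA EXT first, then
READ LOG EXT) can be modelled roughly as below. This is a standalone sketch,
not the libata code: struct model_dev, struct fake_drive, fake_issue() and
model_read_log_page() are all made-up names, and only the two opcodes are
real ATA values.

```c
#include <stdbool.h>

/* ATA opcodes for the two ways of reading a general purpose log. */
enum {
	CMD_READ_LOG_EXT     = 0x2f,	/* PIO data-in */
	CMD_READ_LOG_DMA_EXT = 0x47,	/* DMA */
};

/* Hypothetical device state: is the DMA variant still worth trying? */
struct model_dev {
	bool dma_log_ok;
};

/* Hypothetical fake drive, only here to exercise the fallback. */
struct fake_drive {
	bool dma_fails;
	bool pio_fails;
	int issued;		/* number of commands actually sent */
	unsigned char last_cmd;	/* opcode of the last command sent */
};

/* Pretend to send one command; returns 0 on success, -1 on failure. */
static int fake_issue(struct fake_drive *d, unsigned char cmd)
{
	d->issued++;
	d->last_cmd = cmd;
	if (cmd == CMD_READ_LOG_DMA_EXT)
		return d->dma_fails ? -1 : 0;
	return d->pio_fails ? -1 : 0;
}

/*
 * Try the DMA variant first. If it fails, remember not to use it
 * again for this device and retry once with the PIO variant.
 */
static int model_read_log_page(struct model_dev *dev, struct fake_drive *d)
{
	if (dev->dma_log_ok) {
		if (fake_issue(d, CMD_READ_LOG_DMA_EXT) == 0)
			return 0;
		dev->dma_log_ok = false;
	}
	return fake_issue(d, CMD_READ_LOG_EXT);
}
```

With a fake drive whose DMA variant fails, the model sends two commands and
the second one is READ LOG EXT. If both variants fail, the caller only sees
the error from the second attempt.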
>>
>> Neither of these two suggested commands is an NCQ command.
>> (Neither command is encapsulated in a RECEIVE FPDMA QUEUED,
>> so I'm not sure what you mean.)
>>
>>
>> Garry, I now see that:
>> [ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
>> Your port is frozen.
>>
>> ata_read_log_page() calls ata_exec_internal() which calls ata_exec_internal_sg(),
>> which will simply return an error without sending down the command to the drive,
>> if the port is frozen.
>>
>> Not sure why your port is frozen, mine is obviously not.
I think that it gets frozen when the internal command for read log ext
times out. More below about that timeout.
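
To spell out the sequence I have in mind, here is a toy model of it. It
assumes that a timed-out internal command freezes the port, and that a
frozen port rejects any later internal command without sending it to the
drive. Everything here (struct model_port, model_exec_internal(), the
MODEL_* values) is invented for illustration and is not the libata API.

```c
#include <stdbool.h>

/* Hypothetical result codes for the model. */
enum model_err {
	MODEL_OK = 0,
	MODEL_ERR_TIMEOUT,	/* command was sent, drive never answered */
	MODEL_ERR_SYSTEM,	/* command was refused before being sent */
};

/* Hypothetical port state. */
struct model_port {
	bool frozen;
	int issued;	/* commands that actually reached the drive */
};

/*
 * Model of issuing one internal command:
 *  - a frozen port fails the command immediately, nothing is sent;
 *  - otherwise the command is sent, and a timeout freezes the port.
 */
static enum model_err model_exec_internal(struct model_port *ap,
					  bool drive_responds)
{
	if (ap->frozen)
		return MODEL_ERR_SYSTEM;

	ap->issued++;
	if (!drive_responds) {
		ap->frozen = true;
		return MODEL_ERR_TIMEOUT;
	}
	return MODEL_OK;
}
```

In this model the first read log attempt times out and freezes the port, and
a second attempt fails straight away with only one command ever having been
issued. That would be consistent with the "frozen" in the exception line.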
>>
>> ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989
>>
>> So I'm not sure why your port is frozen.
>> (The fast drain timer does freeze the port, but it shouldn't be enabled.)
>> It might be worthwhile to see who freezes the port in your case.
> Might come from the command timeout. John has had many problems with the
> pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
> but a hw one... No-one seems to be able to recreate the same problem.
>
> We need to try the HBA on our Arm board to see what happens.
>
Yeah, it just looks to be the longstanding issue of using this card on
my arm64 machine, namely that I get IO timeouts quite regularly. I
should have mentioned that yesterday. This just seems to be a driver issue.
Interestingly, this read log ext always seems to time out, so maybe I
could see if there is anything specific about this command which could
give a clue to the underlying issue. But I have spent much time trying
to debug this issue, so I'm not too motivated any more, if I'm completely
honest ...
Thanks,
John