Message-ID: <cc74b7c2-4946-af6a-b761-cdfba2162a4d@huawei.com>
Date: Tue, 13 Jun 2017 15:05:30 +0100
From: John Garry <john.garry@...wei.com>
To: Arnd Bergmann <arnd@...db.de>
CC: "James E.J. Bottomley" <jejb@...ux.vnet.ibm.com>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
John Garry <john.garry2@...l.dcu.ie>, <linuxarm@...wei.com>,
<linux-scsi@...r.kernel.org>,
"Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
Xiang Chen <chenxiang66@...ilicon.com>
Subject: Re: [PATCH 20/22] scsi: hisi_sas: Add v3 code to support ECC and AXI
bus fatal error
On 17/05/2017 13:38, John Garry wrote:
> On 17/05/2017 13:27, Arnd Bergmann wrote:
>> On Wed, May 17, 2017 at 12:49 PM, John Garry <john.garry@...wei.com>
>> wrote:
>>> > From: Xiang Chen <chenxiang66@...ilicon.com>
>>> >
>>> > For ECC 1bit error, logic can recover it, so we only print a warning.
>>> > For ECC multi-bit and AXI bus fatal error, we panic.
>>> >
>>> > Signed-off-by: John Garry <john.garry@...wei.com>
>>> > Signed-off-by: Xiang Chen <chenxiang66@...ilicon.com>
>> This one is tricky as there are conflicting requirements:
>>
>> - For debugging purposes, you want to continue running the system
>> to figure out what exactly went wrong. Often enough, having the
>> kernel panic means you don't get to see the panic message because
>> console access is unavailable and you cannot log in any more
>>
>> - For data consistency purposes you want to stop the system as
>> soon as there is any uncorrectable data error
>>
>> I see that most scsi drivers don't ever call panic or BUG(), though
>> you already do so for v1 and v2 hw.
>>
>> Maybe the SCSI maintainers can provide some more guidance here.
>>
>> Arnd
>>
>> .
>>
>
> Hi Arnd,
>
> Actually latest code for v2 has been updated to do a controller reset,
> and not panic, for unrecoverable error:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c?h=v4.12-rc1#n2926
>
>
> We never got around to implementing controller reset for v1 as this
> platform (hip05) is not used much anymore.
>
> As for v3, we will change to do the same once controller reset is
> implemented. I should have added this to the commit log.
>
> Thanks,
> John
It has come to light that the hip08 RAS architecture requires certain
errors to be handled with a firmware-first model. I am not sure about
the flow of controller reset for fatal errors - I'm currently checking
the details. But it is not worth adding this non-critical patch and
reverting it later, so I'll omit this patch when sending the v6
patchset, which includes the fix for sloppy spinlock usage.
Thanks,
John