linux-kernel - Re: [PATCH 20/22] scsi: hisi_sas: Add v3 code to support ECC and AXI bus fatal error

Message-ID: <cc74b7c2-4946-af6a-b761-cdfba2162a4d@huawei.com>
Date:   Tue, 13 Jun 2017 15:05:30 +0100
From:   John Garry <john.garry@...wei.com>
To:     Arnd Bergmann <arnd@...db.de>
CC:     "James E.J. Bottomley" <jejb@...ux.vnet.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        John Garry <john.garry2@...l.dcu.ie>, <linuxarm@...wei.com>,
        <linux-scsi@...r.kernel.org>,
        "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
        Xiang Chen <chenxiang66@...ilicon.com>
Subject: Re: [PATCH 20/22] scsi: hisi_sas: Add v3 code to support ECC and AXI
 bus fatal error

On 17/05/2017 13:38, John Garry wrote:
> On 17/05/2017 13:27, Arnd Bergmann wrote:
>> On Wed, May 17, 2017 at 12:49 PM, John Garry <john.garry@...wei.com>
>> wrote:
>>> > From: Xiang Chen <chenxiang66@...ilicon.com>
>>> >
>>> > For ECC 1bit error, logic can recover it, so we only print a warning.
>>> > For ECC multi-bit and AXI bus fatal error, we panic.
>>> >
>>> > Signed-off-by: John Garry <john.garry@...wei.com>
>>> > Signed-off-by: Xiang Chen <chenxiang66@...ilicon.com>
>> This one is tricky as there are conflicting requirements:
>>
>> - For debugging purposes, you want to continue running the system
>>   to figure out what exactly went wrong. Often enough, having the
>>   kernel panic means you don't get to see the panic message because
>>   console access is unavailable and you cannot log in any more
>>
>> - For data consistency purposes you want to stop the system as
>>   soon as there is any uncorrectable data error
>>
>> I see that most scsi drivers don't ever call panic or BUG(), though
>> you already do so for v1 and v2 hw.
>>
>> Maybe the SCSI maintainers can provide some more guidance here.
>>
>>       Arnd
>>
>> .
>>
>
> Hi Arnd,
>
> Actually latest code for v2 has been updated to do a controller reset,
> and not panic, for unrecoverable error:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c?h=v4.12-rc1#n2926
>
>
> We never got around to implementing controller reset for v1 as this
> platform (hip05) is not used much anymore.
>
> As for v3, we will change to do the same once controller reset is
> implemented. I should have added this to the commit log.
>
> Thanks,
> John

It has come to light that the hip08 RAS architecture requires handling
certain errors with a firmware-first model. I am not sure about the flow of
controller reset for fatal errors - I'm currently checking the details.

But it is not worth adding this non-critical patch and reverting it
later, so I'll omit this patch when sending the v6 patchset, which
includes the fix for sloppy spinlock usage.

Thanks,
John