linux-kernel - Re: [PATCH] pci-error-recover: doc cleanup


Message-ID: <584A6470.60502@cn.fujitsu.com>
Date:   Fri, 9 Dec 2016 15:59:44 +0800
From:   Cao jin <caoj.fnst@...fujitsu.com>
To:     <linasvepstas@...il.com>
CC:     Jonathan Corbet <corbet@....net>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: [PATCH] pci-error-recover: doc cleanup



On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@...fujitsu.com> wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the PCI adapter be completely
>>> reset, any adapter firmware be reloaded from scratch, and the device
>>> driver kill all device state and start from scratch. It's huge.
>>> If the fatal error is on a PCI device that sits under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared some few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFOs, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am a little confused too, and not even sure we are talking about the
>> same *fatal error*. I mean the fatal error defined in the PCI Express
>> spec, chapter 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means setting the *secondary bus reset* bit in the PCI bridge
>> config space. It resets the link and the device simultaneously, and is
>> the strongest kind of reset as far as I know.
> 
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
> 

At least I can't find wording in the spec that says exactly that.
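
To make the point concrete, here is a rough user-space sketch of the
behaviour as I read it: the link reset is attempted only for a fatal
error, and the reset itself is a pulse of the Secondary Bus Reset bit in
the bridge's Bridge Control register. The register offset and bit value
are the ones from the kernel's pci_regs.h. Everything prefixed sketch_
(the bridge struct, the severity enum, the helpers) is made up for
illustration and is not kernel code, and the delays a real reset needs
are left out.

```c
#include <stdint.h>

/* Values as defined in the kernel's include/uapi/linux/pci_regs.h. */
#define PCI_BRIDGE_CONTROL        0x3e  /* Bridge Control register offset */
#define PCI_BRIDGE_CTL_BUS_RESET  0x40  /* Secondary Bus Reset bit */

/* Hypothetical severity for this sketch (stands in for AER severities). */
enum sketch_severity { SKETCH_NONFATAL, SKETCH_FATAL };

/* Hypothetical bridge model: raw config space plus a reset counter. */
struct sketch_bridge {
    uint8_t config[256];
    int secondary_resets;   /* number of reset pulses issued */
};

/* Little-endian 16-bit read from the modelled config space. */
static uint16_t sketch_read16(const struct sketch_bridge *b, int off)
{
    return (uint16_t)(b->config[off] | (b->config[off + 1] << 8));
}

/* Little-endian 16-bit write to the modelled config space. */
static void sketch_write16(struct sketch_bridge *b, int off, uint16_t val)
{
    b->config[off] = (uint8_t)(val & 0xff);
    b->config[off + 1] = (uint8_t)(val >> 8);
}

/* Pulse Secondary Bus Reset: set the bit, then clear it again. */
static void sketch_secondary_bus_reset(struct sketch_bridge *b)
{
    uint16_t ctrl = sketch_read16(b, PCI_BRIDGE_CONTROL);

    sketch_write16(b, PCI_BRIDGE_CONTROL,
                   (uint16_t)(ctrl | PCI_BRIDGE_CTL_BUS_RESET));
    /* Real hardware: wait here while the reset is asserted. */
    sketch_write16(b, PCI_BRIDGE_CONTROL,
                   (uint16_t)(ctrl & ~PCI_BRIDGE_CTL_BUS_RESET));
    /* Real hardware: wait again before touching devices below. */
    b->secondary_resets++;
}

/* Reset the link only for a fatal error; returns 1 if a reset was done. */
static int sketch_do_recovery(struct sketch_bridge *b,
                              enum sketch_severity sev)
{
    if (sev != SKETCH_FATAL)
        return 0;
    sketch_secondary_bus_reset(b);
    return 1;
}
```

In this model a non-fatal error leaves the bridge untouched, while a
fatal one pulses the bit and leaves the other Bridge Control bits as
they were.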

-- 
Sincerely,
Cao jin

> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state.  Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by resetting
> just the link, without resetting the whole adapter. They may require
> resetting some device-driver state, but not all of it.
> 
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.
> 
> --linas
> 
>>
>>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin <caoj.fnst@...fujitsu.com> wrote:
>>>>
>>>>
>>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>>> Cao jin <caoj.fnst@...fujitsu.com> wrote:
>>>>>
>>>>>>  The platform resets the link, and then calls the link_reset() callback
>>>>>>  on all affected device drivers.  This is a PCI-Express specific state
>>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>>  "solved" by resetting the link. This call informs the driver of the
>>>>>
>>>>> As far as I can tell, the original text was correct here; why do you
>>>>> think this change needs to be made?
>>>>>
>>>>
>>>> See do_recovery() in the AER core: reset_link() is called only when a
>>>> fatal error is seen.
>>>>
>>>> --
>>>> Sincerely,
>>>> Cao jin
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Sincerely,
>> Cao jin
>>
>>
> 
> 
> .
>