lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170406093119.792a591e@t450s.home>
Date:   Thu, 6 Apr 2017 09:31:19 -0600
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Cao jin <caoj.fnst@...fujitsu.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>,
        <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <qemu-devel@...gnu.org>, <izumi.taku@...fujitsu.com>
Subject: Re: [PATCH v6] vfio error recovery: kernel support

On Thu, 6 Apr 2017 16:49:35 +0800
Cao jin <caoj.fnst@...fujitsu.com> wrote:

> On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:  
> >> Apparently, I don't have experience to induce non-fatal error, device
> >> error is more of a chance related with the environment(temperature,
> >> humidity, etc) as I understand.  
> > 
> > I'm not sure how to interpret this statement. I think what Alex is
> > saying is simply that patches should include some justification. They
> > make changes but what are they improving?
> > For example:
> > 
> > 	I tested device ABC in conditions DEF. Without a patch VM
> > 	stops. With the patches applied VM recovers and proceeds to
> > 	use the device normally.
> > 
> > is one reasonable justification imho.
> >   
> 
> Got it. But unfortunately, until now, I haven't seen a VM stop caused by
> a real device non-fatal error during device assignment(Only saw real
> fatal errors after start VM).
> On one side, AER error could occur theoretically; on the other side,
> seldom people have seen a VM stop caused by AER. Now I am asked that do
> I have a real evidence or scenario to prove that this patchset is really
> useful? I don't, and we all know it is hard to trigger a real hardware
> error, so, seems I am pushed into the corner.  I guess these questions
> also apply for AER driver's author, if the scenario is easy to
> reproduce, there is no need to write aer_inject to fake errors.

Yes, non-fatal errors are rare, so rare in fact that I wonder if
anything actually implements them.  Could we take a survey of Linux
drivers with AER support and see which ones actually do anything useful
on a non-fatal error?  I know fatal errors exist, I've seen them.  The
situation I want to avoid is adding non-fatal AER support that never
gets used and just gets in the way of the next attempt to support fatal
error recovery.  We're extending the user ABI here, which means that
for any feature we add, we need to consider how we'll support it long
term and how we'll deprecate it or further extend it when the next
level of support comes.  I'm reluctant to add code that only partially
solves a problem and only addresses the side of the problem that we're
not even sure occurs.  Thanks,

Alex

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ