linux-kernel - Re: [PATCH] vfio/pci: Support error recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161215150131.2bdac9a8@t450s.home>
Date:   Thu, 15 Dec 2016 15:01:31 -0700
From:   Alex Williamson <alex.williamson@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     Cao jin <caoj.fnst@...fujitsu.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, izumi.taku@...fujitsu.com
Subject: Re: [PATCH] vfio/pci: Support error recovery

On Thu, 15 Dec 2016 16:50:07 +0200
"Michael S. Tsirkin" <mst@...hat.com> wrote:

> On Thu, Dec 15, 2016 at 09:56:41PM +0800, Cao jin wrote:
> > 
> > 
> > On 12/15/2016 06:16 AM, Alex Williamson wrote:  
> > > On Wed, 14 Dec 2016 18:24:23 +0800
> > > Cao jin <caoj.fnst@...fujitsu.com> wrote:
> > >   
> > >> Sorry for late.
> > >> after reading all your comments, I think I will try the solution 1.
> > >>
> > >> On 12/13/2016 03:12 AM, Alex Williamson wrote:  
> > >>> On Mon, 12 Dec 2016 21:49:01 +0800
> > >>> Cao jin <caoj.fnst@...fujitsu.com> wrote:
> > >>>     
> > >>>> Hi,
> > >>>> I have 2 solutions(high level design) came to me, please see if they are
> > >>>> acceptable, or which one is acceptable. Also have some questions.
> > >>>>
> > >>>> 1. block guest access during host recovery
> > >>>>
> > >>>>    add new field error_recovering in struct vfio_pci_device to
> > >>>>    indicate host recovery status. aer driver in host will still do
> > >>>>    reset link
> > >>>>
> > >>>>    - set error_recovering in vfio-pci driver's error_detected, used to
> > >>>>      block all kinds of user access(config space, mmio)
> > >>>>    - in order to solve concurrent issue of device resetting & user
> > >>>>      access, check device state[*] in vfio-pci driver's resume, see if
> > >>>>      device reset is done, if it is, then clear"error_recovering", or
> > >>>>      else new a timer, check device state periodically until device
> > >>>>      reset is done. (what if device reset don't end for a long time?)
> > >>>>    - In qemu, translate guest link reset to host link reset.
> > >>>>      A question here: we already have link reset in host, is a second
> > >>>>      link reset necessary? why?
> > >>>>  
> > >>>>    [*] how to check device state: reading certain config space
> > >>>>        register, check return value is valid or not(All F's)    
> > >>>
> > >>> Isn't this exactly the path we were on previously?    
> > >>
> > >> Yes, it is basically the previous path, plus the optimization.
> > >>  
> > >>> There might be an
> > >>> optimization that we could skip back-to-back resets, but how can you
> > >>> necessarily infer that the resets are for the same thing? If the user
> > >>> accesses the device between resets, can you still guarantee the guest
> > >>> directed reset is unnecessary?  If time passes between resets, do you
> > >>> know they're for the same event?  How much time can pass between the
> > >>> host and guest reset to know they're for the same event?  In the
> > >>> process of error handling, which is more important, speed or
> > >>> correctness?
> > >>>      
> > >>
> > >> I think vfio driver itself won't know what each reset comes for, and I
> > >> don't quite understand why should vfio care this question, is this a new
> > >> question in the design?  
> > > 
> > > You're suggesting an optimization to eliminate one of the resets,
> > > and as we've discussed, I don't see removing the host induced reset
> > > as a viable option.  That means you want to eliminate the guest
> > > directed reset.  There are potentially three levels to do that, the
> > > vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
> > > within the guest.  My comments were directed to the first option, the
> > > host kernel level cannot correlate user directed resets as duplicates
> > > of host directed resets.  
> > >    
> > 
> > Ah, maybe it is mistake, I don't really want to eliminate guest directed
> > reset very much, I was just not sure why it is very necessary.
> > 
> > The optimization I said just is fully separating host recovery from
> > guest recovery(timer, check device periodically) in time, because there
> > is concurrent device resetting & user access.
> >   
> > >> But I think it make sense that the user access during 2 resets maybe a
> > >> trouble for guest recovery, misbehaved user could be out of our
> > >> imagination.  Correctness is more important.
> > >>
> > >> If I understand you right, let me make a summary: host recovery just
> > >> does link reset, which is incomplete, so we'd better do a complete guest
> > >> recovery for correctness.  
> > > 
> > > We don't know whether the host link reset is incomplete, but we can't do
> > > a link reset transparently to the device, the device is no longer in the
> > > same state after the reset.  The device specific driver, which exists
> > > in userspace needs to be involved in device recovery.  Therefore
> > > regardless of how QEMU handles the error, the driver within the guest
> > > needs to be notified and perform recovery.  Since the device is PCI and
> > > we're on x86 and nobody wants to introduce paravirtual error recovery,
> > > we must use AER.  Part of AER recovery includes the possibility of
> > > performing a link reset.  So it seems this eliminates avoiding the link
> > > reset within the guest.
> > > 
> > > That leaves QEMU.  Here we need to decide whether a guest triggered
> > > link reset induces a host link reset.  The current working theory is
> > > that yes, this must be the case.  If there is ever a case where a
> > > driver within the guest could trigger a link reset for the purposes
> > > of error recovery when the host has not, I think this must be the
> > > case.  Therefore, at least some guest induced link resets must become
> > > host link resets.  Currently we assume all guest induced link resets
> > > become host link resets.  Minimally to avoid that, QEMU would need to
> > > know (not assume) whether the host performed a link reset.  Even with
> > > that, QEMU would need to be able to correlate that a link reset from
> > > the guest is a duplicate of a link reset that was already performed by
> > > the host.  That implies that QEMU needs to deduce the intention of
> > > the guest.  That seems like a complicated task for a patch series that
> > > is already complicated enough, especially for a feature of questionable
> > > value given the configuration restrictions (imo).
> > > 
> > > I would much rather focus on getting it right and making it as simple
> > > as we can, even if that means links get reset one too many times on
> > > error.
> > >   
> > 
> > Thanks very much for your detailed explanation, it does helps me to
> > understand your concern, understand why a second link reset is necessary.
> > 
> > I still want to share my thoughts with you(not argue): now we know host
> > aer driver will do link reset for vfio-pci first, so I can say, even if
> > fatal error is link related, after host link reset, link can work now.
> > Then in qemu, we are not necessary to translate guest link reset to host
> > link reset, just use vfio_pci_reset() as it is to do device
> > reset(probably is FLR). Which also means we don't need following
> > patch(make code easier):
> > 
> > @@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev)
> > 
> >       trace_vfio_pci_reset(vdev->vbasedev.name);
> > 
> > +     if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
> > +         PCIDevice *br = pci_bridge_get_device(pdev->bus);
> > +
> > +         if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
> > +              PCI_BRIDGE_CTL_BUS_RESET)) {
> > +             if (pci_get_function_0(pdev) == pdev) {
> > +                 vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> > +             }
> > +             return;
> > +         }
> > +     }
> > +
> >       vfio_pci_pre_reset(vdev);
> > 
> > 
> > I think this also implies: we have a virtual link in qemu, but a virtual
> > link will never be broken like a physical link.(In particular we already
> > know host aer driver surely will do link reset to recover physical
> > link). So, guest's link reset don't need to care whether virtual link is
> > reset, just care virtual device.  And qemu "translates guest link reset
> > to host link reset" seems kind of taking link-reset responsibility over
> > from host:)
> >   
> > >>>> 2. skip link reset in aer driver of host kernel, for vfio-pci.
> > >>>>    Let user decide how to do serious recovery
> > >>>>
> > >>>>    add new field "user_driver" in struct pci_dev, used to skip link
> > >>>>    reset for vfio-pci; add new field "link_reset" in struct
> > >>>>    vfio_pci_device to indicate link has been reset or not during
> > >>>>    recovery
> > >>>>
> > >>>>    - set user_driver in vfio_pci_probe(), to skip link reset for
> > >>>>      vfio-pci in host.
> > >>>>    - (use a flag)block user access(config, mmio) during host recovery
> > >>>>      (not sure if this step is necessary)
> > >>>>    - In qemu, translate guest link reset to host link reset.
> > >>>>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
> > >>>>      is executed
> > >>>>    - In vfio-pci driver's resume, new a timer, check "link_reset" field
> > >>>>      periodically, if it is set in reasonable time, then clear it and
> > >>>>      delete timer, or else, vfio-pci driver will does the link reset!    
> > >>>
> > >>> What happens in the case of a multifunction device where each function
> > >>> is part of a separate IOMMU group and one function is hot-removed from
> > >>> the user? We can't do a link reset on that function since the other
> > >>> function is still in use.  We have no choice but release a device in an
> > >>> unknown state back to the host.    
> > >>
> > >> hot-remove from user, do you mean, for example, all functions assigned
> > >> to VM, then suddenly a person does something like following
> > >>
> > >> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
> > >>
> > >> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
> > >>
> > >> to return device to host driver, or don't bind it to host driver, let it
> > >> in driver-less state???  
> > > 
> > > Yes, the host kernel has no visiblity to how a user is making use of
> > > devices.  To support AER we require a similar topology between host and
> > > guest such that a guest link reset translates to a host reset.  That
> > > requirement is imposed by userspace, ie. QEMU.  The host kernel cannot
> > > presume that this is the case.  Therefore we could have a
> > > multi-function device where each function is assigned to the same or
> > > different users in any configuration.  If a fault occurs and we defer
> > > to the user to perform the link reset, we have absolutely no guarantee
> > > that it will ever occur.  If the functions are assigned to different
> > > users, then each user individually doesn't have the capability to
> > > perform a link reset.  If the devices happen to be assigned to a single
> > > user when the error occurs, we cannot assume the user has an AER
> > > compatible configuration, the devices could be exposed as separate
> > > single function devices, any one of which might be individually removed
> > > from the user and made use of by the host, such as your sysfs example
> > > above.  The host cannot perform a link reset in this case either
> > > as the sibling devices are still in use by the guest.  Thanks,
> > > 
> > > Alex
> > > 
> > >   
> > 
> > this explanation is valuable to me, so this is also why we can't do link
> > reset in vfio driver when one of the function is closed. And do link
> > reset in vfio driver until all functions are close is poor solution and
> > very complex(quarantine the device) as you said.
> > 
> > I am going to try solution 1, but I still have some consideration share
> > with you, this won't stop my trial, and don't have relationship with
> > above discussion, just FYI:
> > 
> > In non-virtuallization environment, from device's perspective, the steps
> > of a normal recovery consists of:
> >     error_detect
> >     mmio_enabled
> >     link_reset
> >     slot_reset
> >     resume
> > 
> > Now in our condition, the steps become:
> >     *link_reset* (host's, the following are guest's)
> >     error_detect
> >     mmio_enabled
> >     link_reset
> >     slot_reset
> >     resume
> > 
> > Especially, some device's specific driver in guest could do some
> > specific work in error_detect, take igb_io_error_detected() for example.
> > Like the words in pci-error-recovery.txt said:
> > 
> > it gives the driver a chance to cleanup, waiting for pending stuff
> > (timers, whatever, etc...) to complete;
> > 
> > But if link_reset is the first step, we lost all the status(register
> > value, etc) in the device. Of course I don't know if this will be a
> > problem (might not), just curious if this has been your concern:)  
> 
> You'll find I did mention it :)
> 
> But consider Documentation/PCI/pcieaer-howto.txt
> 
> 	3.2.2.2 Non-correctable (non-fatal and fatal) errors
> 
> 	If an error message indicates a non-fatal error, performing link reset
> 	at upstream is not required. The AER driver calls error_detected(dev,
> 	pci_channel_io_normal) to all drivers associated within a hierarchy in
> 	question. for example,
> 	EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
> 	If Upstream port A captures an AER error, the hierarchy consists of
> 	Downstream port B and EndPoint.
> 
> 	A driver may return PCI_ERS_RESULT_CAN_RECOVER,
> 	PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
> 	whether it can recover or the AER driver calls mmio_enabled as next.
> 
> 	If an error message indicates a fatal error, kernel will broadcast
> 	error_detected(dev, pci_channel_io_frozen) to all drivers within
> 	a hierarchy in question. Then, performing link reset at upstream is
> 	necessary.
> 
> I think that if you just forward errors to guests they will get confused.
> I see three possible approaches.
> 
> 
> 1. Always pretend to guest that there was a fatal error,
>   then basically:
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dce511f..4022f9b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1299,7 +1299,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>  
>  	vfio_device_put(device);
>  
> -	return PCI_ERS_RESULT_CAN_RECOVER;
> +	return PCI_ERS_RESULT_DISCONNECT;
>  }
>  
>  static const struct pci_error_handlers vfio_err_handlers = {
> 
> 
> probably conditional on userspace invoking some ioctl
> to avoid breaking existing users.
> 
> 2. send non fatal error to guest.
> Add another eventfd to distinguish non fatal and fatal errors.
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dce511f..e22f449 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1292,14 +1292,17 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>  
>  	mutex_lock(&vdev->igate);
>  
> -	if (vdev->err_trigger)
> +	if (state == pci_channel_io_normal && vdev->recover_trigger)
> +		eventfd_signal(vdev->recover_trigger, 1);
> +	else if (vdev->err_trigger)
>  		eventfd_signal(vdev->err_trigger, 1);
>  
>  	mutex_unlock(&vdev->igate);
>  
>  	vfio_device_put(device);
>  
>  	return PCI_ERS_RESULT_CAN_RECOVER;
>  }
>  
>  static const struct pci_error_handlers vfio_err_handlers = {
> 
> Forward non fatal ones to guest, stop vm on fatal ones.
> 
> 
> 
> 3. forward both non fatal and fatal error to guest
> This includes 1 and 2 above, and
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dce511f..4022f9b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1299,7 +1299,8 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>  
>  	vfio_device_put(device);
>  
> -	return PCI_ERS_RESULT_CAN_RECOVER;
> +	return state == pci_channel_io_normal : PCI_ERS_RESULT_CAN_RECOVER :
> +		PCI_ERS_RESULT_DISCONNECT;
>  }
>  
>  static const struct pci_error_handlers vfio_err_handlers = {
> 	
> Maybe make this conditional on recover_trigger to keep
> compatibility.
> 
> 
> You seem to be starting from 1. But how about starting small, and doing
> 2 as a first step? Fatal errors will still stop vm.
> This will help you merge a bunch of error reporting infrastructure
> without worrying about recovery so much.
> 
> Making some progress finally will be good.
> 
> 
> Alex, what do you think?

Well, as you and I discussed on IRC, I'm concerned that none of us
attempting to provide input or submit this code are really subject
matter experts when it comes to AER.  We're all stumbling around
trying to do what we think is right, but we're just taking stabs at how
we assume drivers behave and what really needs to happen.  Personally
I'm also not convinced that the current complexity involved and
configuration restrictions that the path we're taking implies to the VM
and usage of devices makes for a compelling feature.  At what point do
we determine that none of this is worthwhile?

My priorities are:
  * We need to do the right thing for the guest, I don't think we
    should be presuming that different reset types are equivalent,
    leaving gaps where we expect the guest/host to do a reset and don't
    follow through on other reset requests, and we need to notify the
    guest immediately for the error.
  * We need to do the right thing for the host, that means we should
    not give the user the opportunity to leave a device in a state
    where we haven't at least performed a bus reset on link error (this
    may be our current state and if so we should fix it).

I do like the idea of taking a step back and trying to determine if
there are some errors that we can handle incrementally from the
existing behavior, but I also don't want to get into a state where a
QEMU user needs to specify what degree of error handling they want and
follow different configuration rules for each type.  If the feature
isn't usable because the configuration requirements are too restrictive
or too confusing, perhaps this is not a worthwhile feature.  Thanks,

Alex