linux-kernel - Re: [PATCH] vfio/pci: Support error recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <5853BEBF.6000306@cn.fujitsu.com>
Date:   Fri, 16 Dec 2016 18:15:27 +0800
From:   Cao jin <caoj.fnst@...fujitsu.com>
To:     Alex Williamson <alex.williamson@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>
CC:     <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <izumi.taku@...fujitsu.com>
Subject: Re: [PATCH] vfio/pci: Support error recovery



On 12/16/2016 06:01 AM, Alex Williamson wrote:
> On Thu, 15 Dec 2016 16:50:07 +0200
> "Michael S. Tsirkin" <mst@...hat.com> wrote:
> 
>> On Thu, Dec 15, 2016 at 09:56:41PM +0800, Cao jin wrote:
>>>
>>>
>>> On 12/15/2016 06:16 AM, Alex Williamson wrote:  
>>>> On Wed, 14 Dec 2016 18:24:23 +0800
>>>> Cao jin <caoj.fnst@...fujitsu.com> wrote:
>>>>   
>>>>> Sorry for late.
>>>>> after reading all your comments, I think I will try the solution 1.
>>>>>
>>>>> On 12/13/2016 03:12 AM, Alex Williamson wrote:  
>>>>>> On Mon, 12 Dec 2016 21:49:01 +0800
>>>>>> Cao jin <caoj.fnst@...fujitsu.com> wrote:
>>>>>>     
>>>>>>> Hi,
>>>>>>> I have 2 solutions(high level design) came to me, please see if they are
>>>>>>> acceptable, or which one is acceptable. Also have some questions.
>>>>>>>
>>>>>>> 1. block guest access during host recovery
>>>>>>>
>>>>>>>    add new field error_recovering in struct vfio_pci_device to
>>>>>>>    indicate host recovery status. aer driver in host will still do
>>>>>>>    reset link
>>>>>>>
>>>>>>>    - set error_recovering in vfio-pci driver's error_detected, used to
>>>>>>>      block all kinds of user access(config space, mmio)
>>>>>>>    - in order to solve concurrent issue of device resetting & user
>>>>>>>      access, check device state[*] in vfio-pci driver's resume, see if
>>>>>>>      device reset is done, if it is, then clear"error_recovering", or
>>>>>>>      else new a timer, check device state periodically until device
>>>>>>>      reset is done. (what if device reset don't end for a long time?)
>>>>>>>    - In qemu, translate guest link reset to host link reset.
>>>>>>>      A question here: we already have link reset in host, is a second
>>>>>>>      link reset necessary? why?
>>>>>>>  
>>>>>>>    [*] how to check device state: reading certain config space
>>>>>>>        register, check return value is valid or not(All F's)    
>>>>>>
>>>>>> Isn't this exactly the path we were on previously?    
>>>>>
>>>>> Yes, it is basically the previous path, plus the optimization.
>>>>>  
>>>>>> There might be an
>>>>>> optimization that we could skip back-to-back resets, but how can you
>>>>>> necessarily infer that the resets are for the same thing? If the user
>>>>>> accesses the device between resets, can you still guarantee the guest
>>>>>> directed reset is unnecessary?  If time passes between resets, do you
>>>>>> know they're for the same event?  How much time can pass between the
>>>>>> host and guest reset to know they're for the same event?  In the
>>>>>> process of error handling, which is more important, speed or
>>>>>> correctness?
>>>>>>      
>>>>>
>>>>> I think vfio driver itself won't know what each reset comes for, and I
>>>>> don't quite understand why should vfio care this question, is this a new
>>>>> question in the design?  
>>>>
>>>> You're suggesting an optimization to eliminate one of the resets,
>>>> and as we've discussed, I don't see removing the host induced reset
>>>> as a viable option.  That means you want to eliminate the guest
>>>> directed reset.  There are potentially three levels to do that, the
>>>> vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
>>>> within the guest.  My comments were directed to the first option, the
>>>> host kernel level cannot correlate user directed resets as duplicates
>>>> of host directed resets.  
>>>>    
>>>
>>> Ah, maybe it is mistake, I don't really want to eliminate guest directed
>>> reset very much, I was just not sure why it is very necessary.
>>>
>>> The optimization I said just is fully separating host recovery from
>>> guest recovery(timer, check device periodically) in time, because there
>>> is concurrent device resetting & user access.
>>>   
>>>>> But I think it make sense that the user access during 2 resets maybe a
>>>>> trouble for guest recovery, misbehaved user could be out of our
>>>>> imagination.  Correctness is more important.
>>>>>
>>>>> If I understand you right, let me make a summary: host recovery just
>>>>> does link reset, which is incomplete, so we'd better do a complete guest
>>>>> recovery for correctness.  
>>>>
>>>> We don't know whether the host link reset is incomplete, but we can't do
>>>> a link reset transparently to the device, the device is no longer in the
>>>> same state after the reset.  The device specific driver, which exists
>>>> in userspace needs to be involved in device recovery.  Therefore
>>>> regardless of how QEMU handles the error, the driver within the guest
>>>> needs to be notified and perform recovery.  Since the device is PCI and
>>>> we're on x86 and nobody wants to introduce paravirtual error recovery,
>>>> we must use AER.  Part of AER recovery includes the possibility of
>>>> performing a link reset.  So it seems this eliminates avoiding the link
>>>> reset within the guest.
>>>>
>>>> That leaves QEMU.  Here we need to decide whether a guest triggered
>>>> link reset induces a host link reset.  The current working theory is
>>>> that yes, this must be the case.  If there is ever a case where a
>>>> driver within the guest could trigger a link reset for the purposes
>>>> of error recovery when the host has not, I think this must be the
>>>> case.  Therefore, at least some guest induced link resets must become
>>>> host link resets.  Currently we assume all guest induced link resets
>>>> become host link resets.  Minimally to avoid that, QEMU would need to
>>>> know (not assume) whether the host performed a link reset.  Even with
>>>> that, QEMU would need to be able to correlate that a link reset from
>>>> the guest is a duplicate of a link reset that was already performed by
>>>> the host.  That implies that QEMU needs to deduce the intention of
>>>> the guest.  That seems like a complicated task for a patch series that
>>>> is already complicated enough, especially for a feature of questionable
>>>> value given the configuration restrictions (imo).
>>>>
>>>> I would much rather focus on getting it right and making it as simple
>>>> as we can, even if that means links get reset one too many times on
>>>> error.
>>>>   
>>>
>>> Thanks very much for your detailed explanation, it does helps me to
>>> understand your concern, understand why a second link reset is necessary.
>>>
>>> I still want to share my thoughts with you(not argue): now we know host
>>> aer driver will do link reset for vfio-pci first, so I can say, even if
>>> fatal error is link related, after host link reset, link can work now.
>>> Then in qemu, we are not necessary to translate guest link reset to host
>>> link reset, just use vfio_pci_reset() as it is to do device
>>> reset(probably is FLR). Which also means we don't need following
>>> patch(make code easier):
>>>
>>> @@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev)
>>>
>>>       trace_vfio_pci_reset(vdev->vbasedev.name);
>>>
>>> +     if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
>>> +         PCIDevice *br = pci_bridge_get_device(pdev->bus);
>>> +
>>> +         if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>>> +              PCI_BRIDGE_CTL_BUS_RESET)) {
>>> +             if (pci_get_function_0(pdev) == pdev) {
>>> +                 vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>>> +             }
>>> +             return;
>>> +         }
>>> +     }
>>> +
>>>       vfio_pci_pre_reset(vdev);
>>>
>>>
>>> I think this also implies: we have a virtual link in qemu, but a virtual
>>> link will never be broken like a physical link.(In particular we already
>>> know host aer driver surely will do link reset to recover physical
>>> link). So, guest's link reset don't need to care whether virtual link is
>>> reset, just care virtual device.  And qemu "translates guest link reset
>>> to host link reset" seems kind of taking link-reset responsibility over
>>> from host:)
>>>   
>>>>>>> 2. skip link reset in aer driver of host kernel, for vfio-pci.
>>>>>>>    Let user decide how to do serious recovery
>>>>>>>
>>>>>>>    add new field "user_driver" in struct pci_dev, used to skip link
>>>>>>>    reset for vfio-pci; add new field "link_reset" in struct
>>>>>>>    vfio_pci_device to indicate link has been reset or not during
>>>>>>>    recovery
>>>>>>>
>>>>>>>    - set user_driver in vfio_pci_probe(), to skip link reset for
>>>>>>>      vfio-pci in host.
>>>>>>>    - (use a flag)block user access(config, mmio) during host recovery
>>>>>>>      (not sure if this step is necessary)
>>>>>>>    - In qemu, translate guest link reset to host link reset.
>>>>>>>    - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
>>>>>>>      is executed
>>>>>>>    - In vfio-pci driver's resume, new a timer, check "link_reset" field
>>>>>>>      periodically, if it is set in reasonable time, then clear it and
>>>>>>>      delete timer, or else, vfio-pci driver will does the link reset!    
>>>>>>
>>>>>> What happens in the case of a multifunction device where each function
>>>>>> is part of a separate IOMMU group and one function is hot-removed from
>>>>>> the user? We can't do a link reset on that function since the other
>>>>>> function is still in use.  We have no choice but release a device in an
>>>>>> unknown state back to the host.    
>>>>>
>>>>> hot-remove from user, do you mean, for example, all functions assigned
>>>>> to VM, then suddenly a person does something like following
>>>>>
>>>>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
>>>>>
>>>>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
>>>>>
>>>>> to return device to host driver, or don't bind it to host driver, let it
>>>>> in driver-less state???  
>>>>
>>>> Yes, the host kernel has no visiblity to how a user is making use of
>>>> devices.  To support AER we require a similar topology between host and
>>>> guest such that a guest link reset translates to a host reset.  That
>>>> requirement is imposed by userspace, ie. QEMU.  The host kernel cannot
>>>> presume that this is the case.  Therefore we could have a
>>>> multi-function device where each function is assigned to the same or
>>>> different users in any configuration.  If a fault occurs and we defer
>>>> to the user to perform the link reset, we have absolutely no guarantee
>>>> that it will ever occur.  If the functions are assigned to different
>>>> users, then each user individually doesn't have the capability to
>>>> perform a link reset.  If the devices happen to be assigned to a single
>>>> user when the error occurs, we cannot assume the user has an AER
>>>> compatible configuration, the devices could be exposed as separate
>>>> single function devices, any one of which might be individually removed
>>>> from the user and made use of by the host, such as your sysfs example
>>>> above.  The host cannot perform a link reset in this case either
>>>> as the sibling devices are still in use by the guest.  Thanks,
>>>>
>>>> Alex
>>>>
>>>>   
>>>
>>> this explanation is valuable to me, so this is also why we can't do link
>>> reset in vfio driver when one of the function is closed. And do link
>>> reset in vfio driver until all functions are close is poor solution and
>>> very complex(quarantine the device) as you said.
>>>
>>> I am going to try solution 1, but I still have some consideration share
>>> with you, this won't stop my trial, and don't have relationship with
>>> above discussion, just FYI:
>>>
>>> In non-virtuallization environment, from device's perspective, the steps
>>> of a normal recovery consists of:
>>>     error_detect
>>>     mmio_enabled
>>>     link_reset
>>>     slot_reset
>>>     resume
>>>
>>> Now in our condition, the steps become:
>>>     *link_reset* (host's, the following are guest's)
>>>     error_detect
>>>     mmio_enabled
>>>     link_reset
>>>     slot_reset
>>>     resume
>>>
>>> Especially, some device's specific driver in guest could do some
>>> specific work in error_detect, take igb_io_error_detected() for example.
>>> Like the words in pci-error-recovery.txt said:
>>>
>>> it gives the driver a chance to cleanup, waiting for pending stuff
>>> (timers, whatever, etc...) to complete;
>>>
>>> But if link_reset is the first step, we lost all the status(register
>>> value, etc) in the device. Of course I don't know if this will be a
>>> problem (might not), just curious if this has been your concern:)  
>>
>> You'll find I did mention it :)
>>
>> But consider Documentation/PCI/pcieaer-howto.txt
>>
>> 	3.2.2.2 Non-correctable (non-fatal and fatal) errors
>>
>> 	If an error message indicates a non-fatal error, performing link reset
>> 	at upstream is not required. The AER driver calls error_detected(dev,
>> 	pci_channel_io_normal) to all drivers associated within a hierarchy in
>> 	question. for example,
>> 	EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
>> 	If Upstream port A captures an AER error, the hierarchy consists of
>> 	Downstream port B and EndPoint.
>>
>> 	A driver may return PCI_ERS_RESULT_CAN_RECOVER,
>> 	PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
>> 	whether it can recover or the AER driver calls mmio_enabled as next.
>>
>> 	If an error message indicates a fatal error, kernel will broadcast
>> 	error_detected(dev, pci_channel_io_frozen) to all drivers within
>> 	a hierarchy in question. Then, performing link reset at upstream is
>> 	necessary.
>>
>> I think that if you just forward errors to guests they will get confused.
>> I see three possible approaches.
>>
>>
>> 1. Always pretend to guest that there was a fatal error,
>>   then basically:
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dce511f..4022f9b 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1299,7 +1299,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>  
>>  	vfio_device_put(device);
>>  
>> -	return PCI_ERS_RESULT_CAN_RECOVER;
>> +	return PCI_ERS_RESULT_DISCONNECT;
>>  }
>>  
>>  static const struct pci_error_handlers vfio_err_handlers = {
>>
>>
>> probably conditional on userspace invoking some ioctl
>> to avoid breaking existing users.
>>
>> 2. send non fatal error to guest.
>> Add another eventfd to distinguish non fatal and fatal errors.
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dce511f..e22f449 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1292,14 +1292,17 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>  
>>  	mutex_lock(&vdev->igate);
>>  
>> -	if (vdev->err_trigger)
>> +	if (state == pci_channel_io_normal && vdev->recover_trigger)
>> +		eventfd_signal(vdev->recover_trigger, 1);
>> +	else if (vdev->err_trigger)
>>  		eventfd_signal(vdev->err_trigger, 1);
>>  
>>  	mutex_unlock(&vdev->igate);
>>  
>>  	vfio_device_put(device);
>>  
>>  	return PCI_ERS_RESULT_CAN_RECOVER;
>>  }
>>  
>>  static const struct pci_error_handlers vfio_err_handlers = {
>>
>> Forward non fatal ones to guest, stop vm on fatal ones.
>>
>>
>>
>> 3. forward both non fatal and fatal error to guest
>> This includes 1 and 2 above, and
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index dce511f..4022f9b 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -1299,7 +1299,8 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>  
>>  	vfio_device_put(device);
>>  
>> -	return PCI_ERS_RESULT_CAN_RECOVER;
>> +	return state == pci_channel_io_normal : PCI_ERS_RESULT_CAN_RECOVER :
>> +		PCI_ERS_RESULT_DISCONNECT;
>>  }
>>  
>>  static const struct pci_error_handlers vfio_err_handlers = {
>> 	
>> Maybe make this conditional on recover_trigger to keep
>> compatibility.
>>
>>
>> You seem to be starting from 1. But how about starting small, and doing
>> 2 as a first step? Fatal errors will still stop vm.
>> This will help you merge a bunch of error reporting infrastructure
>> without worrying about recovery so much.
>>
>> Making some progress finally will be good.
>>
>>
>> Alex, what do you think?
> 
> Well, as you and I discussed on IRC, I'm concerned that none of us
> attempting to provide input or submit this code are really subject
> matter experts when it comes to AER.  We're all stumbling around
> trying to do what we think is right, but we're just taking stabs at how
> we assume drivers behave and what really needs to happen.  Personally
> I'm also not convinced that the current complexity involved and
> configuration restrictions that the path we're taking implies to the VM
> and usage of devices makes for a compelling feature.  At what point do
> we determine that none of this is worthwhile?
> 

Not sure I got your accurate thoughts, seems you still have some
concerns about the current configuration restrictions?

FYI: http://www.vmware.com/pdf/vsp_4_vmdirectpath_host.pdf
chaper: Problems with Device Assignment Dependencies

Seems vmware also adopt basically the same configuration restrictions as
us, they just don't mention AER, and the topology in guest, but I guess
they have the same restriction as us. So these configuration
restrictions isn't invented by us, it is a existed way, do we still need
to worry about it?

> My priorities are:
>   * We need to do the right thing for the guest, I don't think we
>     should be presuming that different reset types are equivalent,
>     leaving gaps where we expect the guest/host to do a reset and don't
>     follow through on other reset requests, and we need to notify the
>     guest immediately for the error.
>   * We need to do the right thing for the host, that means we should
>     not give the user the opportunity to leave a device in a state
>     where we haven't at least performed a bus reset on link error (this
>     may be our current state and if so we should fix it).
> 
> I do like the idea of taking a step back and trying to determine if
> there are some errors that we can handle incrementally from the
> existing behavior, but I also don't want to get into a state where a
> QEMU user needs to specify what degree of error handling they want and
> follow different configuration rules for each type.  If the feature
> isn't usable because the configuration requirements are too restrictive
> or too confusing, perhaps this is not a worthwhile feature.  Thanks,
> 
> Alex
> 

I got your concern, so, we reached an agreement on MST's approach 2?

-- 
Sincerely,
Cao jin