lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 8 Feb 2022 14:26:24 -0700
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Abhishek Sahu <abhsahu@...dia.com>
Cc:     kvm@...r.kernel.org, Cornelia Huck <cohuck@...hat.com>,
        Max Gurtovoy <mgurtovoy@...dia.com>,
        Yishai Hadas <yishaih@...dia.com>,
        Zhen Lei <thunder.leizhen@...wei.com>,
        Jason Gunthorpe <jgg@...dia.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] vfio/pci: fix memory leak during D3hot to D0
 transition

On Tue, 8 Feb 2022 00:50:08 +0530
Abhishek Sahu <abhsahu@...dia.com> wrote:

> On 2/2/2022 5:01 AM, Alex Williamson wrote:
> > On Tue, 1 Feb 2022 17:06:43 +0530
> > Abhishek Sahu <abhsahu@...dia.com> wrote:
> >   
> >> On 2/1/2022 1:41 AM, Alex Williamson wrote:  
> >>> On Mon, 31 Jan 2022 16:54:50 +0530
> >>> Abhishek Sahu <abhsahu@...dia.com> wrote:
> >>>  
> >>>> If needs_pm_restore is set (PCI device does not have support for no
> >>>> soft reset), then the current PCI state will be saved during D0->D3hot
> >>>> transition and same will be restored back during D3hot->D0 transition.
> >>>> For saving the PCI state locally, pci_store_saved_state() is being
> >>>> used and the pci_load_and_free_saved_state() will free the allocated
> >>>> memory.
> >>>>
> >>>> But for reset related IOCTLs, vfio driver calls PCI reset related
> >>>> API's which will internally change the PCI power state back to D0. So,
> >>>> when the guest resumes, then it will get the current state as D0 and it
> >>>> will skip the call to vfio_pci_set_power_state() for changing the
> >>>> power state to D0 explicitly. In this case, the memory pointed by
> >>>> pm_save will never be freed. In a malicious sequence, the state changing
> >>>> to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be
> >>>> run in a loop and it can cause an OOM situation.
> >>>>
> >>>> Also, pci_pm_reset() returns -EINVAL if we try to reset a device that
> >>>> isn't currently in D0. Therefore any path where we're triggering a
> >>>> function reset that could use a PM reset and we don't know if the device
> >>>> is in D0, should wake up the device before we try that reset.
> >>>>
> >>>> This patch changes the device power state to D0 by invoking
> >>>> vfio_pci_set_power_state() before calling reset related API's.
> >>>> It will help in fixing the mentioned memory leak and making sure
> >>>> that the device is in D0 during reset. Also, to prevent any similar
> >>>> memory leak for future development, this patch frees memory first
> >>>> before overwriting 'pm_save'.
> >>>>
> >>>> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
> >>>> ---
> >>>>
> >>>> * Changes in v2
> >>>>
> >>>> - Add the Fixes tag and sent this patch independently.
> >>>> - Invoke vfio_pci_set_power_state() before invoking reset related API's.
> >>>> - Removed saving of power state locally.
> >>>> - Removed warning before 'kfree(vdev->pm_save)'.
> >>>> - Updated comments and commit message according to updated changes.
> >>>>
> >>>> * v1 of this patch was sent in
> >>>> https://lore.kernel.org/lkml/20220124181726.19174-4-abhsahu@nvidia.com/
> >>>>
> >>>>  drivers/vfio/pci/vfio_pci_core.c | 27 +++++++++++++++++++++++++++
> >>>>  1 file changed, 27 insertions(+)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >>>> index f948e6cd2993..d6dd4f7c4b2c 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_core.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >>>> @@ -228,6 +228,13 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
> >>>>       if (!ret) {
> >>>>               /* D3 might be unsupported via quirk, skip unless in D3 */
> >>>>               if (needs_save && pdev->current_state >= PCI_D3hot) {
> >>>> +                     /*
> >>>> +                      * If somehow, the vfio driver was not able to free the
> >>>> +                      * memory allocated in pm_save, then free the earlier
> >>>> +                      * memory first before overwriting pm_save to prevent
> >>>> +                      * memory leak.
> >>>> +                      */
> >>>> +                     kfree(vdev->pm_save);
> >>>>                       vdev->pm_save = pci_store_saved_state(pdev);
> >>>>               } else if (needs_restore) {
> >>>>                       pci_load_and_free_saved_state(pdev, &vdev->pm_save);
> >>>> @@ -322,6 +329,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
> >>>>       /* For needs_reset */
> >>>>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
> >>>>
> >>>> +     /*
> >>>> +      * This function can be invoked while the power state is non-D0,
> >>>> +      * Change the device power state to D0 first.  
> >>>
> >>> I think we need to describe more why we're doing this than what we're
> >>> doing.  We need to make sure the device is in D0 in case we have a
> >>> reset method that depends on that directly, ex. pci_pm_reset(), or
> >>> possibly device specific resets that may access device BAR resources.
> >>> I think it's placed here in the function so that the config space
> >>> changes below aren't overwritten by restoring the saved state and maybe
> >>> also because the set_irqs_ioctl() call might access device MMIO space.
> >>>  
> >>
> >>  Thanks Alex.
> >>  I will add more details here in the comment.
> >>  
> >>>> +      */
> >>>> +     vfio_pci_set_power_state(vdev, PCI_D0);
> >>>> +
> >>>>       /* Stop the device from further DMA */
> >>>>       pci_clear_master(pdev);
> >>>>
> >>>> @@ -921,6 +934,13 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >>>>                       return -EINVAL;
> >>>>
> >>>>               vfio_pci_zap_and_down_write_memory_lock(vdev);
> >>>> +
> >>>> +             /*
> >>>> +              * This function can be invoked while the power state is non-D0,
> >>>> +              * Change the device power state to D0 before doing reset.
> >>>> +              */  
> >>>
> >>> See below, reconsidering this...
> >>>  
> >>>> +             vfio_pci_set_power_state(vdev, PCI_D0);
> >>>> +
> >>>>               ret = pci_try_reset_function(vdev->pdev);
> >>>>               up_write(&vdev->memory_lock);
> >>>>
> >>>> @@ -2055,6 +2075,13 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >>>>       }
> >>>>       cur_mem = NULL;
> >>>>
> >>>> +     /*
> >>>> +      * This function can be invoked while the power state is non-D0.
> >>>> +      * Change power state of all devices to D0 before doing reset.
> >>>> +      */  
> >>>
> >>> Here I have trouble convincing myself exactly what we're doing.  As you
> >>> note in patch 1/ of the RFC series, pci_reset_bus(), or more precisely
> >>> pci_dev_save_and_disable(), wakes the device to D0 before reset, so we
> >>> can't be doing this only to get the device into D0.  The function level
> >>> resets do the same.
> >>>
> >>> Actually, now I'm remembering and debugging where I got myself confused
> >>> previously with pci_pm_reset().  The scenario was a Windows guest with
> >>> an assigned Intel 82574L NIC.  When doing a shutdown from the guest the
> >>> device is placed in D3hot and we enter vfio_pci_core_disable() in that
> >>> state.  That function however uses __pci_reset_function_locked(), which
> >>> skips the pci_dev_save_and_disable() since much of it is redundant for
> >>> that call path (I think I generalized this to all flavors of
> >>> pci_reset_function() in my head).  
> >>
> >>  Thanks for providing the background related with the original issue.
> >>  
> >>>
> >>> The standard call to pci_try_reset_function(), as in the previous
> >>> chunk, will make use of pci_dev_save_and_disable(), so for either of
> >>> these latter cases the concern cannot be simply having the device in D0,
> >>> we need a reason that we want the previously saved state restored on the
> >>> device before the reset, and thus restored to the device after the
> >>> reset as the rationale for the change.
> >>>  
> >>
> >>  I will add this as a comment.
> >>  
> >>>> +     list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> >>>> +             vfio_pci_set_power_state(cur, PCI_D0);
> >>>> +
> >>>>       ret = pci_reset_bus(pdev);
> >>>>
> >>>>  err_undo:  
> >>>
> >>>
> >>> We also call pci_reset_bus() in vfio_pci_dev_set_try_reset().  In that
> >>> case, none of the other devices can be in use by the user, but they can
> >>> certainly be in D3hot with previous device state saved off into our
> >>> pm_save cache.  If we don't have a good reason to restore in that case,
> >>> I'm wondering if we really have a good reason to restore in the above
> >>> two cases.
> >>>
> >>> Perhaps we just need the first chunk above to resolve the memory leak,  
> >>
> >>  First chunk means only the changes done in vfio_pci_set_power_state()
> >>  which is calling kfree() before calling pci_store_saved_state().
> >>  Or I need to include more things in the first patch ?  
> > 
> > Correct, first chunk as is the first change in the patch.  Patch chunks
> > are delineated by the @@ offset lines.
> >   
> 
>  Thanks for confirming this.
> 
> >>
> >>  With the kfree(), the original memory leak issue should be solved.
> >>  
> >>> and the second chunk as a separate patch to resolve the issue with
> >>> devices entering vfio_pci_core_disable() in non-D0 state.  Sorry if I  
> >>
> >>  And this second patch will contain rest of the things where
> >>  we will call vfio_pci_set_power_state() explicitly for moving to
> >>  D0 state ?  
> > 
> > At least the first one in vfio_pci_core_disable(), the others need
> > justification.
> >   
> 
>  Yes. First one is needed.
> 
> >>  Also, We need to explore if setting to D0 state is really required at
> >>  all these places and If it is not required, then we don't need second
> >>  patch ?  
> > 
> > We need a second patch, I'm convinced that we don't otherwise wake the
> > device to D0 before we potentially get to pci_pm_reset() in
> > vfio_pci_core_disable().  It's the remaining cases of setting D0 that
> > I'm less clear on.  If it's the case that we need to restore config
> > space any time a NoSoftRst- device is woken from D3hot and the state
> > saved and restored around the reset is meaningless otherwise, that's a
> > valid justification, but is it accurate?  If so, we should recheck the
> > other case of calling pci_reset_bus() too.  Thanks,
> > 
> > Alex
> >   
> 
>  I was analyzing this part in detail and added some debug prints and
>  made user space program to understand it better. Also, I have gone
>  through the patch 51ef3a004b1e (“vfio/pci: Restore device state on
>  PM transition”). 
> 
>  We have 2 cases here:
> 
>  1. The devices which has NoSoftRst+  (needs_pm_restore is false).
>     This case should work fine for all the cases (Apart from vfio_pci_core_disable())
>     without waking-up the device explicitly.
> 
>  2. The devices which has NoSoftRst- (needs_pm_restore is true). 
> 
>  For case 2, let’s consider following example:
> 
>  a. The device is in D3hot. 
>  b. User made VFIO_DEVICE_RESET ioctl.
>  c. pci_try_reset_function() will be called which internally
>     invokes pci_dev_save_and_disable().
>  d. pci_set_power_state(dev, PCI_D0) will be called first.
>  e. pci_save_state() will happen then.
> 
>  Now, for the devices which has NoSoftRst-,
>  the pci_set_power_state() should trigger soft reset and
>  we may lose the original state at step (d) and this state
>  cannot be restored.
> 
>  For example, lets assume the case, where SBIOS or host
>  linux kernel (In the aspm.c) enables PCIe LTR setting for the
>  PCIe device. When this soft reset will be triggered, then this
>  LTR setting may be reset, and the device state saved at step (e)
>  will also have this setting cleared so it cannot be restored.
>  Same thing can be happened for other PCIe capabilities. Since the
>  vfio driver only exposes limited enhanced capabilities to its user
>  So, the vfio-driver user also won’t have option to save and
>  restore these capabilities state and these original
>  settings will be permanently lost. 

Yes, this is my concern, thanks for confirming.

>  So, it seems we need to always move the device explicitly to
>  D0 state by calling vfio_pci_set_power_state() before
>  any reset for the reset triggered by IOCTLs. This is mainly to
>  preserve the state around soft reset.

The other option would be to test for vdev->pm_save and if found do a
load-and-free + restore-state after reset.  For simplicity, I'd tend to
favor your approach to wake the device with vfio_pci_set_power_state()
before reset.  Either approach seems roughly equal to me.

>  For vfio_pci_dev_set_try_reset() also, we can have the above
>  mentioned situation. The other functions/devices can be in D3hot
>  state and the D0 transition can cause soft reset there also.
>  For example, in my case, NVIDIA GPU has VGA (func 0) and audio
>  (func 1) function. I added debug print to dump the current state
>  before and after pci_reset_bus(). Before pci_reset_bus() the func 1
>  state was D3hot and after pci_reset_bus() the func 1 state got
>  changed to D0. This pci_reset_bus() was called during closing of
>  func 0 device so there are chances of soft reset for other 
>  function/devices.

Ok, so we'll always wakeup devices for both the pci_reset_function()
class of resets and bus resets.  This seems to logically fit with the
fix to wakeup the device on release, so shall we do patch 1 of the
fixes series includes the kfree only and patch 2 resolves all the cases
of waking devices before reset?  Thanks,

Alex

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ