Message-ID: <4d7beb74-8958-2ec6-fcd6-1ac40e4d29cd@nvidia.com>
Date: Wed, 8 Jun 2022 15:42:43 +0530
From: Abhishek Sahu <abhsahu@...dia.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: Jason Gunthorpe <jgg@...dia.com>,
Cornelia Huck <cohuck@...hat.com>,
Yishai Hadas <yishaih@...dia.com>,
Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
Kevin Tian <kevin.tian@...el.com>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Max Gurtovoy <mgurtovoy@...dia.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
linux-pm@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
On 6/8/2022 3:20 AM, Alex Williamson wrote:
> On Fri, 3 Jun 2022 15:49:27 +0530
> Abhishek Sahu <abhsahu@...dia.com> wrote:
>
>> On 6/2/2022 11:14 PM, Alex Williamson wrote:
>>> On Thu, 2 Jun 2022 17:22:03 +0530
>>> Abhishek Sahu <abhsahu@...dia.com> wrote:
>>>
>>>> On 6/1/2022 9:51 PM, Alex Williamson wrote:
>>>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>>>> Abhishek Sahu <abhsahu@...dia.com> wrote:
>>>>>
>>>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:
>>>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>>>> Jason Gunthorpe <jgg@...dia.com> wrote:
>>>>>>>
>>>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
>>>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. In a real use case, config or any other ioctl should not come
>>>>>>>>>>> along with the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>>>
>>>>>>>>>>> 2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>>>> do any config space access or ioctl.
>>>>>>>>>>
>>>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>>>> is appropriate.
>>>>>>>>>>
>>>>>>>>>> Jason
>>>>>>>>>
>>>>>>>>> Thanks Jason for providing your inputs.
>>>>>>>>>
>>>>>>>>> In that case, should I introduce a new rw_semaphore (for example,
>>>>>>>>> power_lock) and move 'platform_pm_engaged' under 'power_lock'?
>>>>>>>>
>>>>>>>> Possibly, this is better than an atomic at least
>>>>>>>>
>>>>>>>>> 1. At the beginning of config space access or ioctl, we can take the
>>>>>>>>> lock
>>>>>>>>>
>>>>>>>>> down_read(&vdev->power_lock);
>>>>>>>>
>>>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>>>> suggesting with the atomic.
>>>>>>>>
>>>>>>>> trylock doesn't have lock ordering rules because it can't sleep, so it
>>>>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>>>
>>>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>>>> the request.
>>>>>>>>
>>>>>>
>>>>>> Thanks Alex and Jason for providing your inputs.
>>>>>>
>>>>>> Using down_read_trylock() along with Alex's suggestion seems fine.
>>>>>> In a real use case, config space access should not happen when the
>>>>>> device is in a low power state, so returning an error should not
>>>>>> cause any issue in this case.
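
For reference, a minimal sketch of that bail-out in the config path might
look like the following. This is hypothetical code based only on this
discussion, not the actual patch; the memory_lock rwsem and the
platform_pm_engaged flag are the names used in the earlier mails, and the
elided parts are intentionally left out:

```c
/* Hypothetical sketch only: fail a config access instead of waking a
 * runtime-suspended device. */
ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev,
			   char __user *buf, size_t count,
			   loff_t *ppos, bool iswrite)
{
	if (!down_read_trylock(&vdev->memory_lock))
		return -EAGAIN;	/* low power transition in progress */

	if (vdev->platform_pm_engaged) {
		/* device may be in D3cold; touching config space is unsafe */
		up_read(&vdev->memory_lock);
		return -EIO;
	}

	/* ... perform the actual config access here ... */

	up_read(&vdev->memory_lock);
	return count;
}
```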
>>>>>>
>>>>>>>>> down_write(&vdev->power_lock);
>>>>>>>>> ...
>>>>>>>>> switch (vfio_pm.low_power_state) {
>>>>>>>>> case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>>>>         ...
>>>>>>>>>         vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>>>>         vdev->power_state_d3 = true;
>>>>>>>>>         up_write(&vdev->memory_lock);
>>>>>>>>>
>>>>>>>>>         ...
>>>>>>>>> up_write(&vdev->power_lock);
>>>>>>>>
>>>>>>>> And something checks the power lock before allowing the memory to be
>>>>>>>> re-enabled?
>>>>>>>>
>>>>>>>>> 4. For ioctl access, as mentioned previously, I need to add two
>>>>>>>>> callback functions (one for start and one for end) in struct
>>>>>>>>> vfio_device_ops and call them at the start and end of the ioctl
>>>>>>>>> from vfio_device_fops_unl_ioctl().
>>>>>>>>
>>>>>>>> Not sure I followed this..
>>>>>>>
>>>>>>> I'm kinda lost here too.
>>>>>>
>>>>>>
>>>>>> I have summarized the things below
>>>>>>
>>>>>> 1. In the current patch (v3 8/8), if a config space access or ioctl
>>>>>> was made by the user when the device was already in a low power
>>>>>> state, then it would wake the device. This wake-up was happening via
>>>>>> the pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>>>> vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>>>
>>>>>> 2. Now, it has been decided to return an error instead of waking the
>>>>>> device if the device is already in a low power state.
>>>>>>
>>>>>> 3. Initially I thought to add the following code in the config space
>>>>>> path (and similar in the ioctl path):
>>>>>>
>>>>>> vfio_pci_config_rw() {
>>>>>>         ...
>>>>>>         down_read(&vdev->memory_lock);
>>>>>>         if (vdev->platform_pm_engaged) {
>>>>>>                 up_read(&vdev->memory_lock);
>>>>>>                 return -EIO;
>>>>>>         }
>>>>>>         ...
>>>>>> }
>>>>>>
>>>>>> And then there was a possibility that a physical config access
>>>>>> happens while the device is in D3cold in case of a race condition.
>>>>>>
>>>>>> 4. So, I wanted to add some mechanism so that the low power entry
>>>>>> ioctl will be serialized with the other ioctls and config space
>>>>>> accesses. With this, if low power entry gets scheduled first, then
>>>>>> config/other ioctls will fail; otherwise, low power entry will wait.
>>>>>>
>>>>>> 5. For serializing this access, I need to ensure that the lock is
>>>>>> held throughout the operation. For config space, I can add the code
>>>>>> in vfio_pci_config_rw(). But for ioctls, I was not sure what the
>>>>>> best way is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>>>> VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are handled in the vfio
>>>>>> core layer itself.
>>>>>>
>>>>>> The memory_lock and the variables that track low power are specific
>>>>>> to vfio-pci, so I need some mechanism by which I can add the low
>>>>>> power check for each ioctl. For serialization, I need to call a
>>>>>> function implemented in vfio-pci before the vfio core layer performs
>>>>>> the actual ioctl, to grab the locks, and similarly release the lock
>>>>>> once the vfio core layer has finished the actual ioctl. I have
>>>>>> mentioned this problem above (point 4 in my earlier mail).
>>>>>>
>>>>>>> A couple replies back there was some concern
>>>>>>> about race scenarios with multiple user threads accessing the device.
>>>>>>> The ones concerning non-deterministic behavior if a user is
>>>>>>> concurrently changing power state and performing other accesses are a
>>>>>>> non-issue, imo.
>>>>>>
>>>>>> What does non-deterministic behavior mean here?
>>>>>> Is it on the user side, where the user will see different results
>>>>>> (failure or success) during a race condition, or on the kernel side
>>>>>> (as explained in point 3 above, where a physical config access
>>>>>> happens while the device is in D3cold)? My concern is the latter
>>>>>> case, where a config space access in D3cold can cause a fatal error
>>>>>> on the system side, as we have seen for memory disablement.
>>>>>
>>>>> Yes, our only concern should be to prevent such an access. The user
>>>>> seeing non-deterministic behavior (during concurrent power control
>>>>> and config space access, all combinations of success/failure are
>>>>> possible) is par for the course when we decide to block accesses
>>>>> across the life of the low power state.
>>>>>
>>>>>>> I think our goal is only to expand the current
>>>>>>> memory_lock to block accesses, including config space, while the device
>>>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>>>
>>>>>>> I think the remaining issue is how to do that relative to the fact
>>>>>>> that config space access can change the memory enable state and would
>>>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>>>> write-lock, and re-test the low power state. If it has changed, that
>>>>>>> suggests the user has again raced changing power state with another
>>>>>>> access and we can simply drop the lock and return -EIO.
>>>>>>>
>>>>>>
>>>>>> Yes. This looks like the better option. So, just to confirm, I can
>>>>>> take the memory_lock read-lock at the start of vfio_pci_config_rw()
>>>>>> and release it just before returning from vfio_pci_config_rw(); for
>>>>>> memory-related config access, we will release this lock and
>>>>>> re-acquire the write version of it. Once the memory write happens,
>>>>>> can we then downgrade this write-lock to a read-lock?
>>>>>
>>>>> We only need to lock for the device access, so if you've finished that
>>>>> access after acquiring the write-lock, there'd be no point to then
>>>>> downgrade that to a read-lock. The access should be finished by that
>>>>> point.
>>>>>
>>>>
>>>> I was planning to take the memory_lock read-lock at the beginning of
>>>> vfio_pci_config_rw() and release the same just before returning from
>>>> this function. If I don't downgrade it back to a read-lock, then the
>>>> release at the end will be called on a lock which has not been taken.
>>>> Also, the user can specify a count of any number of bytes, in which
>>>> case vfio_config_do_rw() will be invoked multiple times and the
>>>> second call will then run without the lock.
>>>
>>> Ok, yes, I can imagine how it might result in a cleaner exit path to do
>>> a downgrade_write().
>>>
>>>>>> Also, what about ioctls? How can I take and release memory_lock for
>>>>>> an ioctl? Is it okay to go with patch 7, where we call
>>>>>> pm_runtime_resume_and_get() before each ioctl, or do we need to do
>>>>>> the same low power check for ioctls as well?
>>>>>> In the latter case, I am not sure how I should do the implementation
>>>>>> so that all the other ioctls are covered from the vfio core layer
>>>>>> itself.
>>>>>
>>>>> Some ioctls clearly cannot occur while the device is in low power, such
>>>>> as resets and interrupt control, but even less obvious things like
>>>>> getting region info require device access. Migration also provides a
>>>>> channel to device access. Do we want to manage a list of ioctls that
>>>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>>>> low power?
>>>>>
>>>>
>>>> In the previous version of this patch, you mentioned that a safe
>>>> ioctl list would be tough to maintain. So, currently we want to allow
>>>> only the ioctl for low power exit.
>>>
>>> Yes, I'm still conflicted in how that would work.
>>>
>>>>> I'm also still curious how we're going to handle devices that cannot
>>>>> return to low power such as the self-refresh mode on the GPU. We can
>>>>> potentially prevent any wake-ups from the vfio device interface, but
>>>>> that doesn't preclude a wake-up via an external lspci. I think we need
>>>>> to understand how we're going to handle such devices before we can
>>>>> really complete the design. AIUI, we cannot disable the self-refresh
>>>>> sleep mode without imposing unreasonable latency and memory
>>>>> requirements on the guest and we cannot retrigger the self-refresh
>>>>> low-power mode without non-trivial device specific code. Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>> I am working on adding support to notify the guest through a virtual
>>>> PME whenever there is any wake-up triggered by the host and the guest
>>>> has already put the device into the runtime suspended state. This
>>>> virtual PME will be similar to a physical PME. Normally, if a PCI
>>>> device needs a power management transition, it sends a PME event
>>>> which will ultimately be handled by the host OS. In the virtual PME
>>>> case, if the host needs a power management transition, it sends an
>>>> event to the guest and the guest OS handles these virtual PME events.
>>>> Following is a summary:
>>>>
>>>> 1. Add support for one more event, like VFIO_PCI_ERR_IRQ_INDEX,
>>>> named VFIO_PCI_PME_IRQ_INDEX, and add the required code for this
>>>> virtual PME event.
>>>>
>>>> 2. From the guest side, when the PME_IRQ is enabled, we will set the
>>>> event_fd for PME.
>>>>
>>>> 3. In the vfio driver, the PME support bits are already virtualized
>>>> and currently set to 0. We can set PME capability support for D3cold
>>>> so that in the guest it looks like:
>>>>
>>>> Capabilities: [60] Power Management version 3
>>>> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>>>> PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>>>
>>>> 4. From the guest side, it can do a PME enable (PME_En bit in the
>>>> Power Management Control/Status Register), which will again be
>>>> virtualized.
>>>>
>>>> 5. When the host gets a request to resume the device from something
>>>> other than the low power ioctl, then the device pm usage count will
>>>> be incremented, the PME status (PME_Status bit in the Power
>>>> Management Control/Status Register) will be set, and then we can
>>>> signal the event_fd.
>>>>
>>>> 6. In PCIe, the PME events will be handled by the root port. For
>>>> using the low power D3cold feature, it is required to create a
>>>> virtual root port on the hypervisor side, and when the hypervisor
>>>> receives this PME event, it can send a virtual interrupt to the
>>>> root port.
>>>>
>>>> 7. If we take the example of the Linux kernel, then pcie_pme_irq()
>>>> will handle this and then do the runtime resume on the guest side.
>>>> Also, it will clear the PME status bit here. Then the guest can put
>>>> the device into the suspended state again.
>>>>
>>>> 8. I prototyped the above logic in QEMU and was getting a wake-up in
>>>> the guest whenever I did lspci on the host side.
>>>>
>>>> 9. Since currently only the nvidia GPU has this limitation of
>>>> requiring driver interaction each time before going into D3cold, we
>>>> can allow re-entry for other devices. We can match on the nvidia
>>>> vendor ID (along with the VGA/3D controller class code). In the
>>>> future, if any other device has a similar requirement, then we can
>>>> update this list. For other devices, the host can put the device
>>>> back into D3cold after any wake-up.
>>>>
>>>> 10. In the vfio driver, we can impose all these restrictions for
>>>> enabling PME and return an error if the user tries to make the low
>>>> power entry ioctl without enabling the PME related things.
>>>>
>>>> 11. The virtual PME can also help in handling physical PME for all
>>>> devices. The PME logic is not dependent upon the nvidia GPU
>>>> restriction. If virtual PME is enabled by the hypervisor, then when
>>>> a physical PME wakes the device, it will resume on the guest side
>>>> as well.
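
A rough sketch of step 5 might look like the following. This is
hypothetical only: pme_trigger and pm_cap_offset are invented names for
illustration (only PCI_PM_CTRL, PCI_PM_CTRL_PME_STATUS, and the runtime
PM / eventfd calls are existing kernel APIs), and none of this is the
actual proposed implementation:

```c
/* Hypothetical sketch of step 5: on a resume that was not initiated by
 * the low power ioctl, keep the device resumed, latch the virtual
 * PME_Status bit in the virtualized PM capability, and signal the
 * eventfd the user registered for the proposed VFIO_PCI_PME_IRQ_INDEX. */
static void vfio_pci_signal_virtual_pme(struct vfio_pci_core_device *vdev)
{
	/* pm_cap_offset: assumed offset of the virtualized PM capability */
	u8 *pm_ctrl = vdev->vconfig + vdev->pm_cap_offset + PCI_PM_CTRL;

	pm_runtime_get_noresume(&vdev->pdev->dev);	/* bump pm usage count */
	*(__le16 *)pm_ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);

	if (vdev->pme_trigger)	/* eventfd armed by userspace, step 2 */
		eventfd_signal(vdev->pme_trigger, 1);
}
```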
>>>
>>> So if host accesses through things like lspci are going to wake the
>>> device and we can't prevent that, and the solution to that is to notify
>>> the guest to put the device back to low power, then it seems a lot less
>>> important to try to prevent the user from waking the device through
>>> random accesses. In that context, maybe we do simply wrap all accesses
> with pm_runtime_get/put() calls, which eliminates the problem of
>>> maintaining a list of safe ioctls in low power.
>>>
>>
>> So wrapping all accesses with pm_runtime_get()/put() will only be
>> applicable for ioctls, correct?
>> For config space, can we go with the approach discussed earlier, in
>> which we return an error?
>
> If we need to handle arbitrarily induced wakes from the host, it
> doesn't make much sense to restrict those same sort of accesses by the
> user through the vfio-device. It also seems a lot easier to simply do
> a pm_get/put() around not only ioctls, but all region accesses to avoid
> the sorts of races you previously identified. Access through mmap
> should still arguably fault given that there is no discrete end to such
> an access like we have for read/write operations. Thanks,
>
> Alex
>
Thanks Alex for confirming.
I will do the same.
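
For the archives, a minimal sketch of what that wrapping might look like
for a region read (hypothetical; loosely modeled on the vfio-pci core
read path, with vfio_pci_rw standing in for the internal helper and error
handling condensed):

```c
/* Hypothetical sketch: resume the device around every region access so
 * a read can never race against a runtime-suspended device. */
ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
			   size_t count, loff_t *ppos)
{
	struct vfio_pci_core_device *vdev =
		container_of(core_vdev, struct vfio_pci_core_device, vdev);
	ssize_t ret;

	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
	if (ret < 0)
		return ret;

	ret = vfio_pci_rw(vdev, buf, count, ppos, false);

	pm_runtime_put(&vdev->pdev->dev);
	return ret;
}
```

mmap faults would be handled separately, as Alex notes, since a mapped
access has no discrete end around which to place the put.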
Regards,
Abhishek