Message-ID: <5f54794e-0278-e665-7a3c-c0aa7655002e@huawei.com>
Date: Tue, 15 Nov 2022 09:38:22 +0800
From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)"
<longpeng2@...wei.com>
To: Leon Romanovsky <leon@...nel.org>
CC: <bhelgaas@...gle.com>, <linux-pci@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <jianjay.zhou@...wei.com>,
<zhuangshengen@...wei.com>, <arei.gonglei@...wei.com>,
<yechuan@...wei.com>, <huangzhichao@...wei.com>,
<xiehong@...wei.com>
Subject: Re: [RFC 0/4] pci/sriov: support VFs dynamic addition
On 2022/11/14 22:20, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 10:06:49PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> On 2022/11/14 21:09, Leon Romanovsky wrote:
>>> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>>
>>>>
>>>> On 2022/11/14 15:04, Leon Romanovsky wrote:
>>>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>>>> Hi leon,
>>>>>>
>>>>>> On 2022/11/12 0:39, Leon Romanovsky wrote:
>>>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>>>>>> From: Longpeng <longpeng2@...wei.com>
>>>>>>>>
>>>>>>>> We can enable SRIOV and add VFs via /sys/bus/pci/devices/..../sriov_numvfs, but
>>>>>>>> this operation takes a long time if there are a large number of VFs.
>>>>>>>> For example, if the machine has 10 PFs and 250 VFs per PF, enabling all the VFs
>>>>>>>> concurrently costs about 200-250ms. However, most of them do not need to be
>>>>>>>> used at that moment, so we can enable SRIOV first and add VFs on demand.
>>>>>>>
>>>>>>> It is unclear what took 200-250ms, is it physical VF creation or bind of
>>>>>>> the driver to these VFs?
>>>>>>>
>>>>>> It is neither. In our test, we already created physical VFs before, so we
>>>>>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
>>>>>> probes PF, it just returns an error if the function is VF.
>>>>>
>>>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
>>>>> true, It won't even try to probe VFs.
>>>>>
>>>>>>
>>>>>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
>>>>>> long procedure. Each step costs only a little, but the total cost is not
>>>>>> acceptable in some time-sensitive cases.
>>>>>
>>>>> This is also cryptic to me. In standard SR-IOV deployment, all VFs are
>>>>> created and configured while operator booted the machine with sriov_drivers_autoprobe
>>>>> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
>>>>> through orchestration SW (IMHO, it is supported by all orchestration SW).
>>>>>
>>>>> And only last part (assigning to users) is time-sensitive operation.
>>>>>
>>>> The VF creation and configuration are also time-sensitive in some cases, for
>>>> example, the hypervisor live update case (such as [1]):
>>>> save VMs -> kexec -> restore VMs
>>>>
>>>> After the new kernel starts, the VFs must be added into the system, and then
>>>> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs
>>>> at once and increase the downtime.
>>>>
>>>> If we can enable the VFs that are used by existing VMs then restore the VMs
>>>> and enable other unused VFs at last, the downtime would be significantly
>>>> reduced.
>>>>
>>>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
>>>
>>> Like it is written in presentation, the standard way of doing it is done
>>> by VFIO live migration feature, where 2K+ VMs are migrated to another server
>>> at the time first server is scheduled for maintenance.
>>>
>> Live migration is not the best choice in a production environment; it's too
>> heavy. Some cloud providers prefer to use hypervisor live update in their
>> systems, such as AWS's Nitro hypervisor.
>
> How is AWS nitro relevant to our discussion about adding sysfs file to Linux?
> Can you please point us to the source code of that hypervisor? Does it even
> run on Linux?
>
Um... you can search for more information about the AWS Nitro System.
Yes, this is a digression, so let's get back to the discussion about adding
the sysfs file.
> Anyway, I'm aware of big cloud providers who are pretty happy with live
> migration in production.
>
We're having trouble coming to an agreement on this point, but it doesn't
matter. Please see below.
>>
>>> However, even in live update case mentioned in the presentation, you
>>> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
>>> so you don't need per-VF id enable knob.
>>>
>> The presentation is just a reference, some points could be optimized
>> including disable PFs/VFs and enable PFs/VFs.
>>
>> Hypervisor live update can finish in less than 1 second, so the cost of
>> disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worse) is too
>> high.
>>
>>>>
>>>>>>
>>>>>> What’s more, sriov_add_vfs adds the VFs of a PF one by one. So we can
>>>>>> support at most 10 concurrent calls if there are 10 PFs.
>>>>>
>>>>> I wondered, are you using real HW? or QEMU SR-IOV? What is your server
>>>>> that supports such large number of VFs?
>>>>>
>>>> Physical device. Some devices in the market support the large number of VFs,
>>>> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV
>>>> software should keep pace with times too.
>>>
>>> Our devices (and Intel too) support many VFs too. The thing is that
>>> servers are unlikely to be able to support 10 physical devices with 2K+
>>> VFs. There are many limitations that make such a setup unusable,
>>> like the global MSI-X pool and the PCI bandwidth needed to support all these devices.
>>>
>>>>
>>>>> BTW, Your change will probably break all SR-IOV devices in the market as
>>>>> they rely on PCI subsystem to have VFs ready and configured.
>>>>>
>>>> I see, but maybe this change could be a choice for some users.
>>>
>>> It should come with relevant driver changes and very strong justification why
>>> such functionality is needed now and can't be achieved by anything else
>>> except user-facing sysfs.
>>>
>> Adding 2K+ VFs to sysfs takes too much time.
>>
>> Look at the bottom half of the hypervisor live update:
>> kexec --> add 2K VFs --> restore VMs
>>
>> The downtime can be reduced if the sequence is:
>> kexec --> add 100 VFs (the ones the VMs use) --> restore VMs --> add 1.9K VFs
>
> Addition of VFs is serial operation, you can fire your VMs once you
> counted 100 VFs in sysfs directory.
>
With the current implementation, VFs can only be added in index order, so
this cannot work properly.
For example, the VM uses VF200, VF202 and VF204, but sriov_add_vfs can
only add VFs in the order VF0, VF1, VF2 ... The limitation is imposed
by the software, not by the PCI spec.
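
To make the ordering argument concrete, here is a minimal user-space sketch. It is not kernel code and not part of the RFC. It assumes that each added VF shows up as a virtfnN entry in the PF's sysfs directory. The helper names (present_vfs, vfs_ready, sequential_adds_required) are hypothetical, and the demo uses a temporary directory in place of a real PF directory.

```python
import os
import re
import tempfile

# A PF's sysfs directory exposes one "virtfnN" entry per added VF.
VIRTFN_RE = re.compile(r"^virtfn(\d+)$")


def present_vfs(pf_dir):
    """Return the set of VF indices that have a virtfnN entry under pf_dir."""
    found = set()
    for name in os.listdir(pf_dir):
        match = VIRTFN_RE.match(name)
        if match:
            found.add(int(match.group(1)))
    return found


def vfs_ready(pf_dir, needed):
    """True only if every VF index in `needed` is already present."""
    return set(needed) <= present_vfs(pf_dir)


def sequential_adds_required(needed):
    """Number of VFs that must exist before all of `needed` do,
    when VFs are added strictly in the order VF0, VF1, VF2, ..."""
    return max(needed) + 1 if needed else 0


def demo():
    needed = {200, 202, 204}  # the VFs used by the VM in the example above
    with tempfile.TemporaryDirectory() as pf_dir:  # stand-in for a PF directory
        for i in range(100):  # pretend the first 100 VFs have been added
            open(os.path.join(pf_dir, f"virtfn{i}"), "w").close()
        print("VFs present:", len(present_vfs(pf_dir)))
        print("VM can start:", vfs_ready(pf_dir, needed))
    print("in-order adds needed:", sequential_adds_required(needed))
    print("on-demand adds needed:", len(needed))


if __name__ == "__main__":
    demo()
```

With in-order addition, counting 100 VFs in sysfs says nothing about VF200, VF202 and VF204: 205 VFs have to be added before the VM can be restored, while on-demand addition would need only 3.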
>>
>>
>>> I don't see anything in this presentation and discussion that supports
>>> need of such UAPI.
>>>
>>> Thanks
>>>
>>>>
>>>>> Thanks
>>>>> .
>>> .
> .