linux-kernel - Re: [RFC 0/4] pci/sriov: support VFs dynamic addition

Message-ID: <3a8efc92-eda8-9c61-50c5-5ec97e2e2342@huawei.com>
Date:   Mon, 14 Nov 2022 22:06:49 +0800
From:   "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)" 
        <longpeng2@...wei.com>
To:     Leon Romanovsky <leon@...nel.org>
CC:     <bhelgaas@...gle.com>, <linux-pci@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <jianjay.zhou@...wei.com>,
        <zhuangshengen@...wei.com>, <arei.gonglei@...wei.com>,
        <yechuan@...wei.com>, <huangzhichao@...wei.com>,
        <xiehong@...wei.com>
Subject: Re: [RFC 0/4] pci/sriov: support VFs dynamic addition



On 2022/11/14 21:09, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> On 2022/11/14 15:04, Leon Romanovsky wrote:
>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>> Hi leon,
>>>>
>>>> On 2022/11/12 0:39, Leon Romanovsky wrote:
>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>>>> From: Longpeng <longpeng2@...wei.com>
>>>>>>
>>>>>> We can enable SRIOV and add VFs by /sys/bus/pci/devices/..../sriov_numvfs, but
>>>>>> this operation needs to spend lots of time if there has a large amount of VFs.
>>>>>> For example, if the machine has 10 PFs and 250 VFs per-PF, enable all the VFs
>>>>>> concurrently would cost about 200-250ms. However most of them are not need to be
>>>>>> used at the moment, so we can enable SRIOV first but add VFs on demand.
>>>>>
>>>>> It is unclear what took 200-250ms, is it physical VF creation or bind of
>>>>> the driver to these VFs?
>>>>>
>>>> It is neither. In our test, we already created physical VFs before, so we
>>>> skipped the 100ms waiting when writing PCI_SRIOV_CTRL. And our driver only
>>>> probes PF, it just returns an error if the function is VF.
>>>
>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
>>> true, It won't even try to probe VFs.
>>>
>>>>
>>>> The hotspot is the sriov_add_vfs (but no driver probe in fact) which is a
>>>> long procedure. Each step costs only a little, but the total cost is not
>>>> acceptable in some time-sensitive cases.
>>>
>>> This is also cryptic to me. In standard SR-IOV deployment, all VFs are
>>> created and configured while operator booted the machine with sriov_drivers_autoprobe
>>> set to false. Once this machine is ready, VFs are assigned to relevant VMs/users
>>> through orchestration SW (IMHO, it is supported by all orchestration SW).
>>>
>>> And only last part (assigning to users) is time-sensitive operation.
>>>
>> The VF creation and configuration are also time-sensitive in some cases, for
>> example, the hypervisor live update case (such as [1]):
>>   save VMs -> kexec -> restore VMs
>>
>> After the new kernel starts, the VFs must be added into the system, and then
>> assign the original VFs to the QEMU. This means we must enable all 2K+ VFs
>> at once and increase the downtime.
>>
>> If we can enable the VFs that are used by existing VMs then restore the VMs
>> and enable other unused VFs at last, the downtime would be significantly
>> reduced.
>>
>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
> 
> Like it is written in presentation, the standard way of doing it is done
> by VFIO live migration feature, where 2K+ VMs are migrated to another server
> at the time first server is scheduled for maintenance.
> 
Live migration is not the best choice in a production environment; it's
too heavy. Some cloud providers prefer to use hypervisor live update
in their systems, such as AWS's Nitro hypervisor.

> However, even in live update case mentioned in the presentation, you
> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
> so you don't need per-VF id enable knob.
> 
The presentation is just a reference; some points could be optimized,
including how PFs/VFs are disabled and enabled.

Hypervisor live update can finish in less than 1 second, so the cost of
disabling and re-enabling PFs/VFs (~200-250ms or even worse) is too
high.
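
For reference, a minimal sketch of how the cost of the existing sriov_numvfs
write can be timed from user space. The helper name set_numvfs, its root
parameter, and the device address in the usage note are hypothetical; only
the sriov_numvfs attribute under /sys/bus/pci/devices comes from this thread.

```python
import os
import time

SYSFS_PCI_DEVICES = "/sys/bus/pci/devices"  # standard sysfs location


def set_numvfs(bdf, count, root=SYSFS_PCI_DEVICES):
    """Write `count` to <root>/<bdf>/sriov_numvfs and return elapsed seconds.

    `root` is overridable only so the helper can be exercised against a
    scratch directory (an assumption for testing); on a real host the
    write needs root privileges.
    """
    path = os.path.join(root, bdf, "sriov_numvfs")
    start = time.perf_counter()
    with open(path, "w") as f:
        f.write(str(count))
    # The buffered write is flushed when the file is closed, so the
    # measurement below includes the time spent in the sysfs store.
    return time.perf_counter() - start
```

Something like set_numvfs("0000:3b:00.0", 250), where the address is a
placeholder, would return the wall-clock time of enabling 250 VFs on one PF.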

>>
>>>>
>>>> What’s more, the sriov_add_vfs adds the VFs of a PF one by one. So we can
>>>> mostly support 10 concurrent calls if there has 10 PFs.
>>>
>>> I wondered, are you using real HW? or QEMU SR-IOV? What is your server
>>> that supports such large number of VFs?
>>>
>> Physical device. Some devices in the market support the large number of VFs,
>> especially in the hardware offloading area, e.g DPU/IPU. I think the SR-IOV
>> software should keep pace with times too.
> 
> Our devices (and Intel too) support many VFs too. The thing is that
> servers are unlikely to be able to support 10 physical devices with 2K+
> VFs. There are many limitations that will make such is not usable.
> Like, global MSI-X pool and PCI bandwidth to support all these devices.
> 
>>
>>> BTW, Your change will probably break all SR-IOV devices in the market as
>>> they rely on PCI subsystem to have VFs ready and configured.
>>>
>> I see, but maybe this change could be a choice for some users.
> 
> It should come with relevant driver changes and very strong justification why
> such functionality is needed now and can't be achieved by anything else
> except user-facing sysfs.
> 
Adding 2K+ VFs to sysfs takes too much time.

Look at the bottom half of the hypervisor live update:
kexec --> add 2K VFs --> restore VMs

The downtime can be reduced if the sequence is:
kexec --> add 100 VFs (the ones the VMs use) --> restore VMs --> add 1.9K VFs
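
A rough back-of-the-envelope estimate of the difference, using the figures
from earlier in this thread (10 PFs, 250 VFs per PF, up to about 250ms to add
them all concurrently). It assumes the cost is linear in the number of VFs
per PF and that the in-use VFs are spread evenly across the PFs; both are
simplifications, not measurements, and the function names are made up for
illustration.

```python
def per_vf_cost_ms(total_ms, vfs_per_pf):
    """Per-VF cost if one PF adds its VFs serially and the cost is linear."""
    return total_ms / vfs_per_pf


def add_time_ms(num_vfs, num_pfs, cost_ms):
    """Time to add num_vfs spread evenly over num_pfs, with PFs in parallel."""
    vfs_on_busiest_pf = -(-num_vfs // num_pfs)  # ceiling division
    return vfs_on_busiest_pf * cost_ms


cost = per_vf_cost_ms(250, 250)       # upper figure quoted in this thread
full = add_time_ms(2500, 10, cost)    # add every VF before restoring VMs
staged = add_time_ms(100, 10, cost)   # add only the VFs the VMs use
print(f"full: {full:.0f} ms, staged: {staged:.0f} ms")
```

Under these assumptions, the part of the VF addition that sits on the
downtime path shrinks from about 250ms to about 10ms; the remaining VFs are
added after the VMs are running again.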


> I don't see anything in this presentation and discussion that supports
> need of such UAPI.
>
> Thanks
> 
>>
>>> Thanks
>>> .
> .