Date:   Tue, 18 Oct 2022 15:59:54 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     Si-Wei Liu <si-wei.liu@...cle.com>
Cc:     Sean Mooney <smooney@...hat.com>, mst@...hat.com,
        virtualization@...ts.linux-foundation.org,
        linux-kernel@...r.kernel.org, Daniel Berrange <berrange@...hat.com>
Subject: Re: [PATCH 0/4] vDPA: dev config export via "vdpa dev show" command

On Tue, Oct 18, 2022 at 7:35 AM Si-Wei Liu <si-wei.liu@...cle.com> wrote:
>
>
>
> On 10/17/2022 5:28 AM, Sean Mooney wrote:
> > On Mon, 2022-10-17 at 15:08 +0800, Jason Wang wrote:
> >> Adding Sean and Daniel for more thoughts.
> >>
> >> On Sat, Oct 15, 2022 at 9:33 AM Si-Wei Liu <si-wei.liu@...cle.com> wrote:
> >>> Live migration of vdpa would typically require re-instating the vdpa
> >>> device with an identical set of configs on the destination node, the
> >>> same way the source node created the device in the first place.
> >>>
> >>> In order to allow live migration orchestration software to export the
> >>> initial set of vdpa attributes with which the device was created, it
> >>> will be useful if the vdpa tool can report the config on demand with a
> >>> simple query.
> >> For live migration, I think the management layer should have this
> >> knowledge and they can communicate directly without bothering the vdpa
> >> tool on the source. If I'm not wrong, this is the way libvirt does it now.
> > At least from an openstack (nova) perspective we are not expecting to do any vdpa device configuration
> > at the openstack level. To use a vdpa device in openstack the operator, when installing openstack,
> > needs to create a udev/systemd script to pre-create the vdpa devices.
> This seems to correlate vdpa device creation with the static allocation
> of SR-IOV VF devices. Perhaps OpenStack doesn't have a plan to support
> dynamic vdpa creation, but conceptually vdpa creation can be done on
> demand, e.g. over a Mellanox SubFunction or Intel Scalable IOV device.

Yes, it's not specific to vDPA but something that openstack needs to consider.
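
For context, the static pre-creation Sean describes can use the same
"vdpa dev add" command shown in the examples later in this thread; a
minimal sketch (the second PCI address and device name are illustrative,
and whether it runs from a udev rule or a systemd unit is up to the
operator) would be:

  # run once at boot, before nova/libvirt enumerate the host devices
  $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
  $ vdpa dev add mgmtdev pci/0000:41:04.3 name vdpa1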

>
> >
> > nova will query libvirt for the list of available vdpa devices at start up and record them in our database.
> > when scheduling we select a host that has a free vdpa device and on that host we generate an xml snippet
> > that references the vdpa device and provide that to libvirt, which will in turn program the mac.
> >
> > """
> > <interface type="vdpa">
> >      <mac address="b5:bc:2e:e7:51:ee"/>
> >      <source dev="/dev/vhost-vdpa-3"/>
> > </interface>
> > """
> >
> > when live migrating the workflow is similar. we ask our scheduler for a host that should have enough available
> > resources, then we make an rpc call "pre_live_migrate" which makes a number of assertions such as cpu compatibility

A migration compatibility check for vDPA should be done as well here.
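
As a rough sketch of such a check (the jq/bash filtering and host names are
illustrative; only the "vdpa dev -jp show" output format is taken from the
examples later in this thread), the orchestrator could compare the reported
device attributes on both hosts and abort on mismatch:

  $ vdpa dev -jp show vdpa0 > src.json                  # on the source
  $ ssh dest vdpa dev -jp show vdpa0 > dst.json         # from the destination
  $ diff <(jq '.dev[] | {max_vqs, max_vq_size, mac}' src.json) \
         <(jq '.dev[] | {max_vqs, max_vq_size, mac}' dst.json) \
    || echo "vDPA device attributes differ, refuse to migrate"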

> > but also computes cpu pinning and device passthrough assignments. i.e. in pre_live_migrate we select which cpu cores, pcie
> > devices and, in this case, vdpa devices to use on the destination host
> In the case of vdpa, does it (the pre_live_migrate rpc) now just select
> the parent mgmtdev for creating the vdpa device in a later phase, or does
> it end up with a vdpa device being created? Note that for now there are
> only a few properties for vdpa creation, e.g. mtu and mac, so it doesn't
> need a special reservation of resources for creating a vdpa device. But
> that may well change in the future.
>
> > and return that in our rpc result.
> >
> > we then use that information to update the libvirt domain xml with the new host-specific information and start
> > the migration at the libvirt level.
> >
> > today in openstack we use a hack i came up with to work around the fact that you can't migrate with sriov/pci passthrough
> > devices, in order to support live migration with vdpa. basically before we call libvirt to live migrate we hot unplug the vdpa nics
> > from the guest and add them back after the migration is complete. if you don't bond the vdpa nics with a transparently migratable
> > nic in the guest that obviously results in a loss of network connectivity while the migration is happening, which is not ideal,
> > so a normal virtio-net interface on ovs is what we recommend as the fallback interface for the bond.
> Do you need to preserve the mac address when falling back to the normal
> virtio-net interface, and similarly any other network config/state?
> Basically vDPA doesn't support live migration for the moment.

Basic shadow vq based live migration can work now. Eugenio is working
to make it fully ready in the near future.

> This doesn't look to be a technically correct solution for it to work.

I agree.

> >
> > obviously when vdpa supports transparent live migration we can just skip this workaround, which would be a very nice ux improvement.
> > one of the side effects of the hack however is you can start with an intel nic and end up with a mellanox nic because we don't need
> > to preserve the device capabilities since we are hotplugging.
> Exactly. This is the issue.
> >
> > with vdpa we will at least have a virtual virtio-net-pci frontend in qemu to provide some level of abstraction.
> > i guess the point you are raising is that for live migration we can't start with 4 queue pairs and vq_size=256
> > and select a device with 2 queue pairs and a vq_size of 512 and expect that to just work.
> Not exactly; the vq_size comes from QEMU and has nothing to do with the
> vDPA tool. And live migrating from 4 queue pairs to 2 queue pairs won't
> work for the guest driver. A change of queue pair numbers would need a
> device reset, which won't happen transparently during live migration.
> Basically libvirt has to match the exact queue pair number and queue
> length on the destination node.
>
> >
> > There are two ways to address that. 1) we can start recording this info in our db and schedule only to hosts with the same
> > configuration values, or 2) we can record the capabilities, i.e. the max values that are supported by a device, and schedule to a host
> > where they are >= the current values and rely on libvirt to reconfigure the device.
> >
> > libvirt requires very little input today to consume a vdpa interface
> > https://libvirt.org/formatdomain.html#vdpa-devices

So a question here: if we need to create vDPA devices on demand (e.g. with
the features and configs from the source), who will do the provisioning? Is
it libvirt?

Thanks

> > there are some generic virtio device options we could set https://libvirt.org/formatdomain.html#virtio-related-options
> > and some generic options, like the mtu, that the interface element supports
> >
> > but the minimal valid xml snippet is literally just the source dev path.
> >
> > <devices>
> >    <interface type='vdpa'>
> >      <source dev='/dev/vhost-vdpa-0'/>
> >    </interface>
> > </devices>
> >
> > nova only adds the mac address and MTU today, although i have some untested code that will try to also set the vq size.
> > https://github.com/openstack/nova/blob/11cb31258fa5b429ea9881c92b2d745fd127cdaf/nova/virt/libvirt/designer.py#L154-L167
> >
> > The basic support we have today assumes however that the vq_size is either the same on all hosts or that it does not matter, because we do
> > not support transparent live migration today so it's ok for it to change from host to host.
> > in any case we do not track the vq_size or vq count today so we can't schedule based on it or communicate it to libvirt via our
> > pre_live_migration rpc result. that means libvirt should check if the dest device has the same config, or update it if possible,
> > before starting the destination qemu instance and beginning the migration.
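
For illustration of what that check/update could look like at the vdpa CLI
level with this series applied (a minimal sketch; the attribute values are
copied from the examples below, and the destination mgmtdev would in
practice be whatever parent device the target host provides):

  # on the source: creation-time config as exported by "vdpa dev show"
  $ vdpa dev show vdpa0
  vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
    mac e4:11:c6:d3:45:f0 max_vq_pairs 4

  # on the destination: recreate the device with matching attributes
  # before starting the destination qemu instance
  $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 mac e4:11:c6:d3:45:f0 max_vq_pairs 4
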
> >
> >>> This will ease the orchestration software implementation
> >>> so that it doesn't have to keep track of vdpa config changes, or have
> >>> to persist vdpa attributes across failure and recovery, for fear of
> >>> being killed due to an accidental software error.
> > the vdpa device config is not something we do today, so this would make our lives more complex
> It's a question of which use cases to support or not. These configs
> existed well before my change.
>
> > depending on
> > what that info is. at least in the case of nova we do not use the vdpa cli at all, we use libvirt as an indirection layer.
> > so libvirt would need to support this interface, we would then have to add it to our db and modify our RPC interface
> > to update the libvirt xml with additional info we don't need today.
>
> Yes. You can follow libvirt when the corresponding support is done, but
> I think it's orthogonal to my changes. Basically my change won't
> affect libvirt's implementation at all.
>
> Thanks,
> -Siwei
>
>
> >>> In this series, the initial device config for vdpa creation will be
> >>> exported via the "vdpa dev show" command. This is unlike the "vdpa
> >>> dev config show" command, which usually reports the live value in
> >>> the device config space and is not reliable, being subject to the
> >>> dynamics of feature negotiation and possible changes in the device
> >>> config space.
> >>>
> >>> Examples:
> >>>
> >>> 1) Create vDPA by default without any config attribute
> >>>
> >>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0
> >>> $ vdpa dev show vdpa0
> >>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> >>> $ vdpa dev -jp show vdpa0
> >>> {
> >>>      "dev": {
> >>>          "vdpa0": {
> >>>              "type": "network",
> >>>              "mgmtdev": "pci/0000:41:04.2",
> >>>              "vendor_id": 5555,
> >>>              "max_vqs": 9,
> >>>              "max_vq_size": 256
> >>>          }
> >>>      }
> >>> }
> > This is how openstack works today. this step is done statically at boot time, typically via a udev script or systemd service file.
> > the mac address is updated on the vdpa interface by libvirt when it's assigned to the qemu process.
> > if we wanted to support multi queue or vq size configuration it would also happen at that time, not during device creation.
> >>> 2) Create vDPA with config attribute(s) specified
> >>>
> >>> $ vdpa dev add mgmtdev pci/0000:41:04.2 name vdpa0 \
> >>>      mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> >>> $ vdpa dev show
> >>> vdpa0: type network mgmtdev pci/0000:41:04.2 vendor_id 5555 max_vqs 9 max_vq_size 256
> >>>    mac e4:11:c6:d3:45:f0 max_vq_pairs 4
> >>> $ vdpa dev -jp show
> >>> {
> >>>      "dev": {
> >>>          "vdpa0": {
> >>>              "type": "network",
> >>>              "mgmtdev": "pci/0000:41:04.2",
> >> So "mgmtdev" looks not necessary for live migration.
> >>
> >> Thanks
> >>
> >>>              "vendor_id": 5555,
> >>>              "max_vqs": 9,
> >>>              "max_vq_size": 256,
> >>>              "mac": "e4:11:c6:d3:45:f0",
> >>>              "max_vq_pairs": 4
> >>>          }
> >>>      }
> >>> }
> > dynamically creating vdpa devices at runtime, while possible, is not an approach we are planning to support.
> >
> > currently in nova we prefer to do allocation of statically provisioned resources.
> > for persistent memory, sriov/pci passthrough, dedicated cpus, hugepages and vdpa devices we manage inventories
> > of resources that the operator has configured on the platform.
> >
> > we have one exception to this static approach which is semi dynamic: that is how we manage vfio mediated devices.
> > for reasons that are not important we currently track the parent devices that are capable of providing MDEVs
> > and we directly write to /sys/... to create the mdev instance of a requested mdev on demand.
> >
> > This has proven to be quite problematic as we have encountered caching bugs due to the delay between device
> > creation and when the /sys interface exposes the directory structure for the mdev. This has led to libvirt, and as a result
> > nova, getting out of sync with the actual state of the host. There are also issues with host reboots.
> >
> > while we do see the advantage of being able to create vdpa interfaces on demand, especially if we can do finer grained resource
> > partitioning by allocating one mdev with 4 vqs and another with 8 etc., our experience with dynamic mdev management gives us
> > pause. we can and will fix our bugs with mdevs, but we have found that most of our customers that use features like this
> > are telcos or other similar industries that typically have very static workloads. while there is some interest in making
> > their clouds more dynamic, they typically fill a host and run the same workload on that host for months to years at a
> > time and plan their hardware accordingly, so they are well served by the static use case "1) Create vDPA by default without any config attribute".
> >
> >>> ---
> >>>
> >>> Si-Wei Liu (4):
> >>>    vdpa: save vdpa_dev_set_config in struct vdpa_device
> >>>    vdpa: pass initial config to _vdpa_register_device()
> >>>    vdpa: show dev config as-is in "vdpa dev show" output
> >>>    vdpa: fix improper error message when adding vdpa dev
> >>>
> >>>   drivers/vdpa/ifcvf/ifcvf_main.c      |  2 +-
> >>>   drivers/vdpa/mlx5/net/mlx5_vnet.c    |  2 +-
> >>>   drivers/vdpa/vdpa.c                  | 63 +++++++++++++++++++++++++++++++++---
> >>>   drivers/vdpa/vdpa_sim/vdpa_sim_blk.c |  2 +-
> >>>   drivers/vdpa/vdpa_sim/vdpa_sim_net.c |  2 +-
> >>>   drivers/vdpa/vdpa_user/vduse_dev.c   |  2 +-
> >>>   drivers/vdpa/virtio_pci/vp_vdpa.c    |  3 +-
> >>>   include/linux/vdpa.h                 | 26 ++++++++-------
> >>>   8 files changed, 80 insertions(+), 22 deletions(-)
> >>>
> >>> --
> >>> 1.8.3.1
> >>>
>
