Message-ID: <4f11e0bb-e090-bf9b-4f98-578273865200@nvidia.com>
Date: Wed, 7 Dec 2022 12:59:00 +0200
From: Max Gurtovoy <mgurtovoy@...dia.com>
To: Christoph Hellwig <hch@....de>, Jason Gunthorpe <jgg@...pe.ca>
Cc: Lei Rao <lei.rao@...el.com>, kbusch@...nel.org, axboe@...com,
kch@...dia.com, sagi@...mberg.me, alex.williamson@...hat.com,
cohuck@...hat.com, yishaih@...dia.com,
shameerali.kolothum.thodi@...wei.com, kevin.tian@...el.com,
mjrosato@...ux.ibm.com, linux-kernel@...r.kernel.org,
linux-nvme@...ts.infradead.org, kvm@...r.kernel.org,
eddie.dong@...el.com, yadong.li@...el.com, yi.l.liu@...el.com,
Konrad.wilk@...cle.com, stephen@...eticom.com, hang.yuan@...el.com
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to
issue admin commands for VF driver.
On 12/7/2022 9:54 AM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
>> What the kernel is doing is providing the abstraction to link the
>> controlling function to the VFIO device in a general way.
>>
>> We don't want to just punt this problem to user space and say 'good
>> luck finding the right cdev for migration control'. If the kernel
>> struggles to link them then userspace will not fare better on its own.
> Yes. But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling function, and to discover the controlled functions
> based on the controlling functions.
>
> In other words: there should be absolutely no need to have any
> special kernel support for the controlled function. Instead the
> controlling function enumerates all the functions it controls, exports
> that to userspace, and exposes the functionality to save state from
> and restore state to the controlled functions.
Why is it preferred that the migration SW talk directly to the PF and
not via the VFIO interface?
It's just an implementation detail.
I feel like it even sounds more reasonable to have a common API, like we
have today, for save_state/resume_state/quiesce_device/freeze_device,
and let each device implementation translate this functionality to its
own spec.
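
Roughly something like this (all the names below are hypothetical, just
to illustrate the shape of such a common API; each vendor/protocol
driver would back these callbacks with its own spec):

#include <stddef.h>

struct vfio_device;	/* opaque here, sketch only */

/*
 * Hypothetical common live-migration callbacks -- none of these names
 * exist in the kernel today.  An NVMe variant driver would translate
 * each one into admin commands on the controlling (primary) controller,
 * mlx5 into its own device commands, and so on.
 */
struct vfio_lm_ops {
	int (*quiesce_device)(struct vfio_device *vdev);
	int (*freeze_device)(struct vfio_device *vdev);
	int (*save_state)(struct vfio_device *vdev, void *buf, size_t len);
	int (*resume_state)(struct vfio_device *vdev, const void *buf,
			    size_t len);
};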
If I understand correctly, your direction is to have QEMU code talk to
nvmecli/new_mlx5cli/my_device_cli to do that, and I'm not sure that's needed.
The controlled device is not aware of the migration process at all; only
the migration SW, the system admin, and the controlling device are.
I see 2 orthogonal discussions here: NVMe standardization for LM and
Linux implementation for LM.
For the NVMe standardization: I think we all agree, at a high level, that
the primary controller manages the LM of the secondary controllers. The
primary controller can list the secondary controllers and exposes APIs
on its admin queue to manage the LM process of its secondary
controllers. LM capabilities will be exposed using the identify_ctrl
admin cmd of the primary controller.
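
For the "list the secondary controllers" part the spec already gives us
what we need: the Secondary Controller List (Identify CNS 15h) issued on
the primary controller, which is what nvme-cli's list-secondary uses. A
rough userspace sketch through the existing passthrough ioctl (error
handling trimmed; the LM admin commands themselves are of course still
to be standardized, and /dev/nvme0 is just an example primary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	unsigned char buf[4096];
	struct nvme_admin_cmd cmd = {
		.opcode   = 0x06,		/* Identify */
		.addr     = (__u64)(unsigned long)buf,
		.data_len = sizeof(buf),
		.cdw10    = 0x15,	/* CNS: Secondary Controller List */
	};
	int fd = open("/dev/nvme0", O_RDONLY);

	memset(buf, 0, sizeof(buf));
	if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd))
		return 1;
	/* buf now holds the Secondary Controller List of the primary */
	printf("got secondary controller list from the primary\n");
	return 0;
}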
For the Linux implementation: the direction we started last year is to
have vendor-specific (mlx5/hisi/..) or protocol-specific (nvme/virtio/..)
vfio drivers. We built an infrastructure to do that by splitting the
vfio_pci driver into vfio_pci and vfio_pci_core, and updated the uAPIs
as well to support the P2P case for live migration. Dirty page tracking
is also progressing. More work is certainly still needed to improve this
infrastructure.
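
And for the state transitions themselves we already have a generic uAPI
(VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, v5.18+ headers) that mlx5 and
hisi_acc implement on top of vfio_pci_core. A sketch of how the
migration SW drives it, the same for whatever variant driver sits
underneath (device_fd is the VFIO device fd; STOP_COPY on the source,
RESUMING on the destination, RUNNING_P2P for the P2P case mentioned
above):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_set_mig_state(int device_fd, unsigned int new_state,
			      int *data_fd)
{
	__u64 buf[(sizeof(struct vfio_device_feature) +
		   sizeof(struct vfio_device_feature_mig_state) + 7) / 8];
	struct vfio_device_feature *feature = (void *)buf;
	struct vfio_device_feature_mig_state *mig = (void *)feature->data;

	memset(buf, 0, sizeof(buf));
	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE;
	mig->device_state = new_state;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
		return -1;
	/* transitions that produce/consume device state hand back a data_fd */
	if (data_fd)
		*data_fd = mig->data_fd;
	return 0;
}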
I hope that all the above efforts are also going to be used for the NVMe
LM implementation, unless there is something NVMe-specific in the way
its PCI functions are migrated that I can't see now.
If there is something NVMe-specific for LM, then the migration SW and
QEMU will need to be aware of it, and with that awareness we lose the
benefit of a generic VFIO interface.
>
>> Especially, we do not want every VFIO device to have its own crazy way
>> for userspace to link the controlling/controlled functions
>> together. This is something the kernel has to abstract away.
> Yes. But the direction must go controlling to controlled, not the
> other way around.
So in the source (a rough sketch of steps 1 and 4 follows the list):
1. We enable SR-IOV on the NVMe driver
2. We list all the secondary controllers: nvme1, nvme2, nvme3
3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
(controlling to controlled).
4. We bind nvme1, nvme2, nvme3 to the VFIO NVMe driver
5. We pass these functions to the VM
6. We start the migration process.
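
Steps 1 and 4 need nothing NVMe-specific, it is the standard sysfs
plumbing (the BDFs and the "nvme-vfio-pci" driver name below are
placeholders, just for illustration):

#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s", val);
	return fclose(f);
}

int main(void)
{
	/* 1. enable SR-IOV on the controlling (primary) function */
	sysfs_write("/sys/bus/pci/devices/0000:3b:00.0/sriov_numvfs", "3");

	/* 4. rebind one controlled (secondary) function to the variant
	 *    VFIO driver via driver_override */
	sysfs_write("/sys/bus/pci/devices/0000:3b:00.1/driver_override",
		    "nvme-vfio-pci");
	sysfs_write("/sys/bus/pci/devices/0000:3b:00.1/driver/unbind",
		    "0000:3b:00.1");
	sysfs_write("/sys/bus/pci/drivers_probe", "0000:3b:00.1");
	return 0;
}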
And in the destination (a sketch of the resume step follows the list):
1. We enable SR-IOV on the NVMe driver
2. We list all the secondary controllers: nvme1, nvme2, nvme3
3. We allow migration resume to nvme1, nvme2, nvme3 - now these VFs are
resumable (controlling to controlled).
4. We bind nvme1, nvme2, nvme3 to the VFIO NVMe driver
5. We pass these functions to the VM
6. We start the migration resume process.
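
Step 6 on the destination is again the generic uAPI, reusing
vfio_set_mig_state() from the sketch above: enter RESUMING, stream the
state the source produced in STOP_COPY into data_fd, then put the VF
back to RUNNING:

#include <unistd.h>
#include <linux/vfio.h>

static int vfio_resume(int device_fd, const void *saved, size_t len)
{
	int data_fd;

	if (vfio_set_mig_state(device_fd, VFIO_DEVICE_STATE_RESUMING,
			       &data_fd))
		return -1;
	if (write(data_fd, saved, len) != (ssize_t)len)
		return -1;
	close(data_fd);
	return vfio_set_mig_state(device_fd, VFIO_DEVICE_STATE_RUNNING, NULL);
}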