linux-kernel - Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device


Date:   Mon, 12 Dec 2022 09:20:09 +0800
From:   "Rao, Lei" <lei.rao@...el.com>
To:     Max Gurtovoy <mgurtovoy@...dia.com>,
        Christoph Hellwig <hch@....de>, Jason Gunthorpe <jgg@...pe.ca>
CC:     <kbusch@...nel.org>, <axboe@...com>, <kch@...dia.com>,
        <sagi@...mberg.me>, <alex.williamson@...hat.com>,
        <cohuck@...hat.com>, <yishaih@...dia.com>,
        <shameerali.kolothum.thodi@...wei.com>, <kevin.tian@...el.com>,
        <mjrosato@...ux.ibm.com>, <linux-kernel@...r.kernel.org>,
        <linux-nvme@...ts.infradead.org>, <kvm@...r.kernel.org>,
        <eddie.dong@...el.com>, <yadong.li@...el.com>,
        <yi.l.liu@...el.com>, <Konrad.wilk@...cle.com>,
        <stephen@...eticom.com>, <hang.yuan@...el.com>,
        Oren Duer <oren@...dia.com>
Subject: Re: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device



On 12/11/2022 10:51 PM, Max Gurtovoy wrote:
> 
> On 12/11/2022 3:21 PM, Rao, Lei wrote:
>>
>>
>> On 12/11/2022 8:05 PM, Max Gurtovoy wrote:
>>>
>>> On 12/6/2022 5:01 PM, Christoph Hellwig wrote:
>>>> On Tue, Dec 06, 2022 at 10:48:22AM -0400, Jason Gunthorpe wrote:
>>>>> Sadly in Linux we don't have a SRIOV VF lifecycle model that is any
>>>>> use.
>>>> Beware:  The secondary function might as well be a physical function
>>>> as well.  In fact one of the major customers for "smart" multifunction
>>>> nvme devices prefers multi-PF devices over SR-IOV VFs. (and all the
>>>> symmetric dual ported devices are multi-PF as well).
>>>>
>>>> So this isn't really about a VF life cycle, but how to manage live
>>>> migration, especially on the receive / restore side.  And restoring
>>>> the entire controller state is extremely invasive and can't be done
>>>> on a controller that is in any classic form live.  In fact a lot
>>>> of the state is subsystem-wide, so without some kind of virtualization
>>>> of the subsystem it is impossible to actually restore the state.
>>>
>>> ohh, great !
>>>
>>> I read this subsystem virtualization proposal of yours after I sent my proposal for subsystem virtualization in the patch 1/5 thread.
>>> I guess this means that this is the right way to go.
>>> Let's continue brainstorming this idea. I think this can be the way to migrate NVMe controllers in a standard way.
>>>
>>>>
>>>> To cycle back to the hardware that is posted here, I'm really confused
>>>> how it actually has any chance to work and no one has even tried
>>>> to explain how it is supposed to work.
>>>
>>> I guess in a vendor-specific implementation you can assume some things that we are now discussing in order to standardize them.
>>
>> Yes, as I wrote in the cover letter, this is a reference implementation to
>> start a discussion and help drive standardization efforts, but this series
>> works well for Intel IPU NVMe. As Jason said, there are two use cases:
>> shared medium and local medium. I think the live migration of the local medium
>> is complicated due to the large amount of user data that needs to be migrated.
>> I don't have a good idea to deal with this situation. But for Intel IPU NVMe,
>> each VF can connect to remote storage via the NVMF protocol to achieve storage
>> offloading. This is the shared medium. In this case, we don't need to migrate
>> the user data, which will significantly simplify the work of live migration.
> 
> I don't think that medium migration should be part of the SPEC. We can specify it's out of scope.
> 
> The whole idea of live migration is to have a short downtime, and I don't think we can guarantee short downtime if we need to copy a few terabytes through the network.
> If the media copy takes a few seconds, there is no need to do live migration with a few milliseconds of downtime. Just do regular migration of a
> 
>>
>> The series tries to solve the problem of live migration of shared medium.
>> But it still lacks dirty page tracking and P2P support; we are also developing
>> these features.
>>
>> About the NVMe device state: as described in my document, the VF states include
>> VF CSR registers, every IO Queue Pair state, and the AdminQ state. During the
>> implementation, I found that the device state data is small per VF. So, I decided
>> to use the admin queue of the Primary controller to send the live migration
>> commands to save and restore the VF states like MLX5.
> 
> I think and hope we all agree that the AdminQ of the controlling NVMe function will be used to migrate the controlled NVMe function.

Fully agree.

> 
> Which document are you referring to?

The fifth patch includes the definition of these commands and how the firmware handles
these live migration commands. It's the documentation that I referenced.
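
To illustrate why the per-VF state stays small, here is a minimal C sketch of a
state blob holding the three pieces named above: the VF CSR registers, the AdminQ
state, and one entry per IO queue pair. This is not the format defined in patch
5/5. Every struct, field, and function name below is a hypothetical chosen for
illustration; only the register names (CC, CSTS, AQA, ASQ, ACQ) come from the
NVMe specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-queue-pair state; the field choice is an assumption. */
struct vf_queue_state {
	uint64_t sq_base;	/* submission queue base address */
	uint64_t cq_base;	/* completion queue base address */
	uint16_t sq_size;
	uint16_t cq_size;
	uint16_t sq_tail;	/* last submission doorbell value */
	uint16_t cq_head;	/* last completion doorbell value */
};

/* Subset of NVMe controller registers assumed to be captured per VF. */
struct vf_csr_state {
	uint32_t cc;		/* Controller Configuration */
	uint32_t csts;		/* Controller Status */
	uint32_t aqa;		/* Admin Queue Attributes */
	uint64_t asq;		/* Admin Submission Queue base */
	uint64_t acq;		/* Admin Completion Queue base */
};

/* Hypothetical blob header; nr_io_qpairs queue entries follow it. */
struct vf_state_hdr {
	uint16_t vf_id;
	uint16_t nr_io_qpairs;
	struct vf_csr_state csr;
	struct vf_queue_state adminq;
};

/* Total blob size for a VF with the given number of IO queue pairs. */
size_t vf_state_size(uint16_t nr_io_qpairs)
{
	return sizeof(struct vf_state_hdr) +
	       (size_t)nr_io_qpairs * sizeof(struct vf_queue_state);
}

/* Pack header and IO queue states; returns bytes written, 0 if buf is too small. */
size_t vf_state_save(void *buf, size_t len, const struct vf_state_hdr *hdr,
		     const struct vf_queue_state *ioq)
{
	size_t need;

	assert(buf != NULL && hdr != NULL);
	need = vf_state_size(hdr->nr_io_qpairs);
	if (len < need)
		return 0;
	memcpy(buf, hdr, sizeof(*hdr));
	if (hdr->nr_io_qpairs)
		memcpy((uint8_t *)buf + sizeof(*hdr), ioq,
		       hdr->nr_io_qpairs * sizeof(*ioq));
	return need;
}

/* Unpack a blob; returns bytes consumed, 0 if it is truncated or has too many queues. */
size_t vf_state_restore(const void *buf, size_t len, struct vf_state_hdr *hdr,
			struct vf_queue_state *ioq, uint16_t max_qpairs)
{
	assert(buf != NULL && hdr != NULL);
	if (len < sizeof(*hdr))
		return 0;
	memcpy(hdr, buf, sizeof(*hdr));
	if (hdr->nr_io_qpairs > max_qpairs ||
	    len < vf_state_size(hdr->nr_io_qpairs))
		return 0;
	if (hdr->nr_io_qpairs)
		memcpy(ioq, (const uint8_t *)buf + sizeof(*hdr),
		       hdr->nr_io_qpairs * sizeof(*ioq));
	return vf_state_size(hdr->nr_io_qpairs);
}
```

With a layout of this shape the blob grows only linearly with the number of IO
queue pairs, which is consistent with carrying it in admin commands on the
primary controller rather than needing a separate bulk-transfer path.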
