Message-ID: <6198d35c-f810-cab1-8b43-2f817de2c1ea@oracle.com>
Date: Tue, 15 Feb 2022 16:00:35 +0000
From: Joao Martins <joao.m.martins@...cle.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@...wei.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"mgurtovoy@...dia.com" <mgurtovoy@...dia.com>,
Linuxarm <linuxarm@...wei.com>,
liulongfang <liulongfang@...wei.com>,
"Zengtao (B)" <prime.zeng@...ilicon.com>,
yuzenghui <yuzenghui@...wei.com>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
"Wangzhou (B)" <wangzhou1@...ilicon.com>
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
On 2/14/22 14:06, Jason Gunthorpe wrote:
> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
>
>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>> and then write the new valid entry. Not sure I understood correctly that this
>> is the 'break-before-make' thingie.
>
> Doesn't that explode if the invalid entry is DMA'd to?
>
Yes, IIUC. Also, the manual has this note:
"Note: For example, to split a block into constituent granules
(or to merge a span of granules into an equivalent block), VMSA
requires the region to be made invalid, a TLB invalidate
performed, then to make the region take the new configuration.
Note: The requirement for a break-before-make sequence can cause
problems for unrelated I/O streams that might use addresses
overlapping a region of interest, because the I/O streams cannot
always be conveniently stopped and might not tolerate translation
faults. It is advantageous to perform live update of a block into
smaller translations, or a set of translations into a larger block
size."

Probably why the original SMMUv3.2 dirty tracking series required FEAT_BBM,
as it had to do in-place atomic updates to split/collapse IO pagetables.
Not entirely clear if HTTU dirty-bit updates require the same.
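
FWIW, the sequence I have in mind for splitting a block entry is roughly
sketched below; the types and helpers are stand-ins for illustration, not
any io-pgtable driver's actual API:

#include <stdint.h>
#include <stddef.h>

typedef uint64_t pte_t;

/* stand-in: range-invalidate the IOTLB and wait for completion */
extern void iotlb_inv_range(uint64_t iova, size_t size);
/* stand-in: build a next-level table of granules covering the same
 * output range and permissions as the old block descriptor */
extern pte_t make_table_desc(pte_t old_block, size_t blk_size);

static void split_block_bbm(volatile pte_t *ptep, uint64_t iova,
                            size_t blk_size)
{
        pte_t old = *ptep;

        /* 1) break: make the region invalid */
        *ptep = 0;

        /* 2) invalidate the IOTLB for the block's range; any DMA that
         *    hits the range now faults, which is exactly the problem
         *    the note above describes */
        iotlb_inv_range(iova, blk_size);

        /* 3) make: install the replacement table of smaller granules */
        *ptep = make_table_desc(old, blk_size);
}

Presumably FEAT_BBM is what lets 1)-2) collapse into a single in-place
update of *ptep followed by an invalidate, without the window where the
region is invalid.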
>>>> I wonder if we could start progressing the dirty tracking as a first initial series and
>>>> then have the split + collapse handling as a second part? That would be quite
>>>> nice to get me going! :D
>>>
>>> I think so, and I think we should. It is such a big problem space, it
>>> needs to get broken up.
>>
>> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
>> past except with the x86 support only[*]. And we poke holes there I guess.
>>
>> [*] I might include Intel too, albeit emulated only.
>
> Like I said, I'd prefer we not build more on the VFIO type 1 code
> until we have a conclusion for iommufd..
>
I didn't quite understand what you mean by conclusion.

If by conclusion you mean the whole thing being merged, how can the work be
broken up into pieces if we're busy-waiting on the new subsystem? Or maybe
you meant in terms of direction...

I can build on top of iommufd -- just trying to understand how this is
going to work out.
> While returning the dirty data looks straight forward, it is hard to
> see an obvious path to enabling and controlling the system iommu the
> way vfio is now.
It seems strange to have a whole UAPI [*] meant to return dirty data to
userspace when 'dirty' right now means the whole pinned page set, and hence
copying the whole guest ... and the guest is running, so we might be racing
with the device changing guest pages while the VMM/CPU is unaware of it.
Even without touching the IOMMU pagetables (i.e. splitting/collapsing IO
page sizes), real dirty tracking would still greatly improve on the current
status quo of copying the entire thing :(
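
For reference, the flow a VMM goes through with that UAPI today looks
roughly like the below (userspace sketch; the container fd, IOVA range and
page size are the caller's, the helper names are made up, error handling is
mostly omitted, and STOP is the same ioctl with _FLAG_STOP):

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdlib.h>

static int dirty_tracking_start(int container)
{
        struct vfio_iommu_type1_dirty_bitmap d = {
                .argsz = sizeof(d),
                .flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
        };

        return ioctl(container, VFIO_IOMMU_DIRTY_PAGES, &d);
}

static int dirty_bitmap_get(int container, uint64_t iova, uint64_t size,
                            uint64_t pgsize, uint64_t *data)
{
        size_t argsz = sizeof(struct vfio_iommu_type1_dirty_bitmap) +
                       sizeof(struct vfio_iommu_type1_dirty_bitmap_get);
        struct vfio_iommu_type1_dirty_bitmap *d = calloc(1, argsz);
        struct vfio_iommu_type1_dirty_bitmap_get *range;
        int ret;

        if (!d)
                return -1;

        d->argsz = argsz;
        d->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
        range = (struct vfio_iommu_type1_dirty_bitmap_get *)d->data;
        range->iova = iova;
        range->size = size;
        range->bitmap.pgsize = pgsize;
        /* one bit per page, rounded up to 64-bit words */
        range->bitmap.size = ((size / pgsize + 63) / 64) * sizeof(uint64_t);
        range->bitmap.data = (__u64 *)data;

        /* with pinning-based tracking every pinned page comes back dirty */
        ret = ioctl(container, VFIO_IOMMU_DIRTY_PAGES, d);
        free(d);
        return ret;
}

i.e. the START/GET_BITMAP plumbing is already there; it's the bitmap the
kernel hands back that is (pessimistically) everything pinned.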
Hence my thinking was that the patches /if small/ would let us see how dirty
tracking might work for the iommu kAPI (and iommufd) too.

Would it be better to do more iterative steps (when possible), as opposed to
scrapping and rebuilding the VFIO type1 IOMMU handling?
Joao
[*] VFIO_IOMMU_DIRTY_PAGES{_FLAG_START,_FLAG_STOP,_FLAG_GET_BITMAP}