linux-kernel - Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220211174933.GQ4160@nvidia.com>
Date:   Fri, 11 Feb 2022 13:49:33 -0400
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Joao Martins <joao.m.martins@...cle.com>
Cc:     Shameerali Kolothum Thodi <shameerali.kolothum.thodi@...wei.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
        "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
        "mgurtovoy@...dia.com" <mgurtovoy@...dia.com>,
        Linuxarm <linuxarm@...wei.com>,
        liulongfang <liulongfang@...wei.com>,
        "Zengtao (B)" <prime.zeng@...ilicon.com>,
        yuzenghui <yuzenghui@...wei.com>,
        Jonathan Cameron <jonathan.cameron@...wei.com>,
        "Wangzhou (B)" <wangzhou1@...ilicon.com>
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:

> But well, at the end of the day for an IOMMU driver the domain ops are
> the important stuff, maybe IO pgtable framework isn't as critical
> (Intel for example, doesn't use that at all).

Right, it doesn't matter what library was used to implement the
domain..

> > User space page tables will not be abstracted and the userspace must
> > know the direct HW format of the IOMMU they is being used.
> > 
> That's countering earlier sentence? Because hw format (for me at least)
> means PTE and protection domain config format too. And if iommufd
> abstracts the HW format, modelling after the IOMMU domain and its ops,
> then it's abstracting userspace from those details e.g. it works over
> IOVAs, but not over its vendor representation of how that IOVA is set up.
> 
> I am probably being dense.

It is both ways, one kind of domain provides a kernel supplied
map/unmap that implements the IO PTE manipulation in kernel memory

But if you want the IOPTE to be in user memory then user must
read/write it and it cannot use that - so a user domain will not have
map/unmap.

> > It is basically the same, almost certainly the user API in iommufd
> > will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> > just like vfio has.
> > 
> 
> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
> migration flow. 

It is essential to implement any kind of viommu behavior where
map/unmap is occuring while the dirty tracking is running. It should
never make a difference except in some ugly edge cases where the dma
and unmap are racing.

> supposed to be happening (excluding P2P)? So perhaps the unmap is
> unneeded after quiescing the VF.

Yes, you don't need to unmap for migration, a simple fully synchronous
read and clear is sufficient. But that final read, while DMA is quite,
must obtain the latest dirty bit data.
 
> We have a bitmap based interface in KVM, but there's also a recent ring
> interface for dirty tracking, which has some probably more determinism than
> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
> and breaking its entries on-demand IIRC, whereas Intel resembles something
> closer to a 512 entries 'ring' with VMX PML, which tells what has been
> dirtied.

KVM has an advantage that it can throttle the rate of dirty generation
so it can rely on logging. The IOMMU can't throttle DMA, so we are
stuck with a bitmap

> > I don't know if mmap should be involed here, the dirty bitmaps are not
> > so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> > 
> Considering that is 32MB of a bitmap per TB maybe it is cheap.

Rigt. GUP fasting a couple huge pages is nothing compared to scanning
1TB of IO page table.

> >> Give me some time (few days only, as I gotta sort some things) and I'll
> >> respond here as follow up with link to a branch with the WIP/PoC patches.
> > 
> > Great!
> >
> Here it is. "A few days" turn into a week sorry :/
> 
> https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
> https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip
> 
> Note, it is an early PoC. I still need to get the split/collapse thing going,
> and fix the FIXMEs there, and have a second good look at the iommu page tables.

Oh I'll try to look a this thanks

> > You have to mark it as non-present to do the final read out if
> > something unmaps while the tracker is on - eg emulating a viommu or
> > something. Then you mark non-present, flush the iotlb and read back
> > the dirty bit.
> > 
> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
> too :) without needing VMM intervention (that's also not supported
> in Linux).

I'm sure, but dirty tracking has to happen on the kernel owned page
table, not the user owned one I think..

> > Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> > then read and clear them.
> > 
> It's the other way around AIUI. The dirty bits are sticky, so you flush
> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
> again on the next memory transaction (or ATS translation).

I guess it depends on how the HW works, if it writes the dirty bit
back to ram instantly on the first dirty, or if it only writes the
dirty bit when flushing the iotlb.

In any case you have to synchronize with the HW in some way to ensure
that all dirty bits are 'pulled back' into system memory to resolve
races (ie you need the iommu HW to release and the CPU to acquire on
the dirty data) before trying to read them, at least for the final
quieced system read.

> I am not entirely sure we need to unmap + mark non-present for non-viommu
> That would actually mean something is not properly quiscieing the VF DMA.
> Maybe we should .. to gate whether if we should actually continue with LM
> if something kept doing DMA when it shouldn't have.

It is certainly an edge case. A device would be misbehaving to
continue DMA.

> > This seems like it would be some interesting amount of driver work,
> > but yes it could be a generic new iommu_domina op.
> 
> I am slightly at odds that .split and .collapse at .switch() are enough.
> But, with iommu if we are working on top of an IOMMU domain object and
> .split and .collapse are iommu_ops perhaps that looks to be enough
> flexibility to give userspace the ability to decide what it wants to
> split, if it starts eargerly/warming-up tracking dirty pages.
> 
> The split and collapsing is something I wanted to work on next, to get
> to a stage closer to that of an RFC on the AMD side.

split/collapse seems kind of orthogonal to me it doesn't really
connect to dirty tracking other than being mostly useful during dirty
tracking.

And I wonder how hard split is when trying to atomically preserve any
dirty bit..

> Hmmm, judging how the IOMMU works I am not sure this is particularly
> affecting DMA performance (not sure yet about RDMA, it's something I
> curious to see how it gets to perform with 4K IOPTEs, and with dirty
> tracking always enabled). Considering how the bits are sticky, and
> unless CPU clears it, it's short of a nop? Unless of course the checking
> for A^D during an atomic memory transaction is expensive. Needs some
> performance testing nonetheless.

If you leave the bits all dirty then why do it? The point is to see
the dirties, which means the iommu is generating a workload of dirty
cachelines while operating it didn't do before.

> I forgot to mention, but the early enablement of IOMMU dirty tracking
> was also meant to fully know since guest creation what needs to be
> sent to the destination. Otherwise, wouldn't we need to send the whole
> pinned set to destination, if we only start tracking dirty pages during
> migration?

? At the start of migration you have to send everything. Dirty
tracking is intended to allow the VM to be stopped and then have only
a small set of data to xfer.
 
> Also, this is probably a differentiator for iommufd, if we were to provide
> split and collapse semantics to IOMMU domain objects that userspace can use.
> That would get more freedom, to switch dirty-tracking, and then do the warm
> up thingie and piggy back on what it wants to split before migration.
> perhaps the switch() should get some flag to pick where to split, I guess.

Yes, right. Split/collapse should be completely seperate from dirty
tracking.

Jason