Message-ID: <20200205041817-mutt-send-email-mst@kernel.org>
Date:   Wed, 5 Feb 2020 04:22:28 -0500
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Jason Wang <jasowang@...hat.com>
Cc:     Tiwei Bie <tiwei.bie@...el.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org,
        netdev@...r.kernel.org, shahafs@...lanox.com, jgg@...lanox.com,
        rob.miller@...adcom.com, haotian.wang@...ive.com,
        eperezma@...hat.com, lulu@...hat.com, parav@...lanox.com,
        rdunlap@...radead.org, hch@...radead.org, jiri@...lanox.com,
        hanand@...inx.com, mhabets@...arflare.com,
        maxime.coquelin@...hat.com, lingshan.zhu@...el.com,
        dan.daly@...el.com, cunming.liang@...el.com, zhihong.wang@...el.com
Subject: Re: [PATCH] vhost: introduce vDPA based backend

On Wed, Feb 05, 2020 at 03:42:18PM +0800, Jason Wang wrote:
> 
> On 2020/2/5 3:16 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 05, 2020 at 02:49:31PM +0800, Jason Wang wrote:
> > > On 2020/2/5 2:30 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 05, 2020 at 01:50:28PM +0800, Jason Wang wrote:
> > > > > On 2020/2/5 1:31 PM, Michael S. Tsirkin wrote:
> > > > > > On Wed, Feb 05, 2020 at 11:12:21AM +0800, Jason Wang wrote:
> > > > > > > On 2020/2/5 10:05 AM, Tiwei Bie wrote:
> > > > > > > > On Tue, Feb 04, 2020 at 02:46:16PM +0800, Jason Wang wrote:
> > > > > > > > > On 2020/2/4 2:01 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Feb 04, 2020 at 11:30:11AM +0800, Jason Wang wrote:
> > > > > > > > > > > > 5) generate diffs of the memory table and use the IOMMU API to set up the
> > > > > > > > > > > > DMA mapping in this method
> > > > > > > > > > Frankly I think that's a bunch of work. Why not a MAP/UNMAP interface?
> > > > > > > > > > 
> > > > > > > > > Sure, so that basically VHOST_IOTLB_UPDATE/INVALIDATE I think?
> > > > > > > > > Do you mean we let userspace use only VHOST_IOTLB_UPDATE/INVALIDATE
> > > > > > > > to do the DMA mapping in vhost-vdpa case? When vIOMMU isn't available,
> > > > > > > > userspace will set msg->iova to GPA, otherwise userspace will set
> > > > > > > > msg->iova to GIOVA, and vhost-vdpa module will get HPA from msg->uaddr?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Tiwei
> > > > > > > I think so. Michael, do you think this makes sense?
> > > > > > > 
> > > > > > > Thanks
> > > > > > to make sure, could you post the suggested argument format for
> > > > > > these ioctls?
> > > > > > 
> > > > > > It's the existing uapi:
> > > > > 
> > > > > /* no alignment requirement */
> > > > > struct vhost_iotlb_msg {
> > > > >       __u64 iova;
> > > > >       __u64 size;
> > > > >       __u64 uaddr;
> > > > > #define VHOST_ACCESS_RO      0x1
> > > > > #define VHOST_ACCESS_WO      0x2
> > > > > #define VHOST_ACCESS_RW      0x3
> > > > >       __u8 perm;
> > > > > #define VHOST_IOTLB_MISS           1
> > > > > #define VHOST_IOTLB_UPDATE         2
> > > > > #define VHOST_IOTLB_INVALIDATE     3
> > > > > #define VHOST_IOTLB_ACCESS_FAIL    4
> > > > >       __u8 type;
> > > > > };
> > > > > 
> > > > > #define VHOST_IOTLB_MSG 0x1
> > > > > #define VHOST_IOTLB_MSG_V2 0x2
> > > > > 
> > > > > struct vhost_msg {
> > > > >       int type;
> > > > >       union {
> > > > >           struct vhost_iotlb_msg iotlb;
> > > > >           __u8 padding[64];
> > > > >       };
> > > > > };
> > > > > 
> > > > > struct vhost_msg_v2 {
> > > > >       __u32 type;
> > > > >       __u32 reserved;
> > > > >       union {
> > > > >           struct vhost_iotlb_msg iotlb;
> > > > >           __u8 padding[64];
> > > > >       };
> > > > > };
> > > > Oh ok.  So with a real device, I suspect we do not want to wait for each
> > > > change to be processed by device completely, so we might want an asynchronous variant
> > > > and then some kind of flush that tells device "you better apply these now".
> > > 
> > > Let me explain:
> > > 
> > > There are two types of devices:
> > > 
> > > > 1) device without on-chip IOMMU: DMA is done via the IOMMU API, which only
> > > > supports incremental map/unmap
> > > Most IOMMUs have queues nowadays though. Whether APIs within the kernel
> > > expose that matters, but we are better off emulating
> > > hardware, not specific guest behaviour.
> 
> 
> Last time I checked the Intel IOMMU driver, I saw that async QI is not used
> there. And I'm not sure how a queue will help much here. Qemu still needs to
> wait until all the DMA is set up before letting the guest work.
> 
> > 
> > > 2) device with on-chip IOMMU, DMA could be done by device driver itself, and
> > > we could choose to pass the whole mappings to the driver at one time through
> > > vDPA bus operation (set_map)
> > > 
> > > > For vhost-vdpa, there are two types of memory mapping:
> > > 
> > > a) memory table, setup by userspace through VHOST_SET_MEM_TABLE, the whole
> > > mapping is updated in this way
> > > b) IOTLB API, incrementally done by userspace through vhost message
> > > (IOTLB_UPDATE/IOTLB_INVALIDATE)
> > > 
> > > The current design is:
> > > 
> > > - Reuse VHOST_SET_MEM_TABLE, and for type 1), we can choose to send diffs
> > > through IOMMU API or flush all the mappings then map new ones. For type 2),
> > > just send the whole mapping through set_map()
> > I know that at least for RDMA based things, you can't change
> > a mapping if it's active. So drivers will need to figure out the
> > differences which just looks ugly: userspace knows what
> > it was changing (really just adding/removing some guest memory).
> 
> 
> Two methods:
> 
> 1) using IOTLB message VHOST_IOTLB_UPDATE/INVALIDATE
> 2) let vhost diff the two memory tables, which should not be too hard
> (compare two rb trees)


Right, but 2 is just such an obvious waste of cycles. Userspace knows what changed,
so why does vhost need to re-calculate it? No?
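
To illustrate what method 2) above would have vhost recompute, here is a
minimal sketch of diffing two memory tables. It is not taken from any
patch: it assumes the tables are plain arrays sorted by guest physical
address with non-overlapping regions (the thread mentions rb trees), and
the names struct mem_region and diff_tables() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memory table entry. */
struct mem_region {
    uint64_t gpa;   /* guest physical address */
    uint64_t size;  /* length in bytes */
    uint64_t hva;   /* userspace (host virtual) address */
};

static int region_equal(const struct mem_region *a, const struct mem_region *b)
{
    return a->gpa == b->gpa && a->size == b->size && a->hva == b->hva;
}

/*
 * Walk both tables in address order. Regions present only in prev go to
 * unmap[], regions present only in next go to map[]. A region whose start
 * matches but whose size or backing changed is unmapped and mapped again.
 * unmap[] must hold n_prev entries and map[] must hold n_next entries.
 * The caller is expected to apply all unmaps before any maps.
 */
static void diff_tables(const struct mem_region *prev, size_t n_prev,
                        const struct mem_region *next, size_t n_next,
                        struct mem_region *unmap, size_t *n_unmap,
                        struct mem_region *map, size_t *n_map)
{
    size_t i = 0, j = 0;

    assert(n_unmap != NULL && n_map != NULL);
    *n_unmap = 0;
    *n_map = 0;

    while (i < n_prev || j < n_next) {
        if (j == n_next || (i < n_prev && prev[i].gpa < next[j].gpa)) {
            unmap[(*n_unmap)++] = prev[i++];   /* removed region */
        } else if (i == n_prev || next[j].gpa < prev[i].gpa) {
            map[(*n_map)++] = next[j++];       /* added region */
        } else if (region_equal(&prev[i], &next[j])) {
            i++;                               /* unchanged, nothing to do */
            j++;
        } else {
            unmap[(*n_unmap)++] = prev[i++];   /* changed: remap */
            map[(*n_map)++] = next[j++];
        }
    }
}
```

The walk is linear in the number of regions, but its output is exactly the
add/remove list that userspace already had in hand before it built the new
table.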

> 
> > 
> > 
> > 
> > > - Reuse vhost IOTLB, so for type 1), simply forward update/invalidate
> > > request via IOMMU API, for type 2), send IOTLB to vDPA device driver via
> > > set_map(), device driver may choose to send diffs or rebuild all mapping at
> > > their will
> > > 
> > > > Technically we can use the vhost IOTLB API (map/unmap) for building
> > > > VHOST_SET_MEM_TABLE, but to avoid having the device process each request, it
> > > > looks to me like we need a new UAPI, which seems suboptimal.
> > > 
> > > > What are your thoughts?
> > > 
> > > Thanks
> > I suspect we can't completely avoid a new UAPI.
> 
> 
> AFAIK, the memory table usually contains just a few entries, so the performance
> cost should be fine. (At least it should be the same as in the case of VFIO.)
> 
> So in qemu, simply hooking add_region/remove_region to
> VHOST_IOTLB_UPDATE/VHOST_IOTLB_INVALIDATE should work?
> 
> If we introduce API like you proposed previously (memory listener style):
> 
> begin
> add
> remove
> commit
> 
> I suspect it will be too heavyweight for the case of vIOMMU, and for a
> driver that wants to build a new mapping, we need addnop etc...
> 
> Thanks
> 

I feel this can help some workloads but this can wait, for sure.
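
For reference, the simpler option quoted above (hooking add_region/remove_region
to VHOST_IOTLB_UPDATE/VHOST_IOTLB_INVALIDATE) could look roughly like the
sketch below. It is only an illustration under stated assumptions: the structs
are a local mirror of the uapi quoted earlier in this thread (real code would
include <linux/vhost_types.h> instead), the helper names region_add_msg() and
region_del_msg() are hypothetical, and iova is taken to be the GPA as in the
no-vIOMMU case Tiwei describes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the uapi quoted in this thread. */
#define VHOST_ACCESS_RW        0x3
#define VHOST_IOTLB_UPDATE     2
#define VHOST_IOTLB_INVALIDATE 3
#define VHOST_IOTLB_MSG_V2     0x2

struct vhost_iotlb_msg {
    uint64_t iova;
    uint64_t size;
    uint64_t uaddr;
    uint8_t perm;
    uint8_t type;
};

struct vhost_msg_v2 {
    uint32_t type;
    uint32_t reserved;
    union {
        struct vhost_iotlb_msg iotlb;
        uint8_t padding[64];
    };
};

/* Hypothetical helper: message for a region userspace has just added. */
static struct vhost_msg_v2 region_add_msg(uint64_t gpa, uint64_t size,
                                          uint64_t uaddr)
{
    struct vhost_msg_v2 msg;

    assert(size != 0);
    memset(&msg, 0, sizeof(msg));      /* also clears reserved and padding */
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova = gpa;              /* no vIOMMU assumed: iova is the GPA */
    msg.iotlb.size = size;
    msg.iotlb.uaddr = uaddr;
    msg.iotlb.perm = VHOST_ACCESS_RW;
    msg.iotlb.type = VHOST_IOTLB_UPDATE;
    return msg;
}

/* Hypothetical helper: message for a region userspace has just removed. */
static struct vhost_msg_v2 region_del_msg(uint64_t gpa, uint64_t size)
{
    struct vhost_msg_v2 msg;

    assert(size != 0);
    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova = gpa;
    msg.iotlb.size = size;             /* uaddr and perm left as zero here */
    msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
    return msg;
}
```

Each message would then be handed to the vhost device one at a time; how
vhost-vdpa would accept them is what the thread is still discussing, and a
batched begin/add/remove/commit style interface would sit on top of this.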


> > 
