[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <26ea3a3d-b06b-6256-7243-8ca9eae61bce@redhat.com>
Date: Thu, 24 Dec 2020 11:01:34 +0800
From: Jason Wang <jasowang@...hat.com>
To: Yongji Xie <xieyongji@...edance.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>, sgarzare@...hat.com,
Parav Pandit <parav@...dia.com>, akpm@...ux-foundation.org,
Randy Dunlap <rdunlap@...radead.org>,
Matthew Wilcox <willy@...radead.org>, viro@...iv.linux.org.uk,
axboe@...nel.dk, bcrl@...ck.org, corbet@....net,
virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
kvm@...r.kernel.org, linux-aio@...ck.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace
On 2020/12/23 下午10:17, Yongji Xie wrote:
> On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@...hat.com> wrote:
>>
>> On 2020/12/22 下午10:52, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> Both control path and data path of vDPA devices will be able to
>>> be handled in userspace.
>>>
>>> In the control path, the VDUSE driver will make use of message
>>> mechnism to forward the config operation from vdpa bus driver
>>> to userspace. Userspace can use read()/write() to receive/reply
>>> those control messages.
>>>
>>> In the data path, the VDUSE driver implements a MMU-based on-chip
>>> IOMMU driver which supports mapping the kernel dma buffer to a
>>> userspace iova region dynamically. Userspace can access those
>>> iova region via mmap(). Besides, the eventfd mechanism is used to
>>> trigger interrupt callbacks and receive virtqueue kicks in userspace
>>>
>>> Now we only support virtio-vdpa bus driver with this patch applied.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@...edance.com>
>>> ---
>>> Documentation/driver-api/vduse.rst | 74 ++
>>> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
>>> drivers/vdpa/Kconfig | 8 +
>>> drivers/vdpa/Makefile | 1 +
>>> drivers/vdpa/vdpa_user/Makefile | 5 +
>>> drivers/vdpa/vdpa_user/eventfd.c | 221 ++++
>>> drivers/vdpa/vdpa_user/eventfd.h | 48 +
>>> drivers/vdpa/vdpa_user/iova_domain.c | 442 ++++++++
>>> drivers/vdpa/vdpa_user/iova_domain.h | 93 ++
>>> drivers/vdpa/vdpa_user/vduse.h | 59 ++
>>> drivers/vdpa/vdpa_user/vduse_dev.c | 1121 ++++++++++++++++++++
>>> include/uapi/linux/vdpa.h | 1 +
>>> include/uapi/linux/vduse.h | 99 ++
>>> 13 files changed, 2173 insertions(+)
>>> create mode 100644 Documentation/driver-api/vduse.rst
>>> create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>> create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>>> create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>>> create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>> create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>> create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>>> create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>> create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
>>> new file mode 100644
>>> index 000000000000..da9b3040f20a
>>> --- /dev/null
>>> +++ b/Documentation/driver-api/vduse.rst
>>> @@ -0,0 +1,74 @@
>>> +==================================
>>> +VDUSE - "vDPA Device in Userspace"
>>> +==================================
>>> +
>>> +vDPA (virtio data path acceleration) device is a device that uses a
>>> +datapath which complies with the virtio specifications with vendor
>>> +specific control path. vDPA devices can be both physically located on
>>> +the hardware or emulated by software. VDUSE is a framework that makes it
>>> +possible to implement software-emulated vDPA devices in userspace.
>>> +
>>> +How VDUSE works
>>> +------------
>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
>>> +to the new resources will be returned, which can be used to implement the
>>> +userspace vDPA device's control path and data path.
>>> +
>>> +To implement control path, the read/write operations to the file descriptor
>>> +will be used to receive/reply the control messages from/to VDUSE driver.
>>> +Those control messages are based on the vdpa_config_ops which defines a
>>> +unified interface to control different types of vDPA device.
>>> +
>>> +The following types of messages are provided by the VDUSE framework now:
>>> +
>>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
>>> +
>>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
>>> +
>>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
>>> +
>>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
>>> +
>>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
>>> +
>>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
>>> +
>>> +- VDUSE_SET_STATUS: Set the device status
>>> +
>>> +- VDUSE_GET_STATUS: Get the device status
>>> +
>>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
>>> +
>>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
>>> +
>>> +Please see include/linux/vdpa.h for details.
>>> +
>>> +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
>>> +driver which supports mapping the kernel dma buffer to a userspace iova
>>> +region dynamically. The userspace iova region can be created by passing
>>> +the userspace vDPA device fd to mmap(2).
>>> +
>>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
>>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
>>> +vDPA device fd are provided to support that:
>>> +
>>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
>>> + by VDUSE driver to notify userspace to consume the vring.
>>> +
>>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
>>> + by userspace to notify VDUSE driver to trigger interrupt callbacks.
>>> +
>>> +MMU-based IOMMU Driver
>>> +----------------------
>>> +The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
>>> +IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
>>> +for the DMA transfer so that the userspace process is able to use its virtual
>>> +address to access the dma buffer in kernel.
>>> +
>>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
>>> +prevent userspace accessing the original buffer directly which may contain other
>>> +kernel data. During the mapping, unmapping, the driver will copy the data from
>>> +the original buffer to the bounce buffer and back, depending on the direction of
>>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
>>> +space instead of the original one.
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index a4c75a28c839..71722e6f8f23 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code Seq# Include File Comments
>>> 'z' 10-4F drivers/s390/crypto/zcrypt_api.h conflict!
>>> '|' 00-7F linux/media.h
>>> 0x80 00-1F linux/fb.h
>>> +0x81 00-1F linux/vduse.h
>>> 0x89 00-06 arch/x86/include/asm/sockios.h
>>> 0x89 0B-DF linux/sockios.h
>>> 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index 4be7be39be26..211cc449cbd3 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -21,6 +21,14 @@ config VDPA_SIM
>>> to RX. This device is used for testing, prototyping and
>>> development of vDPA.
>>>
>>> +config VDPA_USER
>>> + tristate "VDUSE (vDPA Device in Userspace) support"
>>> + depends on EVENTFD && MMU && HAS_DMA
>>> + default n
>>
>> The "default n" is not necessary.
>>
> OK.
>>> + help
>>> + With VDUSE it is possible to emulate a vDPA Device
>>> + in a userspace program.
>>> +
>>> config IFCVF
>>> tristate "Intel IFC VF vDPA driver"
>>> depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index d160e9b63a66..66e97778ad03 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,5 +1,6 @@
>>> # SPDX-License-Identifier: GPL-2.0
>>> obj-$(CONFIG_VDPA) += vdpa.o
>>> obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>> obj-$(CONFIG_IFCVF) += ifcvf/
>>> obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..b7645e36992b
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
>>
>> Do we really need eventfd.o here consider we've selected it.
>>
> Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?
My bad, I confuse this with the common eventfd. So the code is fine here.
>
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
>>> new file mode 100644
>>> index 000000000000..dbffddb08908
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
>>> @@ -0,0 +1,221 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@...edance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/file.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "eventfd.h"
>>> +
>>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
>>> +
>>> +static void vduse_virqfd_shutdown(struct work_struct *work)
>>> +{
>>> + u64 cnt;
>>> + struct vduse_virqfd *virqfd = container_of(work,
>>> + struct vduse_virqfd, shutdown);
>>> +
>>> + eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
>>> + flush_work(&virqfd->inject);
>>> + eventfd_ctx_put(virqfd->ctx);
>>> + kfree(virqfd);
>>> +}
>>> +
>>> +static void vduse_virqfd_inject(struct work_struct *work)
>>> +{
>>> + struct vduse_virqfd *virqfd = container_of(work,
>>> + struct vduse_virqfd, inject);
>>> + struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> + spin_lock_irq(&vq->irq_lock);
>>> + if (vq->ready && vq->cb)
>>> + vq->cb(vq->private);
>>> + spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
>>> +{
>>> + queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
>>> +}
>>> +
>>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
>>> + int sync, void *key)
>>> +{
>>> + struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
>>> + struct vduse_virtqueue *vq = virqfd->vq;
>>> +
>>> + __poll_t flags = key_to_poll(key);
>>> +
>>> + if (flags & EPOLLIN)
>>> + schedule_work(&virqfd->inject);
>>> +
>>> + if (flags & EPOLLHUP) {
>>> + spin_lock(&vq->irq_lock);
>>> + if (vq->virqfd == virqfd) {
>>> + vq->virqfd = NULL;
>>> + virqfd_deactivate(virqfd);
>>> + }
>>> + spin_unlock(&vq->irq_lock);
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
>>> + wait_queue_head_t *wqh, poll_table *pt)
>>> +{
>>> + struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
>>> +
>>> + add_wait_queue(wqh, &virqfd->wait);
>>> +}
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> + struct vduse_vq_eventfd *eventfd)
>>> +{
>>> + struct vduse_virqfd *virqfd;
>>> + struct fd irqfd;
>>> + struct eventfd_ctx *ctx;
>>> + struct vduse_virtqueue *vq;
>>> + __poll_t events;
>>> + int ret;
>>> +
>>> + if (eventfd->index >= dev->vq_num)
>>> + return -EINVAL;
>>> +
>>> + vq = &dev->vqs[eventfd->index];
>>> + virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
>>> + if (!virqfd)
>>> + return -ENOMEM;
>>> +
>>> + INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
>>> + INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
>>
>> Any reason that a workqueue is must here?
>>
> Mainly for performance considerations. Make sure the push() and pop()
> for used vring can be asynchronous.
I see.
>
>>> +
>>> + ret = -EBADF;
>>> + irqfd = fdget(eventfd->fd);
>>> + if (!irqfd.file)
>>> + goto err_fd;
>>> +
>>> + ctx = eventfd_ctx_fileget(irqfd.file);
>>> + if (IS_ERR(ctx)) {
>>> + ret = PTR_ERR(ctx);
>>> + goto err_ctx;
>>> + }
>>> +
>>> + virqfd->vq = vq;
>>> + virqfd->ctx = ctx;
>>> + spin_lock(&vq->irq_lock);
>>> + if (vq->virqfd)
>>> + virqfd_deactivate(virqfd);
>>> + vq->virqfd = virqfd;
>>> + spin_unlock(&vq->irq_lock);
>>> +
>>> + init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
>>> + init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
>>> +
>>> + events = vfs_poll(irqfd.file, &virqfd->pt);
>>> +
>>> + /*
>>> + * Check if there was an event already pending on the eventfd
>>> + * before we registered and trigger it as if we didn't miss it.
>>> + */
>>> + if (events & EPOLLIN)
>>> + schedule_work(&virqfd->inject);
>>> +
>>> + fdput(irqfd);
>>> +
>>> + return 0;
>>> +err_ctx:
>>> + fdput(irqfd);
>>> +err_fd:
>>> + kfree(virqfd);
>>> + return ret;
>>> +}
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev)
>>> +{
>>> + int i;
>>> +
>>> + for (i = 0; i < dev->vq_num; i++) {
>>> + struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> + spin_lock(&vq->irq_lock);
>>> + if (vq->virqfd) {
>>> + virqfd_deactivate(vq->virqfd);
>>> + vq->virqfd = NULL;
>>> + }
>>> + spin_unlock(&vq->irq_lock);
>>> + }
>>> + flush_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +int vduse_virqfd_init(void)
>>> +{
>>> + vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
>>> + WQ_UNBOUND, 0);
>>> + if (!vduse_irqfd_cleanup_wq)
>>> + return -ENOMEM;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +void vduse_virqfd_exit(void)
>>> +{
>>> + destroy_workqueue(vduse_irqfd_cleanup_wq);
>>> +}
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
>>> +{
>>> + spin_lock(&vq->kick_lock);
>>> + if (vq->ready && vq->kickfd)
>>> + eventfd_signal(vq->kickfd, 1);
>>> + spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> + struct vduse_vq_eventfd *eventfd)
>>> +{
>>> + struct eventfd_ctx *ctx;
>>> + struct vduse_virtqueue *vq;
>>> +
>>> + if (eventfd->index >= dev->vq_num)
>>> + return -EINVAL;
>>> +
>>> + vq = &dev->vqs[eventfd->index];
>>> + ctx = eventfd_ctx_fdget(eventfd->fd);
>>> + if (IS_ERR(ctx))
>>> + return PTR_ERR(ctx);
>>> +
>>> + spin_lock(&vq->kick_lock);
>>> + if (vq->kickfd)
>>> + eventfd_ctx_put(vq->kickfd);
>>> + vq->kickfd = ctx;
>>> + spin_unlock(&vq->kick_lock);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev)
>>> +{
>>> + int i;
>>> +
>>> + for (i = 0; i < dev->vq_num; i++) {
>>> + struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> + spin_lock(&vq->kick_lock);
>>> + if (vq->kickfd) {
>>> + eventfd_ctx_put(vq->kickfd);
>>> + vq->kickfd = NULL;
>>> + }
>>> + spin_unlock(&vq->kick_lock);
>>> + }
>>> +}
>>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
>>> new file mode 100644
>>> index 000000000000..14269ff27f47
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
>>> @@ -0,0 +1,48 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +/*
>>> + * Eventfd support for VDUSE
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@...edance.com>
>>> + *
>>> + */
>>> +
>>> +#ifndef _VDUSE_EVENTFD_H
>>> +#define _VDUSE_EVENTFD_H
>>> +
>>> +#include <linux/eventfd.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/wait.h>
>>> +#include <uapi/linux/vduse.h>
>>> +
>>> +#include "vduse.h"
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_virqfd {
>>> + struct eventfd_ctx *ctx;
>>> + struct vduse_virtqueue *vq;
>>> + struct work_struct inject;
>>> + struct work_struct shutdown;
>>> + wait_queue_entry_t wait;
>>> + poll_table pt;
>>> +};
>>> +
>>> +int vduse_virqfd_setup(struct vduse_dev *dev,
>>> + struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_virqfd_release(struct vduse_dev *dev);
>>> +
>>> +int vduse_virqfd_init(void);
>>> +
>>> +void vduse_virqfd_exit(void);
>>> +
>>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
>>> +
>>> +int vduse_kickfd_setup(struct vduse_dev *dev,
>>> + struct vduse_vq_eventfd *eventfd);
>>> +
>>> +void vduse_kickfd_release(struct vduse_dev *dev);
>>> +
>>> +#endif /* _VDUSE_EVENTFD_H */
>>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
>>> new file mode 100644
>>> index 000000000000..27022157abc6
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
>>> @@ -0,0 +1,442 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * MMU-based IOMMU implementation
>>> + *
>>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@...edance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/wait.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/genalloc.h>
>>> +#include <linux/dma-mapping.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define IOVA_CHUNK_SHIFT 26
>>> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
>>> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
>>> +
>>> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
>>> +
>>> +#define IOVA_ALLOC_ORDER 12
>>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
>>> +
>>> +struct vduse_mmap_vma {
>>> + struct vm_area_struct *vma;
>>> + struct list_head list;
>>> +};
>>> +
>>> +static inline struct page *
>>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
>>> + unsigned long iova)
>>> +{
>>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> + unsigned long pgindex = chunkoff >> PAGE_SHIFT;
>>> +
>>> + return domain->chunks[index].bounce_pages[pgindex];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
>>> + unsigned long iova, struct page *page)
>>> +{
>>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> + unsigned long pgindex = chunkoff >> PAGE_SHIFT;
>>> +
>>> + domain->chunks[index].bounce_pages[pgindex] = page;
>>> +}
>>> +
>>> +static inline struct vduse_iova_map *
>>> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
>>> + unsigned long iova)
>>> +{
>>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> + unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
>>> +
>>> + return domain->chunks[index].iova_map[mapindex];
>>> +}
>>> +
>>> +static inline void
>>> +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
>>> + unsigned long iova, struct vduse_iova_map *map)
>>> +{
>>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
>>> + unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
>>> +
>>> + domain->chunks[index].iova_map[mapindex] = map;
>>> +}
>>> +
>>> +static int
>>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
>>> + unsigned long iova, size_t size)
>>> +{
>>> + struct page *page;
>>> + size_t walk_sz = 0;
>>> + int frees = 0;
>>> +
>>> + while (walk_sz < size) {
>>> + page = vduse_domain_get_bounce_page(domain, iova);
>>> + if (page) {
>>> + vduse_domain_set_bounce_page(domain, iova, NULL);
>>> + put_page(page);
>>> + frees++;
>>> + }
>>> + iova += PAGE_SIZE;
>>> + walk_sz += PAGE_SIZE;
>>> + }
>>> +
>>> + return frees;
>>> +}
>>> +
>>> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
>>> + struct vm_area_struct *vma)
>>> +{
>>> + unsigned long size = vma->vm_end - vma->vm_start;
>>> + struct vduse_mmap_vma *mmap_vma;
>>> +
>>> + if (WARN_ON(size != domain->size))
>>> + return -EINVAL;
>>> +
>>> + mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>>> + if (!mmap_vma)
>>> + return -ENOMEM;
>>> +
>>> + mmap_vma->vma = vma;
>>> + mutex_lock(&domain->vma_lock);
>>> + list_add(&mmap_vma->list, &domain->vma_list);
>>> + mutex_unlock(&domain->vma_lock);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
>>> + struct vm_area_struct *vma)
>>> +{
>>> + struct vduse_mmap_vma *mmap_vma;
>>> +
>>> + mutex_lock(&domain->vma_lock);
>>> + list_for_each_entry(mmap_vma, &domain->vma_list, list) {
>>> + if (mmap_vma->vma == vma) {
>>> + list_del(&mmap_vma->list);
>>> + kfree(mmap_vma);
>>> + break;
>>> + }
>>> + }
>>> + mutex_unlock(&domain->vma_lock);
>>> +}
>>> +
>>> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
>>> + unsigned long iova, unsigned long orig,
>>> + size_t size, enum dma_data_direction dir)
>>> +{
>>> + struct vduse_iova_map *map;
>>> + unsigned long last = iova + size;
>>> +
>>> + map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
>>> + if (!map)
>>> + return -ENOMEM;
>>> +
>>> + map->iova = iova;
>>> + map->orig = orig;
>>> + map->size = size;
>>> + map->dir = dir;
>>> +
>>> + while (iova < last) {
>>> + vduse_domain_set_iova_map(domain, iova, map);
>>> + iova += IOVA_ALLOC_SIZE;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +struct vduse_iova_map *
>>> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
>>> + unsigned long iova)
>>> +{
>>> + return vduse_domain_get_iova_map(domain, iova);
>>> +}
>>> +
>>> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
>>> + struct vduse_iova_map *map)
>>> +{
>>> + unsigned long iova = map->iova;
>>> + unsigned long last = iova + map->size;
>>> +
>>> + while (iova < last) {
>>> + vduse_domain_set_iova_map(domain, iova, NULL);
>>> + iova += IOVA_ALLOC_SIZE;
>>> + }
>>> +}
>>> +
>>> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
>>> + unsigned long iova, size_t size)
>>> +{
>>> + struct vduse_mmap_vma *mmap_vma;
>>> + unsigned long uaddr;
>>> +
>>> + mutex_lock(&domain->vma_lock);
>>> + list_for_each_entry(mmap_vma, &domain->vma_list, list) {
>>> + mmap_read_lock(mmap_vma->vma->vm_mm);
>>> + uaddr = iova + mmap_vma->vma->vm_start;
>>> + zap_page_range(mmap_vma->vma, uaddr, size);
>>> + mmap_read_unlock(mmap_vma->vma->vm_mm);
>>> + }
>>> + mutex_unlock(&domain->vma_lock);
>>> +}
>>> +
>>> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
>>> + struct vm_area_struct *vma, unsigned long iova)
>>> +{
>>> + unsigned long uaddr = iova + vma->vm_start;
>>> + unsigned long start = iova & PAGE_MASK;
>>> + unsigned long last = start + PAGE_SIZE - 1;
>>> + unsigned long offset;
>>> + struct vduse_iova_map *map;
>>> + struct page *page = NULL;
>>> +
>>> + map = vduse_domain_get_iova_map(domain, iova);
>>> + if (map) {
>>> + offset = last - map->iova;
>>> + page = virt_to_page(map->orig + offset);
>>> + }
>>> +
>>> + return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
>>> +}
>>
>> So as we discussed before, we need to find way to make vhost work. And
>> it's better to make vhost transparent to VDUSE. One idea is to implement
>> shadow virtqueue here, that is, instead of trying to insert the pages to
>> VDUSE userspace, we use the shadow virtqueue to relay the descriptors to
>> userspace. With this, we don't need stuffs like shmfd etc.
>>
> Good idea! The disadvantage is performance will go down (one more
> thread switch overhead and vhost-liked kworker will become bottleneck
> without multi-thread support).
Yes, the disadvantage is the performance. But it should be simpler (I
guess) and we know it can succeed.
> I think I can try this in v3. And the
> MMU-based IOMMU implementation can be a future optimization in the
> virtio-vdpa case. What's your opinion?
Maybe I was wrong, but I think we can try as what has been proposed here
first and use shadow virtqueue as backup plan if we fail.
>
>>> +
>>> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
>>> + unsigned long iova, unsigned long orig,
>>> + size_t size, enum dma_data_direction dir)
>>> +{
>>> + unsigned int offset = offset_in_page(iova);
>>> +
>>> + while (size) {
>>> + struct page *p = vduse_domain_get_bounce_page(domain, iova);
>>> + size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
>>> + void *addr;
>>> +
>>> + if (p) {
>>> + addr = page_address(p) + offset;
>>> + if (dir == DMA_TO_DEVICE)
>>> + memcpy(addr, (void *)orig, copy_len);
>>> + else if (dir == DMA_FROM_DEVICE)
>>> + memcpy((void *)orig, addr, copy_len);
>>> + }
>>
>> I think I miss something, for DMA_FROM_DEVICE, if p doesn't exist how is
>> it expected to work? Or do we need to warn here in this case?
>>
> Yes, I think we need a WARN_ON here.
Ok.
>
>
>>> + size -= copy_len;
>>> + orig += copy_len;
>>> + iova += copy_len;
>>> + offset = 0;
>>> + }
>>> +}
>>> +
>>> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
>>> + struct vm_area_struct *vma, unsigned long iova)
>>> +{
>>> + unsigned long uaddr = iova + vma->vm_start;
>>> + unsigned long start = iova & PAGE_MASK;
>>> + unsigned long offset = 0;
>>> + bool found = false;
>>> + struct vduse_iova_map *map;
>>> + struct page *page;
>>> +
>>> + mutex_lock(&domain->map_lock);
>>> +
>>> + page = vduse_domain_get_bounce_page(domain, iova);
>>> + if (page)
>>> + goto unlock;
>>> +
>>> + page = alloc_page(GFP_KERNEL);
>>> + if (!page)
>>> + goto unlock;
>>> +
>>> + while (offset < PAGE_SIZE) {
>>> + unsigned int src_offset = 0, dst_offset = 0;
>>> + void *src, *dst;
>>> + size_t copy_len;
>>> +
>>> + map = vduse_domain_get_iova_map(domain, start + offset);
>>> + if (!map) {
>>> + offset += IOVA_ALLOC_SIZE;
>>> + continue;
>>> + }
>>> +
>>> + found = true;
>>> + offset += map->size;
>>> + if (map->dir == DMA_FROM_DEVICE)
>>> + continue;
>>> +
>>> + if (start > map->iova)
>>> + src_offset = start - map->iova;
>>> + else
>>> + dst_offset = map->iova - start;
>>> +
>>> + src = (void *)(map->orig + src_offset);
>>> + dst = page_address(page) + dst_offset;
>>> + copy_len = min_t(size_t, map->size - src_offset,
>>> + PAGE_SIZE - dst_offset);
>>> + memcpy(dst, src, copy_len);
>>> + }
>>> + if (!found) {
>>> + put_page(page);
>>> + page = NULL;
>>> + }
>>> + vduse_domain_set_bounce_page(domain, iova, page);
>>> +unlock:
>>> + mutex_unlock(&domain->map_lock);
>>> +
>>> + return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
>>> +}
>>> +
>>> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
>>> + unsigned long iova)
>>> +{
>>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT;
>>> + struct vduse_iova_chunk *chunk = &domain->chunks[index];
>>> +
>>> + return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
>>> +}
>>> +
>>> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
>>> + size_t size, enum iova_map_type type)
>>> +{
>>> + struct vduse_iova_chunk *chunk;
>>> + unsigned long iova = 0;
>>> + int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
>>> + struct genpool_data_align data = { .align = align };
>>> + int i;
>>> +
>>> + for (i = 0; i < domain->chunk_num; i++) {
>>> + chunk = &domain->chunks[i];
>>> + if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
>>> + atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
>>> +
>>> + if (atomic_read(&chunk->map_type) != type)
>>> + continue;
>>> +
>>> + iova = gen_pool_alloc_algo(chunk->pool, size,
>>> + gen_pool_first_fit_align, &data);
>>> + if (iova)
>>> + break;
>>> + }
>>> +
>>> + return iova;
>>
>> I wonder why not just reuse the iova domain implements in
>> driver/iommu/iova.c
>>
> The iova domain in driver/iommu/iova.c is only an iova allocator which
> is implemented by the genpool memory allocator in our case. The other
> part in our iova domain is chunk management and iova_map management.
> We need different chunks to distinguish different dma mapping types:
> consistent mapping or streaming mapping. We can only use
> bouncing-mechanism in the streaming mapping case.
To differ dma mappings, you can use two iova domains with different
ranges. It looks simpler than the gen_pool. (AFAIK most IOMMU driver is
using iova domain).
Thanks
Powered by blists - more mailing lists