lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 24 Dec 2020 16:34:20 +0800
From:   Yongji Xie <xieyongji@...edance.com>
To:     Jason Wang <jasowang@...hat.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>,
        Stefan Hajnoczi <stefanha@...hat.com>, sgarzare@...hat.com,
        Parav Pandit <parav@...dia.com>, akpm@...ux-foundation.org,
        Randy Dunlap <rdunlap@...radead.org>,
        Matthew Wilcox <willy@...radead.org>, viro@...iv.linux.org.uk,
        axboe@...nel.dk, bcrl@...ck.org, corbet@....net,
        virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
        kvm@...r.kernel.org, linux-aio@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

On Thu, Dec 24, 2020 at 11:01 AM Jason Wang <jasowang@...hat.com> wrote:
>
>
> On 2020/12/23 下午10:17, Yongji Xie wrote:
> > On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@...hat.com> wrote:
> >>
> >> On 2020/12/22 下午10:52, Xie Yongji wrote:
> >>> This VDUSE driver enables implementing vDPA devices in userspace.
> >>> Both control path and data path of vDPA devices will be able to
> >>> be handled in userspace.
> >>>
> >>> In the control path, the VDUSE driver will make use of message
> >>> mechnism to forward the config operation from vdpa bus driver
> >>> to userspace. Userspace can use read()/write() to receive/reply
> >>> those control messages.
> >>>
> >>> In the data path, the VDUSE driver implements a MMU-based on-chip
> >>> IOMMU driver which supports mapping the kernel dma buffer to a
> >>> userspace iova region dynamically. Userspace can access those
> >>> iova region via mmap(). Besides, the eventfd mechanism is used to
> >>> trigger interrupt callbacks and receive virtqueue kicks in userspace
> >>>
> >>> Now we only support virtio-vdpa bus driver with this patch applied.
> >>>
> >>> Signed-off-by: Xie Yongji <xieyongji@...edance.com>
> >>> ---
> >>>    Documentation/driver-api/vduse.rst                 |   74 ++
> >>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >>>    drivers/vdpa/Kconfig                               |    8 +
> >>>    drivers/vdpa/Makefile                              |    1 +
> >>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >>>    drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
> >>>    drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
> >>>    drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
> >>>    drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
> >>>    drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
> >>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
> >>>    include/uapi/linux/vdpa.h                          |    1 +
> >>>    include/uapi/linux/vduse.h                         |   99 ++
> >>>    13 files changed, 2173 insertions(+)
> >>>    create mode 100644 Documentation/driver-api/vduse.rst
> >>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >>>    create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse.h
> >>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >>>    create mode 100644 include/uapi/linux/vduse.h
> >>>
> >>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
> >>> new file mode 100644
> >>> index 000000000000..da9b3040f20a
> >>> --- /dev/null
> >>> +++ b/Documentation/driver-api/vduse.rst
> >>> @@ -0,0 +1,74 @@
> >>> +==================================
> >>> +VDUSE - "vDPA Device in Userspace"
> >>> +==================================
> >>> +
> >>> +vDPA (virtio data path acceleration) device is a device that uses a
> >>> +datapath which complies with the virtio specifications with vendor
> >>> +specific control path. vDPA devices can be both physically located on
> >>> +the hardware or emulated by software. VDUSE is a framework that makes it
> >>> +possible to implement software-emulated vDPA devices in userspace.
> >>> +
> >>> +How VDUSE works
> >>> +------------
> >>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> >>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing
> >>> +to the new resources will be returned, which can be used to implement the
> >>> +userspace vDPA device's control path and data path.
> >>> +
> >>> +To implement control path, the read/write operations to the file descriptor
> >>> +will be used to receive/reply the control messages from/to VDUSE driver.
> >>> +Those control messages are based on the vdpa_config_ops which defines a
> >>> +unified interface to control different types of vDPA device.
> >>> +
> >>> +The following types of messages are provided by the VDUSE framework now:
> >>> +
> >>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
> >>> +
> >>> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> >>> +
> >>> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> >>> +
> >>> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> >>> +
> >>> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> >>> +
> >>> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> >>> +
> >>> +- VDUSE_SET_STATUS: Set the device status
> >>> +
> >>> +- VDUSE_GET_STATUS: Get the device status
> >>> +
> >>> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> >>> +
> >>> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> >>> +
> >>> +Please see include/linux/vdpa.h for details.
> >>> +
> >>> +In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
> >>> +driver which supports mapping the kernel dma buffer to a userspace iova
> >>> +region dynamically. The userspace iova region can be created by passing
> >>> +the userspace vDPA device fd to mmap(2).
> >>> +
> >>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and
> >>> +receive virtqueue kicks in userspace. The following ioctls on the userspace
> >>> +vDPA device fd are provided to support that:
> >>> +
> >>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> >>> +  by VDUSE driver to notify userspace to consume the vring.
> >>> +
> >>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
> >>> +  by userspace to notify VDUSE driver to trigger interrupt callbacks.
> >>> +
> >>> +MMU-based IOMMU Driver
> >>> +----------------------
> >>> +The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
> >>> +IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
> >>> +for the DMA transfer so that the userspace process is able to use its virtual
> >>> +address to access the dma buffer in kernel.
> >>> +
> >>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
> >>> +prevent userspace accessing the original buffer directly which may contain other
> >>> +kernel data. During the mapping, unmapping, the driver will copy the data from
> >>> +the original buffer to the bounce buffer and back, depending on the direction of
> >>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
> >>> +space instead of the original one.
> >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> index a4c75a28c839..71722e6f8f23 100644
> >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> >>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
> >>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
> >>>    '|'   00-7F  linux/media.h
> >>>    0x80  00-1F  linux/fb.h
> >>> +0x81  00-1F  linux/vduse.h
> >>>    0x89  00-06  arch/x86/include/asm/sockios.h
> >>>    0x89  0B-DF  linux/sockios.h
> >>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> >>> index 4be7be39be26..211cc449cbd3 100644
> >>> --- a/drivers/vdpa/Kconfig
> >>> +++ b/drivers/vdpa/Kconfig
> >>> @@ -21,6 +21,14 @@ config VDPA_SIM
> >>>          to RX. This device is used for testing, prototyping and
> >>>          development of vDPA.
> >>>
> >>> +config VDPA_USER
> >>> +     tristate "VDUSE (vDPA Device in Userspace) support"
> >>> +     depends on EVENTFD && MMU && HAS_DMA
> >>> +     default n
> >>
> >> The "default n" is not necessary.
> >>
> > OK.
> >>> +     help
> >>> +       With VDUSE it is possible to emulate a vDPA Device
> >>> +       in a userspace program.
> >>> +
> >>>    config IFCVF
> >>>        tristate "Intel IFC VF vDPA driver"
> >>>        depends on PCI_MSI
> >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> >>> index d160e9b63a66..66e97778ad03 100644
> >>> --- a/drivers/vdpa/Makefile
> >>> +++ b/drivers/vdpa/Makefile
> >>> @@ -1,5 +1,6 @@
> >>>    # SPDX-License-Identifier: GPL-2.0
> >>>    obj-$(CONFIG_VDPA) += vdpa.o
> >>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >>>    obj-$(CONFIG_IFCVF)    += ifcvf/
> >>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> >>> new file mode 100644
> >>> index 000000000000..b7645e36992b
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/Makefile
> >>> @@ -0,0 +1,5 @@
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +
> >>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o
> >>
> >> Do we really need eventfd.o here consider we've selected it.
> >>
> > Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?
>
>
> My bad, I confuse this with the common eventfd. So the code is fine here.
>
>
> >
> >>> +
> >>> +obj-$(CONFIG_VDPA_USER) += vduse.o
> >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
> >>> new file mode 100644
> >>> index 000000000000..dbffddb08908
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/eventfd.c
> >>> @@ -0,0 +1,221 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * Eventfd support for VDUSE
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@...edance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/wait.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/file.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +
> >>> +#include "eventfd.h"
> >>> +
> >>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq;
> >>> +
> >>> +static void vduse_virqfd_shutdown(struct work_struct *work)
> >>> +{
> >>> +     u64 cnt;
> >>> +     struct vduse_virqfd *virqfd = container_of(work,
> >>> +                                     struct vduse_virqfd, shutdown);
> >>> +
> >>> +     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
> >>> +     flush_work(&virqfd->inject);
> >>> +     eventfd_ctx_put(virqfd->ctx);
> >>> +     kfree(virqfd);
> >>> +}
> >>> +
> >>> +static void vduse_virqfd_inject(struct work_struct *work)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(work,
> >>> +                                     struct vduse_virqfd, inject);
> >>> +     struct vduse_virtqueue *vq = virqfd->vq;
> >>> +
> >>> +     spin_lock_irq(&vq->irq_lock);
> >>> +     if (vq->ready && vq->cb)
> >>> +             vq->cb(vq->private);
> >>> +     spin_unlock_irq(&vq->irq_lock);
> >>> +}
> >>> +
> >>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd)
> >>> +{
> >>> +     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
> >>> +}
> >>> +
> >>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> >>> +                             int sync, void *key)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
> >>> +     struct vduse_virtqueue *vq = virqfd->vq;
> >>> +
> >>> +     __poll_t flags = key_to_poll(key);
> >>> +
> >>> +     if (flags & EPOLLIN)
> >>> +             schedule_work(&virqfd->inject);
> >>> +
> >>> +     if (flags & EPOLLHUP) {
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             if (vq->virqfd == virqfd) {
> >>> +                     vq->virqfd = NULL;
> >>> +                     virqfd_deactivate(virqfd);
> >>> +             }
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +static void vduse_virqfd_ptable_queue_proc(struct file *file,
> >>> +                     wait_queue_head_t *wqh, poll_table *pt)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
> >>> +
> >>> +     add_wait_queue(wqh, &virqfd->wait);
> >>> +}
> >>> +
> >>> +int vduse_virqfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct vduse_virqfd *virqfd;
> >>> +     struct fd irqfd;
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     __poll_t events;
> >>> +     int ret;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
> >>> +     if (!virqfd)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
> >>> +     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);
> >>
> >> Any reason that a workqueue is must here?
> >>
> > Mainly for performance considerations. Make sure the push() and pop()
> > for used vring can be asynchronous.
>
>
> I see.
>
>
> >
> >>> +
> >>> +     ret = -EBADF;
> >>> +     irqfd = fdget(eventfd->fd);
> >>> +     if (!irqfd.file)
> >>> +             goto err_fd;
> >>> +
> >>> +     ctx = eventfd_ctx_fileget(irqfd.file);
> >>> +     if (IS_ERR(ctx)) {
> >>> +             ret = PTR_ERR(ctx);
> >>> +             goto err_ctx;
> >>> +     }
> >>> +
> >>> +     virqfd->vq = vq;
> >>> +     virqfd->ctx = ctx;
> >>> +     spin_lock(&vq->irq_lock);
> >>> +     if (vq->virqfd)
> >>> +             virqfd_deactivate(virqfd);
> >>> +     vq->virqfd = virqfd;
> >>> +     spin_unlock(&vq->irq_lock);
> >>> +
> >>> +     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
> >>> +     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
> >>> +
> >>> +     events = vfs_poll(irqfd.file, &virqfd->pt);
> >>> +
> >>> +     /*
> >>> +      * Check if there was an event already pending on the eventfd
> >>> +      * before we registered and trigger it as if we didn't miss it.
> >>> +      */
> >>> +     if (events & EPOLLIN)
> >>> +             schedule_work(&virqfd->inject);
> >>> +
> >>> +     fdput(irqfd);
> >>> +
> >>> +     return 0;
> >>> +err_ctx:
> >>> +     fdput(irqfd);
> >>> +err_fd:
> >>> +     kfree(virqfd);
> >>> +     return ret;
> >>> +}
> >>> +
> >>> +void vduse_virqfd_release(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->irq_lock);
> >>> +             if (vq->virqfd) {
> >>> +                     virqfd_deactivate(vq->virqfd);
> >>> +                     vq->virqfd = NULL;
> >>> +             }
> >>> +             spin_unlock(&vq->irq_lock);
> >>> +     }
> >>> +     flush_workqueue(vduse_irqfd_cleanup_wq);
> >>> +}
> >>> +
> >>> +int vduse_virqfd_init(void)
> >>> +{
> >>> +     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
> >>> +                                             WQ_UNBOUND, 0);
> >>> +     if (!vduse_irqfd_cleanup_wq)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_virqfd_exit(void)
> >>> +{
> >>> +     destroy_workqueue(vduse_irqfd_cleanup_wq);
> >>> +}
> >>> +
> >>> +void vduse_vq_kick(struct vduse_virtqueue *vq)
> >>> +{
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->ready && vq->kickfd)
> >>> +             eventfd_signal(vq->kickfd, 1);
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +}
> >>> +
> >>> +int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd)
> >>> +{
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +
> >>> +     if (eventfd->index >= dev->vq_num)
> >>> +             return -EINVAL;
> >>> +
> >>> +     vq = &dev->vqs[eventfd->index];
> >>> +     ctx = eventfd_ctx_fdget(eventfd->fd);
> >>> +     if (IS_ERR(ctx))
> >>> +             return PTR_ERR(ctx);
> >>> +
> >>> +     spin_lock(&vq->kick_lock);
> >>> +     if (vq->kickfd)
> >>> +             eventfd_ctx_put(vq->kickfd);
> >>> +     vq->kickfd = ctx;
> >>> +     spin_unlock(&vq->kick_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_kickfd_release(struct vduse_dev *dev)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < dev->vq_num; i++) {
> >>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
> >>> +
> >>> +             spin_lock(&vq->kick_lock);
> >>> +             if (vq->kickfd) {
> >>> +                     eventfd_ctx_put(vq->kickfd);
> >>> +                     vq->kickfd = NULL;
> >>> +             }
> >>> +             spin_unlock(&vq->kick_lock);
> >>> +     }
> >>> +}
> >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
> >>> new file mode 100644
> >>> index 000000000000..14269ff27f47
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/eventfd.h
> >>> @@ -0,0 +1,48 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>> +/*
> >>> + * Eventfd support for VDUSE
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@...edance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#ifndef _VDUSE_EVENTFD_H
> >>> +#define _VDUSE_EVENTFD_H
> >>> +
> >>> +#include <linux/eventfd.h>
> >>> +#include <linux/poll.h>
> >>> +#include <linux/wait.h>
> >>> +#include <uapi/linux/vduse.h>
> >>> +
> >>> +#include "vduse.h"
> >>> +
> >>> +struct vduse_dev;
> >>> +
> >>> +struct vduse_virqfd {
> >>> +     struct eventfd_ctx *ctx;
> >>> +     struct vduse_virtqueue *vq;
> >>> +     struct work_struct inject;
> >>> +     struct work_struct shutdown;
> >>> +     wait_queue_entry_t wait;
> >>> +     poll_table pt;
> >>> +};
> >>> +
> >>> +int vduse_virqfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd);
> >>> +
> >>> +void vduse_virqfd_release(struct vduse_dev *dev);
> >>> +
> >>> +int vduse_virqfd_init(void);
> >>> +
> >>> +void vduse_virqfd_exit(void);
> >>> +
> >>> +void vduse_vq_kick(struct vduse_virtqueue *vq);
> >>> +
> >>> +int vduse_kickfd_setup(struct vduse_dev *dev,
> >>> +                     struct vduse_vq_eventfd *eventfd);
> >>> +
> >>> +void vduse_kickfd_release(struct vduse_dev *dev);
> >>> +
> >>> +#endif /* _VDUSE_EVENTFD_H */
> >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> new file mode 100644
> >>> index 000000000000..27022157abc6
> >>> --- /dev/null
> >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c
> >>> @@ -0,0 +1,442 @@
> >>> +// SPDX-License-Identifier: GPL-2.0-only
> >>> +/*
> >>> + * MMU-based IOMMU implementation
> >>> + *
> >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + *
> >>> + * Author: Xie Yongji <xieyongji@...edance.com>
> >>> + *
> >>> + */
> >>> +
> >>> +#include <linux/wait.h>
> >>> +#include <linux/slab.h>
> >>> +#include <linux/genalloc.h>
> >>> +#include <linux/dma-mapping.h>
> >>> +
> >>> +#include "iova_domain.h"
> >>> +
> >>> +#define IOVA_CHUNK_SHIFT 26
> >>> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
> >>> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
> >>> +
> >>> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
> >>> +
> >>> +#define IOVA_ALLOC_ORDER 12
> >>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
> >>> +
> >>> +struct vduse_mmap_vma {
> >>> +     struct vm_area_struct *vma;
> >>> +     struct list_head list;
> >>> +};
> >>> +
> >>> +static inline struct page *
> >>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> >>> +
> >>> +     return domain->chunks[index].bounce_pages[pgindex];
> >>> +}
> >>> +
> >>> +static inline void
> >>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, struct page *page)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
> >>> +
> >>> +     domain->chunks[index].bounce_pages[pgindex] = page;
> >>> +}
> >>> +
> >>> +static inline struct vduse_iova_map *
> >>> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> >>> +
> >>> +     return domain->chunks[index].iova_map[mapindex];
> >>> +}
> >>> +
> >>> +static inline void
> >>> +vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, struct vduse_iova_map *map)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
> >>> +     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
> >>> +
> >>> +     domain->chunks[index].iova_map[mapindex] = map;
> >>> +}
> >>> +
> >>> +static int
> >>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, size_t size)
> >>> +{
> >>> +     struct page *page;
> >>> +     size_t walk_sz = 0;
> >>> +     int frees = 0;
> >>> +
> >>> +     while (walk_sz < size) {
> >>> +             page = vduse_domain_get_bounce_page(domain, iova);
> >>> +             if (page) {
> >>> +                     vduse_domain_set_bounce_page(domain, iova, NULL);
> >>> +                     put_page(page);
> >>> +                     frees++;
> >>> +             }
> >>> +             iova += PAGE_SIZE;
> >>> +             walk_sz += PAGE_SIZE;
> >>> +     }
> >>> +
> >>> +     return frees;
> >>> +}
> >>> +
> >>> +int vduse_domain_add_vma(struct vduse_iova_domain *domain,
> >>> +                             struct vm_area_struct *vma)
> >>> +{
> >>> +     unsigned long size = vma->vm_end - vma->vm_start;
> >>> +     struct vduse_mmap_vma *mmap_vma;
> >>> +
> >>> +     if (WARN_ON(size != domain->size))
> >>> +             return -EINVAL;
> >>> +
> >>> +     mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> >>> +     if (!mmap_vma)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     mmap_vma->vma = vma;
> >>> +     mutex_lock(&domain->vma_lock);
> >>> +     list_add(&mmap_vma->list, &domain->vma_list);
> >>> +     mutex_unlock(&domain->vma_lock);
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
> >>> +                             struct vm_area_struct *vma)
> >>> +{
> >>> +     struct vduse_mmap_vma *mmap_vma;
> >>> +
> >>> +     mutex_lock(&domain->vma_lock);
> >>> +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> >>> +             if (mmap_vma->vma == vma) {
> >>> +                     list_del(&mmap_vma->list);
> >>> +                     kfree(mmap_vma);
> >>> +                     break;
> >>> +             }
> >>> +     }
> >>> +     mutex_unlock(&domain->vma_lock);
> >>> +}
> >>> +
> >>> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova, unsigned long orig,
> >>> +                             size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     struct vduse_iova_map *map;
> >>> +     unsigned long last = iova + size;
> >>> +
> >>> +     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
> >>> +     if (!map)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     map->iova = iova;
> >>> +     map->orig = orig;
> >>> +     map->size = size;
> >>> +     map->dir = dir;
> >>> +
> >>> +     while (iova < last) {
> >>> +             vduse_domain_set_iova_map(domain, iova, map);
> >>> +             iova += IOVA_ALLOC_SIZE;
> >>> +     }
> >>> +
> >>> +     return 0;
> >>> +}
> >>> +
> >>> +struct vduse_iova_map *
> >>> +vduse_domain_get_mapping(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova)
> >>> +{
> >>> +     return vduse_domain_get_iova_map(domain, iova);
> >>> +}
> >>> +
> >>> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
> >>> +                             struct vduse_iova_map *map)
> >>> +{
> >>> +     unsigned long iova = map->iova;
> >>> +     unsigned long last = iova + map->size;
> >>> +
> >>> +     while (iova < last) {
> >>> +             vduse_domain_set_iova_map(domain, iova, NULL);
> >>> +             iova += IOVA_ALLOC_SIZE;
> >>> +     }
> >>> +}
> >>> +
> >>> +void vduse_domain_unmap(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, size_t size)
> >>> +{
> >>> +     struct vduse_mmap_vma *mmap_vma;
> >>> +     unsigned long uaddr;
> >>> +
> >>> +     mutex_lock(&domain->vma_lock);
> >>> +     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
> >>> +             mmap_read_lock(mmap_vma->vma->vm_mm);
> >>> +             uaddr = iova + mmap_vma->vma->vm_start;
> >>> +             zap_page_range(mmap_vma->vma, uaddr, size);
> >>> +             mmap_read_unlock(mmap_vma->vma->vm_mm);
> >>> +     }
> >>> +     mutex_unlock(&domain->vma_lock);
> >>> +}
> >>> +
> >>> +int vduse_domain_direct_map(struct vduse_iova_domain *domain,
> >>> +                     struct vm_area_struct *vma, unsigned long iova)
> >>> +{
> >>> +     unsigned long uaddr = iova + vma->vm_start;
> >>> +     unsigned long start = iova & PAGE_MASK;
> >>> +     unsigned long last = start + PAGE_SIZE - 1;
> >>> +     unsigned long offset;
> >>> +     struct vduse_iova_map *map;
> >>> +     struct page *page = NULL;
> >>> +
> >>> +     map = vduse_domain_get_iova_map(domain, iova);
> >>> +     if (map) {
> >>> +             offset = last - map->iova;
> >>> +             page = virt_to_page(map->orig + offset);
> >>> +     }
> >>> +
> >>> +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> >>> +}
> >>
> >> So as we discussed before, we need to find way to make vhost work. And
> >> it's better to make vhost transparent to VDUSE. One idea is to implement
> >> shadow virtqueue here, that is, instead of trying to insert the pages to
> >> VDUSE userspace, we use the shadow virtqueue to relay the descriptors to
> >> userspace. With this, we don't need stuffs like shmfd etc.
> >>
> > Good idea! The disadvantage is performance will go down (one more
> > thread switch overhead and vhost-liked kworker will become bottleneck
> > without multi-thread support).
>
>
> Yes, the disadvantage is the performance. But it should be simpler (I
> guess) and we know it can succeed.
>

Yes, another advantage is that we can support the VM using anonymous memory.

>
>
> > I think I can try this in v3. And the
> > MMU-based IOMMU implementation can be a future optimization in the
> > virtio-vdpa case. What's your opinion?
>
>
> Maybe I was wrong, but I think we can try as what has been proposed here
> first and use shadow virtqueue as backup plan if we fail.
>

OK, I will continue to work on this proposal.

>
> >
> >>> +
> >>> +void vduse_domain_bounce(struct vduse_iova_domain *domain,
> >>> +                     unsigned long iova, unsigned long orig,
> >>> +                     size_t size, enum dma_data_direction dir)
> >>> +{
> >>> +     unsigned int offset = offset_in_page(iova);
> >>> +
> >>> +     while (size) {
> >>> +             struct page *p = vduse_domain_get_bounce_page(domain, iova);
> >>> +             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
> >>> +             void *addr;
> >>> +
> >>> +             if (p) {
> >>> +                     addr = page_address(p) + offset;
> >>> +                     if (dir == DMA_TO_DEVICE)
> >>> +                             memcpy(addr, (void *)orig, copy_len);
> >>> +                     else if (dir == DMA_FROM_DEVICE)
> >>> +                             memcpy((void *)orig, addr, copy_len);
> >>> +             }
> >>
> >> I think I miss something, for DMA_FROM_DEVICE, if p doesn't exist how is
> >> it expected to work? Or do we need to warn here in this case?
> >>
> > Yes, I think we need a WARN_ON here.
>
>
> Ok.
>
>
> >
> >
> >>> +             size -= copy_len;
> >>> +             orig += copy_len;
> >>> +             iova += copy_len;
> >>> +             offset = 0;
> >>> +     }
> >>> +}
> >>> +
> >>> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
> >>> +                     struct vm_area_struct *vma, unsigned long iova)
> >>> +{
> >>> +     unsigned long uaddr = iova + vma->vm_start;
> >>> +     unsigned long start = iova & PAGE_MASK;
> >>> +     unsigned long offset = 0;
> >>> +     bool found = false;
> >>> +     struct vduse_iova_map *map;
> >>> +     struct page *page;
> >>> +
> >>> +     mutex_lock(&domain->map_lock);
> >>> +
> >>> +     page = vduse_domain_get_bounce_page(domain, iova);
> >>> +     if (page)
> >>> +             goto unlock;
> >>> +
> >>> +     page = alloc_page(GFP_KERNEL);
> >>> +     if (!page)
> >>> +             goto unlock;
> >>> +
> >>> +     while (offset < PAGE_SIZE) {
> >>> +             unsigned int src_offset = 0, dst_offset = 0;
> >>> +             void *src, *dst;
> >>> +             size_t copy_len;
> >>> +
> >>> +             map = vduse_domain_get_iova_map(domain, start + offset);
> >>> +             if (!map) {
> >>> +                     offset += IOVA_ALLOC_SIZE;
> >>> +                     continue;
> >>> +             }
> >>> +
> >>> +             found = true;
> >>> +             offset += map->size;
> >>> +             if (map->dir == DMA_FROM_DEVICE)
> >>> +                     continue;
> >>> +
> >>> +             if (start > map->iova)
> >>> +                     src_offset = start - map->iova;
> >>> +             else
> >>> +                     dst_offset = map->iova - start;
> >>> +
> >>> +             src = (void *)(map->orig + src_offset);
> >>> +             dst = page_address(page) + dst_offset;
> >>> +             copy_len = min_t(size_t, map->size - src_offset,
> >>> +                             PAGE_SIZE - dst_offset);
> >>> +             memcpy(dst, src, copy_len);
> >>> +     }
> >>> +     if (!found) {
> >>> +             put_page(page);
> >>> +             page = NULL;
> >>> +     }
> >>> +     vduse_domain_set_bounce_page(domain, iova, page);
> >>> +unlock:
> >>> +     mutex_unlock(&domain->map_lock);
> >>> +
> >>> +     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
> >>> +}
> >>> +
> >>> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
> >>> +                             unsigned long iova)
> >>> +{
> >>> +     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
> >>> +     struct vduse_iova_chunk *chunk = &domain->chunks[index];
> >>> +
> >>> +     return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
> >>> +}
> >>> +
> >>> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
> >>> +                                     size_t size, enum iova_map_type type)
> >>> +{
> >>> +     struct vduse_iova_chunk *chunk;
> >>> +     unsigned long iova = 0;
> >>> +     int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
> >>> +     struct genpool_data_align data = { .align = align };
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < domain->chunk_num; i++) {
> >>> +             chunk = &domain->chunks[i];
> >>> +             if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
> >>> +                     atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
> >>> +
> >>> +             if (atomic_read(&chunk->map_type) != type)
> >>> +                     continue;
> >>> +
> >>> +             iova = gen_pool_alloc_algo(chunk->pool, size,
> >>> +                                     gen_pool_first_fit_align, &data);
> >>> +             if (iova)
> >>> +                     break;
> >>> +     }
> >>> +
> >>> +     return iova;
> >>
> >> I wonder why not just reuse the iova domain implements in
> >> driver/iommu/iova.c
> >>
> > The iova domain in driver/iommu/iova.c is only an iova allocator which
> > is implemented by the genpool memory allocator in our case. The other
> > part in our iova domain is chunk management and iova_map management.
> > We need different chunks to distinguish different dma mapping types:
> > consistent mapping or streaming mapping. We can only use
> > bouncing-mechanism in the streaming mapping case.
>
>
> To differ dma mappings, you can use two iova domains with different
> ranges. It looks simpler than the gen_pool. (AFAIK most IOMMU driver is
> using iova domain).
>

OK, I see. Will do it in v3.

Thanks,
Yongji

Powered by blists - more mailing lists