Message-ID: <CAJaqyWdgGmnaQ3=j-6S9E4mfPSHc872FQZfVna_eX_UnipX2UA@mail.gmail.com>
Date: Mon, 17 Nov 2025 13:15:21 +0100
From: Eugenio Perez Martin <eperezma@...hat.com>
To: Jason Wang <jasowang@...hat.com>
Cc: "Michael S . Tsirkin" <mst@...hat.com>, Laurent Vivier <lvivier@...hat.com>, virtualization@...ts.linux.dev,
Maxime Coquelin <mcoqueli@...hat.com>, Cindy Lu <lulu@...hat.com>, linux-kernel@...r.kernel.org,
Yongji Xie <xieyongji@...edance.com>, Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
Stefano Garzarella <sgarzare@...hat.com>
Subject: Re: [PATCH v9 5/6] vduse: add vq group asid support
On Mon, Nov 17, 2025 at 5:23 AM Jason Wang <jasowang@...hat.com> wrote:
>
> On Fri, Nov 14, 2025 at 7:25 PM Eugenio Perez Martin
> <eperezma@...hat.com> wrote:
> >
> > On Fri, Nov 14, 2025 at 1:55 AM Jason Wang <jasowang@...hat.com> wrote:
> > >
> > > On Thu, Nov 13, 2025 at 7:56 PM Eugenio Pérez <eperezma@...hat.com> wrote:
> > > >
> > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > group. This enables mapping each group into a distinct memory space.
> > > >
> > > > Now that the driver can change the ASID in the middle of operation, the
> > > > domain that each vq address points to is also protected by domain_lock.
> > >
> > > Maybe it's better to document what is protected by RCU and how.
> > >
> >
> > I added the _rcu annotation but I can expand it for sure. I can also
> > modify the commit message.
> >
> > > More below.
> > >
> > > >
> > > > Acked-by: Jason Wang <jasowang@...hat.com>
> >
> > I forgot to remove this, my bad!
> >
> > > > Signed-off-by: Eugenio Pérez <eperezma@...hat.com>
> > > > ---
> > > > v9:
> > > > * Replace mutex with RCU, as the vdpa map_ops can run from atomic
> > > > context.
> > > >
> > > > v8:
> > > > * Revert the mutex to rwlock change, it needs proper profiling to
> > > > justify it.
> > > >
> > > > v7:
> > > > * Take write lock in the error path (Jason).
> > > >
> > > > v6:
> > > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> > > > (MST).
> > > > * Fix struct name not matching in the doc.
> > > >
> > > > v5:
> > > > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> > > > ioctl (Jason).
> > > > * Properly set domain bounce size to divide equally between nas (Jason).
> > > > * Exclude "padding" member from the only >V1 members in
> > > > vduse_dev_request.
> > > >
> > > > v4:
> > > > * Divide each domain bounce size between the device bounce size (Jason).
> > > > * revert unneeded addr = NULL assignment (Jason)
> > > > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> > > > return; } (Jason)
> > > > * Change a bad multiline comment, using @ character instead of * (Jason).
> > > > * Consider config->nas == 0 as a fail (Jason).
> > > >
> > > > v3:
> > > > * Get the vduse domain through the vduse_as in the map functions
> > > > (Jason).
> > > > * Squash with the patch creating the vduse_as struct (Jason).
> > > > * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
> > > > (Jason)
> > > >
> > > > v2:
> > > > * Convert the use of mutex to rwlock.
> > > >
> > > > RFC v3:
> > > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> > > > value to reduce memory consumption, but vqs are already limited to
> > > > that value and userspace VDUSE is able to allocate that many vqs.
> > > > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> > > > VDUSE_IOTLB_GET_INFO.
> > > > * Use of array_index_nospec in VDUSE device ioctls.
> > > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > > * Move the umem mutex to asid struct so there is no contention between
> > > > ASIDs.
> > > >
> > > > RFC v2:
> > > > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > > part of the struct is the same.
> > > > ---
> > > > drivers/vdpa/vdpa_user/vduse_dev.c | 370 ++++++++++++++++++++---------
> > > > include/uapi/linux/vduse.h | 53 ++++-
> > > > 2 files changed, 314 insertions(+), 109 deletions(-)
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > index 97be04f73fbf..ff95ed56f22d 100644
> > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > @@ -11,6 +11,7 @@
> > > > #include "linux/virtio_net.h"
> > > > #include <linux/init.h>
> > > > #include <linux/module.h>
> > > > +#include <linux/rcupdate.h>
> > > > #include <linux/cdev.h>
> > > > #include <linux/device.h>
> > > > #include <linux/eventfd.h>
> > > > @@ -41,6 +42,7 @@
> > > >
> > > > #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > @@ -86,7 +88,14 @@ struct vduse_umem {
> > > > struct mm_struct *mm;
> > > > };
> > > >
> > > > +struct vduse_as {
> > > > + struct vduse_iova_domain *domain;
> > > > + struct vduse_umem *umem;
> > > > + struct mutex mem_lock;
> > > > +};
> > > > +
> > > > struct vduse_vq_group {
> > > > + struct vduse_as *as __rcu;
> > > > struct vduse_dev *dev;
> > > > };
> > > >
> > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > struct vduse_vdpa *vdev;
> > > > struct device *dev;
> > > > struct vduse_virtqueue **vqs;
> > > > - struct vduse_iova_domain *domain;
> > > > + struct vduse_as *as;
> > > > char *name;
> > > > struct mutex lock;
> > > > spinlock_t msg_lock;
> > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > u32 vq_num;
> > > > u32 vq_align;
> > > > u32 ngroups;
> > > > - struct vduse_umem *umem;
> > > > + u32 nas;
> > > > struct vduse_vq_group *groups;
> > > > - struct mutex mem_lock;
> > > > unsigned int bounce_size;
> > > > struct mutex domain_lock;
> > > > };
> > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > return vduse_dev_msg_sync(dev, &msg);
> > > > }
> > > >
> > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > u64 start, u64 last)
> > > > {
> > > > struct vduse_dev_msg msg = { 0 };
> > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > return -EINVAL;
> > > >
> > > > msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > - msg.req.iova.start = start;
> > > > - msg.req.iova.last = last;
> > > > + if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > + msg.req.iova.start = start;
> > > > + msg.req.iova.last = last;
> > > > + } else {
> > > > + msg.req.iova_v2.start = start;
> > > > + msg.req.iova_v2.last = last;
> > > > + msg.req.iova_v2.asid = asid;
> > > > + }
> > > >
> > > > return vduse_dev_msg_sync(dev, &msg);
> > > > }
> > > > @@ -436,14 +450,32 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > return mask;
> > > > }
> > > >
> > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > + unsigned int group, unsigned int asid)
> > > > +{
> > > > + /*
> > > > + * Two concurrent updates to this pointer are valid as they cannot
> > > > + * point to an invalid region. It is ok for them to race as long as
> > > > + * the readers see a consistent state through RCU.
> > > > + */
> > > > + rcu_assign_pointer(dev->groups[group].as, &dev->as[asid]);
> > >
> > > I'd expect at least a synchronize_rcu() here to wait for the read is done?
> > >
> >
> > What's the use? The only thing left here is to return from
> > vduse_set_group_asid_nomsg, and we don't need to wait for readers
> > here, do we?
>
> See below.
>
> >
> > > > +}
> > > > +
> > > > static void vduse_dev_reset(struct vduse_dev *dev)
> > > > {
> > > > int i;
> > > > - struct vduse_iova_domain *domain = dev->domain;
> > > >
> > > > /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > - if (domain && domain->bounce_map)
> > > > - vduse_domain_reset_bounce_map(domain);
> > > > + for (i = 0; i < dev->nas; i++) {
> > > > + struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > +
> > > > + if (domain && domain->bounce_map)
> > > > + vduse_domain_reset_bounce_map(domain);
> > > > + }
> > > > +
> > > > + for (i = 0; i < dev->ngroups; i++)
> > > > + vduse_set_group_asid_nomsg(dev, i, 0);
> > > >
> > > > down_write(&dev->rwsem);
> > > >
> > > > @@ -623,6 +655,29 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > return ret;
> > > > }
> > > >
> > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > + unsigned int asid)
> > > > +{
> > > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > + struct vduse_dev_msg msg = { 0 };
> > > > + int r;
> > > > +
> > > > + if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > + group >= dev->ngroups || asid >= dev->nas)
> > > > + return -EINVAL;
> > > > +
> > > > + msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > + msg.req.vq_group_asid.group = group;
> > > > + msg.req.vq_group_asid.asid = asid;
> > > > +
> > > > + r = vduse_dev_msg_sync(dev, &msg);
> > > > + if (r < 0)
> > > > + return r;
> > > > +
> > > > + vduse_set_group_asid_nomsg(dev, group, asid);
> > > > + return 0;
> > > > +}
> > > > +
> > > > static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > struct vdpa_vq_state *state)
> > > > {
> > > > @@ -794,13 +849,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > int ret;
> > > >
> > > > - ret = vduse_domain_set_map(dev->domain, iotlb);
> > > > + ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> > > > if (ret)
> > > > return ret;
> > > >
> > > > - ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > > + ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> > > > if (ret) {
> > > > - vduse_domain_clear_map(dev->domain, iotlb);
> > > > + vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> > > > return ret;
> > > > }
> > > >
> > > > @@ -843,6 +898,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > > .get_vq_affinity = vduse_vdpa_get_vq_affinity,
> > > > .reset = vduse_vdpa_reset,
> > > > .set_map = vduse_vdpa_set_map,
> > > > + .set_group_asid = vduse_set_group_asid,
> > > > .get_vq_map = vduse_get_vq_map,
> > > > .free = vduse_vdpa_free,
> > > > };
> > > > @@ -852,14 +908,17 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> > > > enum dma_data_direction dir)
> > > > {
> > > > struct vduse_dev *vdev;
> > > > + struct vduse_as *as;
> > > > struct vduse_iova_domain *domain;
> > > >
> > > > if (!token.group)
> > > > return;
> > > >
> > > > vdev = token.group->dev;
> > > > - domain = vdev->domain;
> > > > -
> > > > + rcu_read_lock();
> > > > + as = rcu_dereference(token.group->as);
> > > > + domain = as->domain;
> > > > + rcu_read_unlock();
> > > > vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > >
> > > This is suspicious, at least we should do rcu_read_unlock() after
> > > vduse_domain_sync_single_for_device(), otherwise I don't see how RCU
> > > works.
> > >
> >
> > RCU is protecting that the address space pointer of the vq group is
> > not modified concurrently with the access. Ideally, this should be a
> > full lock, but just making sure that all accesses from the reader are
> > coherent is enough. Userspace should expect nothing if it uses the map
> > and modifies the vq group ASID at the same time anyway, but the kernel
> > needs to be sure that it does not see intermediate states. TBH, we
> > could move to a READ_ONCE / WRITE_ONCE, would that be more clear?
>
> Using READ_ONCE/WRITE_ONCE() needs to make sure the ordering is
> handled correctly.
>
> But I meant what happens if
>
> [cpu0]rcu_read_lock()
> [cpu0]as = rcu_dereference(token.group->as)
> [cpu0]...
> [cpu0]rcu_read_unlock()
> [cpu1]rcu_assign_pointer(token.group->as)
> [cpu0]vduse_domain_sync_single_for_device()
>
That should be fine. What I'm trying to protect here are the iterations
in vduse_domain_sync_single_for_device -> vduse_domain_bounce.

I'm going to inline that function here into
vduse_dev_sync_single_for_device, omitting RCU and some details to make
the point easier:
vduse_dev_sync_single_for_device(union virtio_map token,
                                 dma_addr_t iova, size_t size, ...)
{
        read_lock(&token.group->as->domain);
        while (size) {
                map = token.group->as->domain->bounce_maps[iova];
                sz = min_t(size_t, BOUNCE_MAP_SIZE, size);
                ...
                page = token.group->as->domain->bounce_maps[iova].bounce_page;
                addr = kmap_local_page(page);
                do_bounce(map->orig_phys, addr, sz, dir);
                kunmap_local(addr);
                size -= sz;
                iova += sz;
        }
        read_unlock(&token.group->as->domain);
}
Now, depending on the point where another execution thread changes
token.group->as and how the compiler has chosen to generate the
machine code, the outcome could be:

1) The read lock of one ASID's domain is taken, but the lock that gets
released belongs to a different ASID's domain.

2) We iterate with an iova that is valid for the ASID we started
handling, but not for the new one, so we access an invalid offset in
bounce_maps[iova].

And I guess there are other possible outcomes too.
So I need to make sure that all the accesses to that pointer in
vduse_domain_bounce are coherent. I'm ok with it reading either the
value from before the concurrent call to vduse_set_group_asid_nomsg or
the one from after it, as the lifetimes of all the domains are bound
to the device. But the value cannot change in the middle of the
operation:
vduse_dev_sync_single_for_device(union virtio_map token,
                                 dma_addr_t iova, size_t size, ...)
{
        as = token.group->as;
        // Tell the compiler to never replace "as" by "token.group->as" after this.
        read_lock(&as->domain);
        while (size) {
                map = as->domain->bounce_maps[iova];
                sz = min_t(size_t, BOUNCE_MAP_SIZE, size);
                ...
                page = as->domain->bounce_maps[iova].bounce_page;
                addr = kmap_local_page(page);
                do_bounce(map->orig_phys, addr, sz, dir);
                kunmap_local(addr);
                size -= sz;
                iova += sz;
        }
        read_unlock(&as->domain);
}
That can be done in many ways. The read_lock probably already provides
enough protection, but it is not explicit that it is protecting
token.group->as, and future changes could remove it. To me, RCU is the
clearest way to do it, but even a volatile read (READ_ONCE?) would do.
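
To make that concrete, here is a rough sketch of what the READ_ONCE
variant could look like (assuming the __rcu annotation is dropped from
the as pointer; the names come from the patch, but treat this as
pseudocode rather than a proposal):

static void vduse_dev_sync_single_for_device(union virtio_map token,
                                             dma_addr_t dma_addr,
                                             size_t size,
                                             enum dma_data_direction dir)
{
        struct vduse_as *as;

        if (!token.group)
                return;

        /*
         * Read the as pointer exactly once, so the compiler cannot
         * re-load token.group->as during the bounce iteration and the
         * whole operation sees one coherent address space.
         */
        as = READ_ONCE(token.group->as);
        vduse_domain_sync_single_for_device(as->domain, dma_addr, size,
                                            dir);
}

And vduse_set_group_asid_nomsg() would then use WRITE_ONCE() instead
of rcu_assign_pointer().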
> If this is not an issue, RCU is not a must, but please explain why.
> If this is an issue, we need to fix it.
>
> It's basically a question that
>
> 1) should we need to wait for the DMA to be completed before assigning
> to the new as
I don't think so; it is valid to assign a new as and let the ongoing
operation continue. It is racy and the operation could fail, but the
kernel just returns an error and doesn't access invalid memory or
anything similar.
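
For completeness, if we did want to wait for in-flight readers, it
would look something like this (just a sketch to show where the
synchronize_rcu() you mention would sit, not what the patch does):

static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
                                       unsigned int group,
                                       unsigned int asid)
{
        rcu_assign_pointer(dev->groups[group].as, &dev->as[asid]);
        /* Wait until no reader can still see the old as. */
        synchronize_rcu();
}

But as every vduse_as is only freed together with the device, a reader
that still holds the old pointer stays safe, so the extra wait buys us
nothing here.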
> 2) should we track the set_group_asid() for the group that has pending
> DMA to avoid potential issue
>
No, the group will outlive the operation as it is bound to the device.
> >
> > The function vduse_domain uses its own lock to protect concurrent
> > access to the maps of the ASID itself, as they were needed before
> > implementing ASID already.
>
> Thanks
>
> >
> > > > }
> > > >
> > > > @@ -868,14 +927,17 @@ static void vduse_dev_sync_single_for_cpu(union virtio_map token,
> > > > enum dma_data_direction dir)
> > > > {
> > > > struct vduse_dev *vdev;
> > > > + struct vduse_as *as;
> > > > struct vduse_iova_domain *domain;
> > > >
> > > > if (!token.group)
> > > > return;
> > > >
> > > > vdev = token.group->dev;
> > > > - domain = vdev->domain;
> > > > -
> > > > + rcu_read_lock();
> > > > + as = rcu_dereference(token.group->as);
> > > > + domain = as->domain;
> > > > + rcu_read_unlock();
> > > > vduse_domain_sync_single_for_cpu(domain, dma_addr, size, dir);
> > > > }
> > > >
> > > > @@ -885,15 +947,21 @@ static dma_addr_t vduse_dev_map_page(union virtio_map token, struct page *page,
> > > > unsigned long attrs)
> > > > {
> > > > struct vduse_dev *vdev;
> > > > + struct vduse_as *as;
> > > > struct vduse_iova_domain *domain;
> > > > + dma_addr_t r;
> > > >
> > > > if (!token.group)
> > > > return DMA_MAPPING_ERROR;
> > > >
> > > > vdev = token.group->dev;
> > > > - domain = vdev->domain;
> > > > + rcu_read_lock();
> > > > + as = rcu_dereference(token.group->as);
> > > > + domain = as->domain;
> > > > + rcu_read_unlock();
> > > > + r = vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> > >
> > > Same here.
> > >
> > > Thanks
> > >
> >
>