Message-ID: <CAJaqyWeVCZOyBu2jPEc3iyP1uP=mzs2z7X=4JfYdBS6fLvPdxA@mail.gmail.com>
Date: Wed, 19 Nov 2025 10:26:43 +0100
From: Eugenio Perez Martin <eperezma@...hat.com>
To: Jason Wang <jasowang@...hat.com>
Cc: "Michael S . Tsirkin" <mst@...hat.com>, Laurent Vivier <lvivier@...hat.com>, virtualization@...ts.linux.dev,
Maxime Coquelin <mcoqueli@...hat.com>, Cindy Lu <lulu@...hat.com>, linux-kernel@...r.kernel.org,
Yongji Xie <xieyongji@...edance.com>, Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
Stefano Garzarella <sgarzare@...hat.com>
Subject: Re: [PATCH v9 5/6] vduse: add vq group asid support
On Wed, Nov 19, 2025 at 3:39 AM Jason Wang <jasowang@...hat.com> wrote:
>
> On Mon, Nov 17, 2025 at 8:16 PM Eugenio Perez Martin
> <eperezma@...hat.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 5:23 AM Jason Wang <jasowang@...hat.com> wrote:
> > >
> > > On Fri, Nov 14, 2025 at 7:25 PM Eugenio Perez Martin
> > > <eperezma@...hat.com> wrote:
> > > >
> > > > On Fri, Nov 14, 2025 at 1:55 AM Jason Wang <jasowang@...hat.com> wrote:
> > > > >
> > > > > On Thu, Nov 13, 2025 at 7:56 PM Eugenio Pérez <eperezma@...hat.com> wrote:
> > > > > >
> > > > > > Add support for assigning Address Space Identifiers (ASIDs) to each VQ
> > > > > > group. This enables mapping each group into a distinct memory space.
> > > > > >
> > > > > > Now that the driver can change the ASID in the middle of operation,
> > > > > > the domain that each vq address points to is also protected by
> > > > > > domain_lock.
> > > > >
> > > > > Maybe it's better to document what is protected by RCU and how.
> > > > >
> > > >
> > > > I added the _rcu annotation but I can expand it for sure. I can also
> > > > modify the commit message.
> > > >
> > > > > More below.
> > > > >
> > > > > >
> > > > > > Acked-by: Jason Wang <jasowang@...hat.com>
> > > >
> > > > I forgot to remove this, my bad!
> > > >
> > > > > > Signed-off-by: Eugenio Pérez <eperezma@...hat.com>
> > > > > > ---
> > > > > > v9:
> > > > > > * Replace mutex with RCU, as the vdpa map_ops can run from atomic
> > > > > > context.
> > > > > >
> > > > > > v8:
> > > > > > * Revert the mutex to rwlock change, it needs proper profiling to
> > > > > > justify it.
> > > > > >
> > > > > > v7:
> > > > > > * Take write lock in the error path (Jason).
> > > > > >
> > > > > > v6:
> > > > > > * Make vdpa_dev_add use gotos for error handling (MST).
> > > > > > * s/(dev->api_version < 1) ?/(dev->api_version < VDUSE_API_VERSION_1) ?/
> > > > > > (MST).
> > > > > > * Fix struct name not matching in the doc.
> > > > > >
> > > > > > v5:
> > > > > > * Properly return errno if copy_to_user returns >0 in VDUSE_IOTLB_GET_FD
> > > > > > ioctl (Jason).
> > > > > > * Properly set domain bounce size to divide equally between nas (Jason).
> > > > > > * Exclude "padding" member from the only >V1 members in
> > > > > > vduse_dev_request.
> > > > > >
> > > > > > v4:
> > > > > > * Divide each domain bounce size between the device bounce size (Jason).
> > > > > > * revert unneeded addr = NULL assignment (Jason)
> > > > > > * Change if (x && (y || z)) return to if (x) { if (y) return; if (z)
> > > > > > return; } (Jason)
> > > > > > * Change a bad multiline comment, using @ character instead of * (Jason).
> > > > > > * Consider config->nas == 0 as a fail (Jason).
> > > > > >
> > > > > > v3:
> > > > > > * Get the vduse domain through the vduse_as in the map functions
> > > > > > (Jason).
> > > > > > * Squash with the patch creating the vduse_as struct (Jason).
> > > > > > * Create VDUSE_DEV_MAX_AS instead of comparing against a magic number
> > > > > > (Jason)
> > > > > >
> > > > > > v2:
> > > > > > * Convert the use of mutex to rwlock.
> > > > > >
> > > > > > RFC v3:
> > > > > > * Increase VDUSE_MAX_VQ_GROUPS to 0xffff (Jason). It was set to a lower
> > > > > > value to reduce memory consumption, but vqs are already limited to
> > > > > > that value and userspace VDUSE is able to allocate that many vqs.
> > > > > > * Remove TODO about merging VDUSE_IOTLB_GET_FD ioctl with
> > > > > > VDUSE_IOTLB_GET_INFO.
> > > > > > * Use of array_index_nospec in VDUSE device ioctls.
> > > > > > * Embed vduse_iotlb_entry into vduse_iotlb_entry_v2.
> > > > > > * Move the umem mutex to asid struct so there is no contention between
> > > > > > ASIDs.
> > > > > >
> > > > > > RFC v2:
> > > > > > * Make iotlb entry the last one of vduse_iotlb_entry_v2 so the first
> > > > > > part of the struct is the same.
> > > > > > ---
> > > > > > drivers/vdpa/vdpa_user/vduse_dev.c | 370 ++++++++++++++++++++---------
> > > > > > include/uapi/linux/vduse.h | 53 ++++-
> > > > > > 2 files changed, 314 insertions(+), 109 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > index 97be04f73fbf..ff95ed56f22d 100644
> > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > @@ -11,6 +11,7 @@
> > > > > > #include "linux/virtio_net.h"
> > > > > > #include <linux/init.h>
> > > > > > #include <linux/module.h>
> > > > > > +#include <linux/rcupdate.h>
> > > > > > #include <linux/cdev.h>
> > > > > > #include <linux/device.h>
> > > > > > #include <linux/eventfd.h>
> > > > > > @@ -41,6 +42,7 @@
> > > > > >
> > > > > > #define VDUSE_DEV_MAX (1U << MINORBITS)
> > > > > > #define VDUSE_DEV_MAX_GROUPS 0xffff
> > > > > > +#define VDUSE_DEV_MAX_AS 0xffff
> > > > > > #define VDUSE_MAX_BOUNCE_SIZE (1024 * 1024 * 1024)
> > > > > > #define VDUSE_MIN_BOUNCE_SIZE (1024 * 1024)
> > > > > > #define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
> > > > > > @@ -86,7 +88,14 @@ struct vduse_umem {
> > > > > > struct mm_struct *mm;
> > > > > > };
> > > > > >
> > > > > > +struct vduse_as {
> > > > > > + struct vduse_iova_domain *domain;
> > > > > > + struct vduse_umem *umem;
> > > > > > + struct mutex mem_lock;
> > > > > > +};
> > > > > > +
> > > > > > struct vduse_vq_group {
> > > > > > + struct vduse_as *as __rcu;
> > > > > > struct vduse_dev *dev;
> > > > > > };
> > > > > >
> > > > > > @@ -94,7 +103,7 @@ struct vduse_dev {
> > > > > > struct vduse_vdpa *vdev;
> > > > > > struct device *dev;
> > > > > > struct vduse_virtqueue **vqs;
> > > > > > - struct vduse_iova_domain *domain;
> > > > > > + struct vduse_as *as;
> > > > > > char *name;
> > > > > > struct mutex lock;
> > > > > > spinlock_t msg_lock;
> > > > > > @@ -122,9 +131,8 @@ struct vduse_dev {
> > > > > > u32 vq_num;
> > > > > > u32 vq_align;
> > > > > > u32 ngroups;
> > > > > > - struct vduse_umem *umem;
> > > > > > + u32 nas;
> > > > > > struct vduse_vq_group *groups;
> > > > > > - struct mutex mem_lock;
> > > > > > unsigned int bounce_size;
> > > > > > struct mutex domain_lock;
> > > > > > };
> > > > > > @@ -314,7 +322,7 @@ static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> > > > > > return vduse_dev_msg_sync(dev, &msg);
> > > > > > }
> > > > > >
> > > > > > -static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > +static int vduse_dev_update_iotlb(struct vduse_dev *dev, u32 asid,
> > > > > > u64 start, u64 last)
> > > > > > {
> > > > > > struct vduse_dev_msg msg = { 0 };
> > > > > > @@ -323,8 +331,14 @@ static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> > > > > > return -EINVAL;
> > > > > >
> > > > > > msg.req.type = VDUSE_UPDATE_IOTLB;
> > > > > > - msg.req.iova.start = start;
> > > > > > - msg.req.iova.last = last;
> > > > > > + if (dev->api_version < VDUSE_API_VERSION_1) {
> > > > > > + msg.req.iova.start = start;
> > > > > > + msg.req.iova.last = last;
> > > > > > + } else {
> > > > > > + msg.req.iova_v2.start = start;
> > > > > > + msg.req.iova_v2.last = last;
> > > > > > + msg.req.iova_v2.asid = asid;
> > > > > > + }
> > > > > >
> > > > > > return vduse_dev_msg_sync(dev, &msg);
> > > > > > }
> > > > > > @@ -436,14 +450,32 @@ static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> > > > > > return mask;
> > > > > > }
> > > > > >
> > > > > > +/* Force set the asid to a vq group without a message to the VDUSE device */
> > > > > > +static void vduse_set_group_asid_nomsg(struct vduse_dev *dev,
> > > > > > + unsigned int group, unsigned int asid)
> > > > > > +{
> > > > > > + /*
> > > > > > + * Two concurrent updates to this pointer are valid as they cannot
> > > > > > + * point to an invalid region. It is ok for them to race as long as
> > > > > > + * the readers see a consistent state through RCU.
> > > > > > + */
> > > > > > + rcu_assign_pointer(dev->groups[group].as, &dev->as[asid]);
> > > > >
> > > > > I'd expect at least a synchronize_rcu() here to wait for the read is done?
> > > > >
> > > >
> > > > What's the use? The only thing left here is to return from
> > > > vduse_set_group_asid_nomsg, and we don't need to wait for readers
> > > > here, do we?
> > >
> > > See below.
> > >
> > > >
> > > > > > +}
> > > > > > +
> > > > > > static void vduse_dev_reset(struct vduse_dev *dev)
> > > > > > {
> > > > > > int i;
> > > > > > - struct vduse_iova_domain *domain = dev->domain;
> > > > > >
> > > > > > /* The coherent mappings are handled in vduse_dev_free_coherent() */
> > > > > > - if (domain && domain->bounce_map)
> > > > > > - vduse_domain_reset_bounce_map(domain);
> > > > > > + for (i = 0; i < dev->nas; i++) {
> > > > > > + struct vduse_iova_domain *domain = dev->as[i].domain;
> > > > > > +
> > > > > > + if (domain && domain->bounce_map)
> > > > > > + vduse_domain_reset_bounce_map(domain);
> > > > > > + }
> > > > > > +
> > > > > > + for (i = 0; i < dev->ngroups; i++)
> > > > > > + vduse_set_group_asid_nomsg(dev, i, 0);
> > > > > >
> > > > > > down_write(&dev->rwsem);
> > > > > >
> > > > > > @@ -623,6 +655,29 @@ static union virtio_map vduse_get_vq_map(struct vdpa_device *vdpa, u16 idx)
> > > > > > return ret;
> > > > > > }
> > > > > >
> > > > > > +static int vduse_set_group_asid(struct vdpa_device *vdpa, unsigned int group,
> > > > > > + unsigned int asid)
> > > > > > +{
> > > > > > + struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > + struct vduse_dev_msg msg = { 0 };
> > > > > > + int r;
> > > > > > +
> > > > > > + if (dev->api_version < VDUSE_API_VERSION_1 ||
> > > > > > + group >= dev->ngroups || asid >= dev->nas)
> > > > > > + return -EINVAL;
> > > > > > +
> > > > > > + msg.req.type = VDUSE_SET_VQ_GROUP_ASID;
> > > > > > + msg.req.vq_group_asid.group = group;
> > > > > > + msg.req.vq_group_asid.asid = asid;
> > > > > > +
> > > > > > + r = vduse_dev_msg_sync(dev, &msg);
> > > > > > + if (r < 0)
> > > > > > + return r;
> > > > > > +
> > > > > > + vduse_set_group_asid_nomsg(dev, group, asid);
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > > > > static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> > > > > > struct vdpa_vq_state *state)
> > > > > > {
> > > > > > @@ -794,13 +849,13 @@ static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> > > > > > struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> > > > > > int ret;
> > > > > >
> > > > > > - ret = vduse_domain_set_map(dev->domain, iotlb);
> > > > > > + ret = vduse_domain_set_map(dev->as[asid].domain, iotlb);
> > > > > > if (ret)
> > > > > > return ret;
> > > > > >
> > > > > > - ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> > > > > > + ret = vduse_dev_update_iotlb(dev, asid, 0ULL, ULLONG_MAX);
> > > > > > if (ret) {
> > > > > > - vduse_domain_clear_map(dev->domain, iotlb);
> > > > > > + vduse_domain_clear_map(dev->as[asid].domain, iotlb);
> > > > > > return ret;
> > > > > > }
> > > > > >
> > > > > > @@ -843,6 +898,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> > > > > > .get_vq_affinity = vduse_vdpa_get_vq_affinity,
> > > > > > .reset = vduse_vdpa_reset,
> > > > > > .set_map = vduse_vdpa_set_map,
> > > > > > + .set_group_asid = vduse_set_group_asid,
> > > > > > .get_vq_map = vduse_get_vq_map,
> > > > > > .free = vduse_vdpa_free,
> > > > > > };
> > > > > > @@ -852,14 +908,17 @@ static void vduse_dev_sync_single_for_device(union virtio_map token,
> > > > > > enum dma_data_direction dir)
> > > > > > {
> > > > > > struct vduse_dev *vdev;
> > > > > > + struct vduse_as *as;
> > > > > > struct vduse_iova_domain *domain;
> > > > > >
> > > > > > if (!token.group)
> > > > > > return;
> > > > > >
> > > > > > vdev = token.group->dev;
> > > > > > - domain = vdev->domain;
> > > > > > -
> > > > > > + rcu_read_lock();
> > > > > > + as = rcu_dereference(token.group->as);
> > > > > > + domain = as->domain;
> > > > > > + rcu_read_unlock();
> > > > > > vduse_domain_sync_single_for_device(domain, dma_addr, size, dir);
> > > > >
> > > > > This is suspicious, at least we should do rcu_read_unlock() after
> > > > > vduse_domain_sync_single_for_device(), otherwise I don't see how RCU
> > > > > works.
> > > > >
> > > >
> > > > RCU is protecting that the address space pointer of the vq group is
> > > > not modified concurrently with the access. Ideally, this should be a
> > > > full lock, but just making sure that all accesses from the reader are
> > > > coherent is enough. Userspace should expect nothing if it uses the map
> > > > and modifies the vq group ASID at the same time anyway, but the kernel
> > > > needs to be sure that it does not see intermediate states. TBH, we
> > > > could move to a READ_ONCE / WRITE_ONCE, would that be more clear?
> > >
> > > Using READ_ONCE/WRITE_ONCE() needs to make sure the ordering is
> > > handled correctly.
> > >
> > > But I meant what happens if
> > >
> > > [cpu0]rcu_read_lock()
> > > [cpu0]as = rcu_dereference(token.group->as)
> > > [cpu0]...
> > > [cpu0]rcu_read_unlock()
> > > [cpu1]rcu_assign_pointer(token.group->as)
> > > [cpu0]vduse_domain_sync_single_for_device()
> > >
> >
> > That should go ok. What I'm trying to protect here are the iterations
> > in vduse_domain_sync_single_for_device -> vduse_domain_bounce.
> >
> > I'm going to embed that function here in
> > vduse_dev_sync_single_for_device and omit RCU and some details to make
> > the point easier:
> >
> > vduse_dev_sync_single_for_device(union virtio_map token, dma_addr_t
> > iova, size_t size, ...) {
> > read_lock(&token.group->as->domain);
> > while (size)
> > map = token.group->as->domain->bounce_maps[iova];
> > sz = min_t(size_t, BOUNCE_MAP_SIZE, size);
> >
> > ...
> > page = token.group->as->domain->bounce_maps;
> > addr = kmap_local_page(page);
> > do_bounce(map->orig_phys, addr, sz, dir);
> > kunmap_local(addr);
> > size -= sz;
> > iova += sz;
> > }
> > read_unlock(&token.group->as->domain);
> > }
>
> Right, so I meant for rwlock like semantic (let's forget the sleeping here).
>
> vduse_set_group_asid_nomsg() should use "write lock" so it must wait
> for the "read lock" to be done.
No, it doesn't need to wait, as long as the reader side uses its own copy.
> But this is not the logic that is
> implemented in this patch as there's no synchronize_rcu() in the
> vduse_set_group_asid_nomsg().
We only set the pointer on the writer's side; we don't do anything like
freeing resources. Should we set the pointer before or after
synchronize_rcu()? What do we need to do on the other side of
synchronize_rcu()?
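
To make it explicit, the classic pattern that needs synchronize_rcu()
reclaims something afterwards. A sketch, with hypothetical old_as /
new_as names (not the patch itself):

    old_as = rcu_dereference_protected(group->as, ...);
    rcu_assign_pointer(group->as, new_as);
    synchronize_rcu();   /* wait for readers still using old_as... */
    kfree(old_as);       /* ...so this reclaim is safe */

Here every vduse_as lives as long as the device, so there is nothing
to do after synchronize_rcu() returns.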
> We need to explain why set_group_asid()
> doesn't need to wait and if this is true, we probably don't need RCU
> but to make sure the load/store is atomic.
>
What about:
* It does not matter if another thread modifies group->as as long as
the reader uses the same as for all of its operations. It makes a
local copy for that reason.
* It does not matter if multiple threads modify group->as as long as
the update is atomic.
?
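
To make those two points concrete, a minimal sketch of the READ_ONCE /
WRITE_ONCE alternative (illustration only, reusing the names from this
patch):

    /* writer: concurrent updates are fine, both targets are valid */
    WRITE_ONCE(dev->groups[group].as, &dev->as[asid]);

    /* reader: take one local copy and use it for the whole operation */
    struct vduse_as *as = READ_ONCE(token.group->as);

    vduse_domain_sync_single_for_device(as->domain, dma_addr, size, dir);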
> >
> > Now, depending on the point where another execution thread changes
> > token_group->as and how the compiler has chosen to generate the
> > machine code, the outcome could be:
> > 1) The domain read lock of one ASID is taken but the domain lock of
> > another as is unlocked.
> > 2) We iterate until iova is ok for the ASID we're handling, but not
> > for the other one. So we access an invalid offset in
> > bounce_maps[iova].
> >
> > And I guess there are other possible outcomes too.
> >
> > So I need to make sure that the pointer accesses in all
> > vduse_domain_bounce is coherent.
>
> I'm not sure I got here, but it looks like it accepts a domain
> parameter and is protected by the bounce lock so we are probably fine
> here?
>
The bounce lock only protects the iotlb tree, not the pointer to that
iotlb tree.
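
Sketching the two levels (abbreviated; assuming the domain's
bounce_lock here, just for illustration):

    as = rcu_dereference(group->as);      /* which iotlb tree we use */
    read_lock(&as->domain->bounce_lock);  /* the tree's own contents */
    /* ... walk the tree / bounce pages ... */
    read_unlock(&as->domain->bounce_lock);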
> > I'm ok if it takes the one before the
> > concurrent call to vduse_set_group_asid_nomsg or the one after that,
> > as the lifetime of all domains are bound to the device. But it cannot
> > change in the middle of the operation:
> >
> > vduse_dev_sync_single_for_device(union virtio_map token, dma_addr_t
> > iova, size_t size, ...) {
> > as = token.group->as;
> > // Tell the compiler to never replace "as" by "token.group->as" after this.
> > read_lock(&as->domain);
> > while (size)
> > map = as->domain->bounce_maps[iova];
> > sz = min_t(size_t, BOUNCE_MAP_SIZE, size);
> >
> > ...
> > page = as->domain->bounce_maps;
> > addr = kmap_local_page(page);
> > do_bounce(map->orig_phys, addr, sz, dir);
> > kunmap_local(addr);
> > size -= sz;
> > iova += sz;
> > }
> > read_unlock(&as->domain);
> > }
> >
> > That can be done in many ways. Probably the read_lock is already
> > enough but it is not explicit that it is protecting token.group->as,
> > and future changes could remove it. To me, RCU is the most clear way
> > to do it, but even a volatile read (READ_ONCE?) would do.
>
> I wonder if another group rwlock is sufficient here:
>
> for set_group_as_id()
>
> write_lock(&dev->groups[group].lock);
> dev->groups[group].as = &dev->as[asid];
> write_unlock(&dev->groups[group].lock);
>
> for the case where we need defer as
>
> read_lock(&dev->groups[group].lock);
> as = dev->groups[group].as;
> //using as
> read_unlock(&dev->groups[group].lock);
>
> If this works, we don't need to bother with thinking if the
> wait/synchronizre_rcu() is really needed or not?
>
A rwlock is sufficient, but we need to modify the allocation code
somehow. Also, I thought we wanted to avoid the overhead of taking the
read lock in the DMA ops too.

Another disadvantage of the lock vs RCU or READ_ONCE is that the vq
group ASID change needs to wait for in-flight DMA operations to finish
instead of just applying to the next DMA ops. Not that the vq group
ASID change would be in the hot path anyway, just pointing it out.
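
A rough interleaving to show the difference (hypothetical):

rwlock:
[cpu0] read_lock(&group->lock)
[cpu1] write_lock(&group->lock) <- blocks until cpu0 unlocks
[cpu0] ...DMA op...
[cpu0] read_unlock(&group->lock)
[cpu1] group->as = &dev->as[asid]

RCU / READ_ONCE:
[cpu0] as = READ_ONCE(group->as)
[cpu1] WRITE_ONCE(group->as, &dev->as[asid]) <- returns immediately
[cpu0] ...DMA op using the old as...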
> >
> > > If this is not an issue, RCU is not a must, but please explain why.
> > > If this is an issue, we need to fix it.
> > >
> > > It's basically a question that
> > >
> > > 1) should we need to wait for the DMA to be completed before assigning
> > > to the new as
> >
> > I don't think so, it is valid to assign a new as and let the ongoing
> > operation to continue. It is racy and the operation could fail, but
> > the kernel just returns an error and doesn't access invalid memory or
> > similar.
>
> See below.
>
> >
> > > 2) should we track the set_group_asid() for the group that has pending
> > > DMA to avoid potential issue
> > >
> >
> > No, the group will outlive the operation as it is bound to the device.
>
> I meant e.g the DMA could be triggered by the device. For example, the
> device may try to trigger an interrupt when the kernel is trying to
> assign a new asid. So I wonder if guest can use this to poke Qemu's
> memory etc.
I'm not sure I get this point. If QEMU changes the ASID of a vq group
exposed to the guest, the race does not matter anymore: it is
explicitly opening the possibility for the guest to poke QEMU's
memory, unless the guest is totally paused.
> But if you mean we depend on the IOTLB to guard against
> this, I'm fine, but let's document why we don't need it and how the
> IOTLB layer can help to eliminate such risk.
>
No, I'm not happy about letting the iotlb lock protect this too, as
they're actually at different levels: one protects modifications of
the iotlb trees while they're being read, and the other protects the
ASID assignment to the different vq groups. Reusing the former would
mean, for example, that a modification in any tree blocks the change
of a vq group's ASID.
> Anyhow, tracking and failing seems to be more robust.
I'm not sure I get this. If a DMA read starts in one ASID and then
QEMU changes the ASID of the vq group, do you prefer it to fail rather
than continue reading from the original ASID? It seems hard to
communicate to the DMA operation callback that the ASID has changed.
> For example,
> kernel swiotlb has a layer to track whether the address that is being
> unmapped is mapped before. In our case we should make sure when race
> happens, isolation via ASID won't be broken.
>
Right, but it cannot protect us if the iotlb entry is valid in the
second ASID but just points to a different physical / virtual page.