linux-kernel - Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20210204065329.GA82484@mtl-vdi-166.wap.labs.mlnx>
Date:   Thu, 4 Feb 2021 08:53:29 +0200
From:   Eli Cohen <elic@...dia.com>
To:     Si-Wei Liu <siwliu.kernel@...il.com>
CC:     Jason Wang <jasowang@...hat.com>, <mst@...hat.com>,
        <virtualization@...ts.linux-foundation.org>,
        <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <lulu@...hat.com>, Si-Wei Liu <si-wei.liu@...cle.com>
Subject: Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

On Wed, Feb 03, 2021 at 03:19:40PM -0800, Si-Wei Liu wrote:
> On Tue, Feb 2, 2021 at 9:16 PM Jason Wang <jasowang@...hat.com> wrote:
> >
> >
> > On 2021/2/3 上午1:54, Si-Wei Liu wrote:
> > > On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen <elic@...dia.com> wrote:
> > >> On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote:
> > >>> Thanks Eli and Jason for clarifications. See inline.
> > >>>
> > >>> On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen <elic@...dia.com> wrote:
> > >>>> On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote:
> > >>>>> On 2021/2/2 下午12:15, Si-Wei Liu wrote:
> > >>>>>> On Mon, Feb 1, 2021 at 7:13 PM Jason Wang <jasowang@...hat.com> wrote:
> > >>>>>>> On 2021/2/2 上午3:17, Si-Wei Liu wrote:
> > >>>>>>>> On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu <siwliu.kernel@...il.com> wrote:
> > >>>>>>>>> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen <elic@...dia.com> wrote:
> > >>>>>>>>>> suspend_vq should only suspend the VQ on not save the current available
> > >>>>>>>>>> index. This is done when a change of map occurs when the driver calls
> > >>>>>>>>>> save_channel_info().
> > >>>>>>>>> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
> > >>>>>>>>> which doesn't save the available index as save_channel_info() doesn't
> > >>>>>>>>> get called in that path at all. How does it handle the case that
> > >>>>>>>>> aget_vq_state() is called from userspace (e.g. QEMU) while the
> > >>>>>>>>> hardware VQ object was torn down, but userspace still wants to access
> > >>>>>>>>> the queue index?
> > >>>>>>>>>
> > >>>>>>>>> Refer to https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei.liu@oracle.com/
> > >>>>>>>>>
> > >>>>>>>>> vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
> > >>>>>>>>> vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
> > >>>>>>>>>
> > >>>>>>>>> QEMU will complain with the above warning while VM is being rebooted
> > >>>>>>>>> or shut down.
> > >>>>>>>>>
> > >>>>>>>>> Looks to me either the kernel driver should cover this requirement, or
> > >>>>>>>>> the userspace has to bear the burden in saving the index and not call
> > >>>>>>>>> into kernel if VQ is destroyed.
> > >>>>>>>> Actually, the userspace doesn't have the insights whether virt queue
> > >>>>>>>> will be destroyed if just changing the device status via set_status().
> > >>>>>>>> Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like
> > >>>>>>>> so. Hence this still looks to me to be Mellanox specifics and
> > >>>>>>>> mlx5_vdpa implementation detail that shouldn't expose to userspace.
> > >>>>>>> So I think we can simply drop this patch?
> > >>>>>> Yep, I think so. To be honest I don't know why it has anything to do
> > >>>>>> with the memory hotplug issue.
> > >>>>>
> > >>>>> Eli may know more, my understanding is that, during memory hotplut, qemu
> > >>>>> need to propagate new memory mappings via set_map(). For mellanox, it means
> > >>>>> it needs to rebuild memory keys, so the virtqueue needs to be suspended.
> > >>>>>
> > >>>> I think Siwei was asking why the first patch was related to the hotplug
> > >>>> issue.
> > >>> I was thinking how consistency is assured when saving/restoring this
> > >>> h/w avail_index against the one in the virtq memory, particularly in
> > >>> the region_add/.region_del() context (e.g. the hotplug case). Problem
> > >>> is I don't see explicit memory barrier when guest thread updates the
> > >>> avail_index, how does the device make sure the h/w avail_index is
> > >>> uptodate while guest may race with updating the virtq's avail_index in
> > >>> the mean while? Maybe I completely miss something obvious?
> > >> DKIM-Signature: v1; arsa-sha256; crelaxed/relaxed; dnvidia.com; sn1;
> > >>          t 12257780; bhHnB0z4VEKwRS3WGY8d836MJgxu5Eln/jbFZlNXVxc08;
> > >>          hX-PGP-Universal:Date:From:To:CC:Subject:Message-ID:References:
> > >>           MIME-Version:Content-Type:Content-Disposition:
> > >>           Content-Transfer-Encoding:In-Reply-To:User-Agent:X-Originating-IP:
> > >>           X-ClientProxiedBy;
> > >>          bgGmb8+rcn3/rKzKQ/7QzSnghWzZ+FAU0XntsRZYGQ66sFvT7zsYPHogG3LIWNY77t
> > >>           wNHPw7GCJrNaH3nEXPbOp0FMOZw4Kv4W7UPuYPobbLeTkvuPAidjB8dM42vz+1X61t
> > >>           9IVQT9X4hnAxRjI5CqZOo41GS4Tl1X+ykGoA+VE80BR/R/+nQ3tXDVULfppzeB+vu3
> > >>           TWnnpaZ2GyoNyPlMiyVRkHdXzDVgA4uQHxwHn7otGK5J4lzyu8KrFyQtiP+f6hfu5v
> > >>           crJkYS8e9A+rfzUmKWuyHcKcmhPhAVJ4XdpzZcDXXlMHVxG7nR1o88xttC6D1+oNIP
> > >>           9xHI3DkNBpEuA
> > >> If you're asking about syncronization upon hot plug of memory, the
> > >> hardware always goes to read the available index from memory when a new
> > >> hardware object is associted with a virtqueue. You can argue then that
> > >> you don't need to restore the available index and you may be right but
> > >> this is the currect inteface to the firmware.
> > >>
> > >>
> > >> If you're asking on generally how sync is assured when the guest updates
> > >> the available index, can you please send a pointer to the code where you
> > >> see the update without a memory barrier?
> > > This is a snippet of virtqueue_add_split() where avail_index gets
> > > updated by guest:
> > >
> > >          /* Put entry in available array (but don't update avail->idx until they
> > >           * do sync). */
> > >          avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >          vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >
> > >          /* Descriptors and available array need to be set before we expose the
> > >           * new available array entries. */
> > >          virtio_wmb(vq->weak_barriers);
> > >          vq->split.avail_idx_shadow++;
> > >          vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >                                                  vq->split.avail_idx_shadow);
> > >          vq->num_added++;
> > >
> > > There's memory barrier to make sure the update to descriptor and
> > > available ring is seen before writing to the avail->idx, but there
> > > seems no gurantee that this update would flush to the memory
> > > immmedately either before or after the mlx5-vdpa is suspened and gets
> > > the old avail_index value stashed somewhere. In this case, how does
> > > the hardware ensure the consistency between the guest virtio and host
> > > mlx5-vdpa? Or, it completly relies on guest to update the avail_index
> > > once the next buffer is available, so that the index will be in sync
> > > again?
> >
> >
> > I'm not sure I get the question but notice that the driver should check
> > and notify virtqueue when device want a notification. So there's a
> > virtio_wmb() e.g in:
> >
> > static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
> > {
> >      struct vring_virtqueue *vq = to_vvq(_vq);
> >      u16 new, old;
> >      bool needs_kick;
> >
> >      START_USE(vq);
> >      /* We need to expose available array entries before checking avail
> >       * event. */
> >      virtio_mb(vq->weak_barries);
> >
> >      old = vq->split.avail_idx_shadow - vq->num_added;
> >      new = vq->split.avail_idx_shadow;
> >      vq->num_added = 0;
> >
> > (See the comment above virtio_mb()). So the avail idx is guaranteed to
> > be committed to memroy(cache hierarchy) before the check of
> > notification. I think we sync through this.
> 
> Thanks for pointing it out! Indeed, the avail index is synced before
> kicking the device. However, even so I think the race might still be
> possible, see below:
> 
> 
>   mlx5_vdpa device                       virtio driver
> -------------------------------------------------------------------------
>                                   virtqueue_add_split
>                                     (bumped up avail_idx1 to
>                                     avail_idx2, but update
>                                     not yet committed to mem)
> 
> (hot plug memory...)
> mlx5_vdpa_set_map
>   mlx5_vdpa_change_map
>     suspend_vqs
>       suspend_vq
>         (avail_idx1 was saved)
>     save_channels_info
>     :
>     :                             virtqueue_kick_prepare_split
>     :                               (avail_idx2 committed to memory)
>     restore_channels_info
>     setup_driver
>       ...
>         create_virtqueue
>           (saved avail_idx1
>           getting restored;
>           whereas current
>           avail_idx2 in
>           memory is ignored)
> :
> :
>                                    virtqueue_notify
>                                      vp_notify
> mlx5_vdpa_kick_vq
>   (device processing up to
>   avail_idx1 rather than
>   avail_idx2?)
> 
> 
> According to Eli, the vdpa firmware does not load the latest value,
> i.e. avail_idx2 from memory when re-creating the virtqueue object.

I said exactly the opposite. The hardware always goes to read the
available index from memory. The requirement to configure it when
creating a new object is still a requirement defined by the interface so
I must not violate interface requirments.

> Instead it starts with a stale avail_idx1 that was saved in mlx5_vdpa
> driver private structure before the memory update is made. That is the
> way how the current firmware interface is designed. The thing I'm not
> sure though is if the firmware/device will resync and get to the
> latest avail_idx2 from memory while processing the kick request with
> mlx5_vdpa_kick_vq? If not, a few avail entries will be missing in the
> kick, which could cause odd behavior or implicit failure.
> 
> But if the firmware will resync on kick_vq (and get_vq_state), I think
> I completely lost the point of saving and restoring avail_idx when
> changing the memory map...
> 
> Thanks,
> -Siwei
> 
> >
> > Thanks
> >
> >
> > >
> > > Thanks,
> > > -Siwei
> > >
> > >>> Thanks,
> > >>> -Siwei
> > >>>
> > >>>> But you're correct. When memory is added, I get a new memory map. This
> > >>>> requires me to build a new memory key object which covers the new memory
> > >>>> map. Since the virtqueue objects are referencing this memory key, I need
> > >>>> to destroy them and build new ones referncing the new memory key.
> > >>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>>
> > >>>>>> -Siwei
> > >>>>>>
> > >>>>>>> Thanks
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>>> -Siwei
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> Signed-off-by: Eli Cohen <elic@...dia.com>
> > >>>>>>>>>> ---
> > >>>>>>>>>>     drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 --------
> > >>>>>>>>>>     1 file changed, 8 deletions(-)
> > >>>>>>>>>>
> > >>>>>>>>>> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > >>>>>>>>>> index 88dde3455bfd..549ded074ff3 100644
> > >>>>>>>>>> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > >>>>>>>>>> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > >>>>>>>>>> @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq)
> > >>>>>>>>>>
> > >>>>>>>>>>     static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq)
> > >>>>>>>>>>     {
> > >>>>>>>>>> -       struct mlx5_virtq_attr attr;
> > >>>>>>>>>> -
> > >>>>>>>>>>            if (!mvq->initialized)
> > >>>>>>>>>>                    return;
> > >>>>>>>>>>
> > >>>>>>>>>> @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m
> > >>>>>>>>>>
> > >>>>>>>>>>            if (modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> > >>>>>>>>>>                    mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
> > >>>>>>>>>> -
> > >>>>>>>>>> -       if (query_virtqueue(ndev, mvq, &attr)) {
> > >>>>>>>>>> -               mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> > >>>>>>>>>> -               return;
> > >>>>>>>>>> -       }
> > >>>>>>>>>> -       mvq->avail_idx = attr.available_index;
> > >>>>>>>>>>     }
> > >>>>>>>>>>
> > >>>>>>>>>>     static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> > >>>>>>>>>> --
> > >>>>>>>>>> 2.29.2
> > >>>>>>>>>>
> >