Message-ID: <CACGkMEs0-FiRbUVTbtVyqPh=A8ufdW-Mh1XnR9hf8XQ7_Qf9sg@mail.gmail.com>
Date: Thu, 3 Jul 2025 11:13:54 +0800
From: Jason Wang <jasowang@...hat.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: xuanzhuo@...ux.alibaba.com, eperezma@...hat.com,
virtualization@...ts.linux.dev, linux-kernel@...r.kernel.org,
Jonah Palmer <jonah.palmer@...cle.com>
Subject: Re: [PATCH V3 19/19] virtio_ring: add in order support
On Wed, Jul 2, 2025 at 6:57 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>
> On Wed, Jul 02, 2025 at 05:29:18PM +0800, Jason Wang wrote:
> > On Tue, Jul 1, 2025 at 2:57 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> > >
> > > On Mon, Jun 16, 2025 at 04:25:17PM +0800, Jason Wang wrote:
> > > > This patch implements in order support for both split virtqueue and
> > > > packed virtqueue.
> > >
> > > I'd like to see more motivation for this work, documented.
> > > It's not really performance, not as it stands, see below:
> > >
> > > >
> > > > Benchmark with KVM guest + testpmd on the host shows:
> > > >
> > > > For split virtqueue: no obvious differences were noticed
> > > >
> > > > For packed virtqueue:
> > > >
> > > > 1) RX gets 3.1% PPS improvements from 6.3 Mpps to 6.5 Mpps
> > > > 2) TX gets 4.6% PPS improvements from 8.6 Mpps to 9.0 Mpps
> > > >
> > >
> > > That's a very modest improvement for a lot of code.
> > > I also note you put in some batching just for in-order.
> > > Which could also explain the gains maybe?
> > > What if you just put in a simple implementation with no
> > > batching tricks? do you still see a gain?
> >
> > It is used to implement the batched used-buffer update described by the spec:
> >
> > """
> > Some devices always use descriptors in the same order in which they
> > have been made available. These devices can offer the
> > VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows
> > devices to notify the use of a batch of buffers to the driver by only
> > writing out a single used ring entry with the id corresponding to the
> > head entry of the descriptor chain describing the last buffer in the
> > batch.
> > """
> >
> > DPDK implements this behavior, so it's a must for the virtio driver.
> >
> > > Does any hardware implement this? Maybe that can demonstrate
> > > bigger gains.
> >
> > Maybe, but I don't have one at hand.
> >
> > For performance, I think it should be sufficient as a starter. I can
> > say in the next version something like "more optimizations could be
> > done on top"
>
> What are some optimizations you have in mind?
One thing I have in mind: the spec currently says:
"""
If negotiated, this knowledge allows devices to notify the use of a
batch of buffers to the driver by only writing out a single used
descriptor with the Buffer ID corresponding to the last descriptor in
the batch.
"""
If the device writes the last descriptor ID instead of the Buffer ID
and skips the number of descriptors in the used ring, then for the
split virtqueue the avail ring is not needed anymore: the device knows
the availability of buffers via avail_idx. In this way we completely
eliminate accesses to the available ring, which reduces memory
accesses that are expensive for both:
1) kernel vhost-net, where small userspace memory accesses are expensive
2) hardware PCI transactions
Does this make sense?
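
To make it more concrete, with this scheme the device can derive the
next available descriptor from avail_idx plus a running counter alone.
A minimal sketch (made-up name, assuming one descriptor per buffer for
simplicity):

#include <stdint.h>

/* Sketch only: in-order device side, no avail->ring[] access needed. */
static int dev_next_avail_desc(uint16_t qsize, uint16_t avail_idx,
                               uint16_t *consumed)
{
        if (*consumed == avail_idx)
                return -1;              /* driver has not added anything new */

        /*
         * In order, descriptors are consumed sequentially, so the slot
         * is implied by the running counter and the available ring is
         * never read.
         */
        return (*consumed)++ % qsize;
}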
>
>
>
> > Note that the patch that introduced the packed virtqueue didn't come
> > with any numbers either:
> >
> > commit 1ce9e6055fa0a9043405c5604cf19169ec5379ff
> > Author: Tiwei Bie <tiwei.bie@...el.com>
> > Date: Wed Nov 21 18:03:27 2018 +0800
> >
> > virtio_ring: introduce packed ring support
> >
> > Introduce the packed ring support. Packed ring can only be
> > created by vring_create_virtqueue() and each chunk of packed
> > ring will be allocated individually. Packed ring can not be
> > created on preallocated memory by vring_new_virtqueue() or
> > the likes currently.
> >
> > Signed-off-by: Tiwei Bie <tiwei.bie@...el.com>
> > Signed-off-by: David S. Miller <davem@...emloft.net>
>
>
> I think the assumption there was that intel has hardware that
> requires packed. That's why Dave merged this.
I think it should, according to this QEMU patch:
commit c03213fdc9d7b680cc575cd1e725750702a10b09
Author: Jonah Palmer <jonah.palmer@...cle.com>
Date: Wed Jul 10 08:55:18 2024 -0400
vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
Add support for the VIRTIO_F_IN_ORDER feature across a variety of vhost
devices.
The inclusion of VIRTIO_F_IN_ORDER in the feature bits arrays for these
devices ensures that the backend is capable of offering and providing
support for this feature, and that it can be disabled if the backend
does not support it.
Acked-by: Eugenio Pérez <eperezma@...hat.com>
Signed-off-by: Jonah Palmer <jonah.palmer@...cle.com>
Message-Id: <20240710125522.4168043-6-jonah.palmer@...cle.com>
Reviewed-by: Michael S. Tsirkin <mst@...hat.com>
Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
Adding Jonah for more thought here.
>
> > >
> > >
> > > > Signed-off-by: Jason Wang <jasowang@...hat.com>
> > > > ---
> > > > drivers/virtio/virtio_ring.c | 423 +++++++++++++++++++++++++++++++++--
> > > > 1 file changed, 402 insertions(+), 21 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 27a9459a0555..21d456392ba0 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -70,11 +70,14 @@
> > > > enum vq_layout {
> > > > SPLIT = 0,
> > > > PACKED,
> > > > + SPLIT_IN_ORDER,
> > > > + PACKED_IN_ORDER,
> > > > VQ_TYPE_MAX,
> > > > };
> > > >
> > > > struct vring_desc_state_split {
> > > > void *data; /* Data for callback. */
> > > > + u32 total_len; /* Buffer Length */
> > > >
> > > > /* Indirect desc table and extra table, if any. These two will be
> > > > * allocated together. So we won't stress more to the memory allocator.
> > > > @@ -84,6 +87,7 @@ struct vring_desc_state_split {
> > > >
> > > > struct vring_desc_state_packed {
> > > > void *data; /* Data for callback. */
> > > > + u32 total_len; /* Buffer Length */
> > > >
> > > > /* Indirect desc table and extra table, if any. These two will be
> > > > * allocated together. So we won't stress more to the memory allocator.
> > >
> > > We are bloating up the cache footprint for everyone,
> > > so there's a chance of regressions.
> > > Pls include benchmark for in order off, to make sure we
> > > are not regressing.
> >
> > Ok.
> >
> > > How big was the ring?
> >
> > 256.
>
> that is very modest, you want to fill at least one cache way,
> preferably more.
I can test larger queue sizes.
>
> > > Worth trying with a biggish one, where there is more cache
> > > pressure.
> >
> > Ok.
> >
> > >
> > >
> > > Why not have a separate state for in-order?
> >
> > It can work.
> >
> > >
> > >
> > >
> > > > @@ -206,6 +210,12 @@ struct vring_virtqueue {
> > > >
> > > > /* Head of free buffer list. */
> > > > unsigned int free_head;
> > > > +
> > > > + /* Head of the batched used buffers, vq->num means no batching */
> > > > + unsigned int batch_head;
> > > > +
> > > > + unsigned int batch_len;
> > > > +
> > >
> > > Are these two only used for in-order? Please document that.
> >
> > Yes, I will do that.
> >
> > > I also want some documentation about the batching trickery
> > > used please.
> > > What is batched, when, how is batching flushed, why are we
> > > only batching in-order ...
> >
> > I'm not sure I get this: what you're asking for seems to be the
> > behaviour of the device, which is already stated by the spec, or
> > maybe I'm missing something here.
>
> "a single used ring entry with the id corresponding to the
> head entry of the descriptor chain describing the last buffer in the
> batch"
> ?
Exactly.
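
For illustration, on the device side it boils down to something like
the following (made-up types and names, not DPDK's actual code):

#include <stdint.h>

struct used_elem { uint32_t id; uint32_t len; };
struct used_ring { uint16_t flags; uint16_t idx; struct used_elem ring[]; };

/* Complete a batch of 'batch' in-order buffers with a single used entry. */
static void dev_complete_in_order_batch(struct used_ring *used,
                                        uint16_t qsize, uint16_t old_idx,
                                        uint16_t batch, uint32_t last_head,
                                        uint32_t last_len)
{
        /*
         * id is the head of the descriptor chain describing the LAST
         * buffer in the batch; the driver infers that all the earlier
         * buffers have been used as well.
         */
        used->ring[old_idx % qsize].id = last_head;
        used->ring[old_idx % qsize].len = last_len;

        /* publish: skip forward by the batch size in one index update */
        __atomic_store_n(&used->idx, (uint16_t)(old_idx + batch),
                         __ATOMIC_RELEASE);
}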
>
> so together they form this used ring entry describing the last buffer?
> "head" is the id and "len" the length?
Yes.
>
> maybe
>
> /*
> * With IN_ORDER, devices write a single used ring entry with
> * the id corresponding to the head entry of the descriptor chain
> * describing the last buffer in the batch
> */
> struct used_entry {
> u32 id;
> u32 len;
> } batch_last;
>
> ?
This should be fine.
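
And on the driver side the consumption is roughly like this (a minimal
sketch with made-up names such as buf_state and in_order_get_buf, not
the actual patch code; it also shows what the total_len stored at add
time is presumably for):

#include <stdbool.h>
#include <stdint.h>

struct buf_state { void *data; uint32_t total_len; };

/*
 * Return the buffer for 'head', the next head we expect back in ring
 * order, given the most recently read used entry (used_id/used_len).
 */
static void *in_order_get_buf(struct buf_state *state, uint32_t head,
                              uint32_t used_id, uint32_t used_len,
                              uint32_t *len, bool *batch_done)
{
        /*
         * Only the last buffer of the batch has a device-written length;
         * intermediate ones fall back to the length saved at add time.
         */
        *len = (head == used_id) ? used_len : state[head].total_len;

        /*
         * Once the id carried by the used entry is reached, the batch is
         * flushed and the caller must read a new entry from the ring.
         */
        *batch_done = (head == used_id);

        return state[head].data;
}

The caller keeps the most recently read used entry around (that is
essentially the batch_last you suggest above) and only fetches a new
entry from the used ring once batch_done says the previous batch has
been fully drained.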
>
>
>
>
> > >
> > >
> > >
> > >
> > > > /* Number we've added since last sync. */
> > > > unsigned int num_added;
> > > >
> > > > @@ -256,10 +266,14 @@ static void vring_free(struct virtqueue *_vq);
> > > >
> > > > #define to_vvq(_vq) container_of_const(_vq, struct vring_virtqueue, vq)
> > > >
> > > > -
> > > > static inline bool virtqueue_is_packed(const struct vring_virtqueue *vq)
> > > > {
> > > > - return vq->layout == PACKED;
> > > > + return vq->layout == PACKED || vq->layout == PACKED_IN_ORDER;
> > > > +}
> > > > +
> > > > +static inline bool virtqueue_is_in_order(const struct vring_virtqueue *vq)
> > > > +{
> > > > + return vq->layout == SPLIT_IN_ORDER || vq->layout == PACKED_IN_ORDER;
> > > > }
> > > >
> > > > static bool virtqueue_use_indirect(const struct vring_virtqueue *vq,
> > > > @@ -570,7 +584,7 @@ static inline int virtqueue_add_split(struct vring_virtqueue *vq,
> > > > struct vring_desc_extra *extra;
> > > > struct scatterlist *sg;
> > > > struct vring_desc *desc;
> > > > - unsigned int i, n, c, avail, descs_used, err_idx;
> > > > + unsigned int i, n, c, avail, descs_used, err_idx, total_len = 0;
> > >
> > >
> > > I would add a comment here:
> > >
> > > /* Total length for in-order */
> > > unsigned int total_len = 0;
> >
> > Ok.
> >
> > Thanks
>