Message-ID: <9c254fa8-7bb9-4484-8546-5f0a469619f9@oracle.com>
Date: Thu, 3 Jul 2025 09:49:44 -0400
From: Jonah Palmer <jonah.palmer@...cle.com>
To: Jason Wang <jasowang@...hat.com>, "Michael S. Tsirkin" <mst@...hat.com>
Cc: xuanzhuo@...ux.alibaba.com, eperezma@...hat.com,
        virtualization@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V3 19/19] virtio_ring: add in order support



On 7/2/25 11:13 PM, Jason Wang wrote:
> On Wed, Jul 2, 2025 at 6:57 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>>
>> On Wed, Jul 02, 2025 at 05:29:18PM +0800, Jason Wang wrote:
>>> On Tue, Jul 1, 2025 at 2:57 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>>>>
>>>> On Mon, Jun 16, 2025 at 04:25:17PM +0800, Jason Wang wrote:
>>>>> This patch implements in order support for both split virtqueue and
>>>>> packed virtqueue.
>>>>
>>>> I'd like to see more motivation for this work, documented.
>>>> It's not really performance, not as it stands, see below:
>>>>
>>>>>
>>>>> Benchmark with KVM guest + testpmd on the host shows:
>>>>>
>>>>> For split virtqueue: no obvious differences were noticed
>>>>>
>>>>> For packed virtqueue:
>>>>>
>>>>> 1) RX gets 3.1% PPS improvements from 6.3 Mpps to 6.5 Mpps
>>>>> 2) TX gets 4.6% PPS improvements from 8.6 Mpps to 9.0 Mpps
>>>>>
>>>>
>>>> That's a very modest improvement for a lot of code.
>>>> I also note you put in some batching just for in-order,
>>>> which could also explain the gains. What if you put in a
>>>> simple implementation with no batching tricks? Do you
>>>> still see a gain?
>>>
>>> It is used to implement batched updates of the used ring.
>>>
>>> """
>>> Some devices always use descriptors in the same order in which they
>>> have been made available. These devices can offer the
>>> VIRTIO_F_IN_ORDER feature. If negotiated, this knowledge allows
>>> devices to notify the use of a batch of buffers to the driver by only
>>> writing out a single used ring entry with the id corresponding to the
>>> head entry of the descriptor chain describing the last buffer in the
>>> batch.
>>> """
>>>
>>> DPDK implements this behavior, so it's a must for the virtio driver.
>>>
>>>> Does any hardware implement this? Maybe that can demonstrate
>>>> bigger gains.
>>>
>>> Maybe but I don't have one in my hand.
>>>
>>> For performance, I think it should be sufficient as a starter. I can
>>> say in the next version something like "more optimizations could be
>>> done on top"
>>
>> What are some optimizations you have in mind?
> 
> One thing on my mind: the spec currently says:
> 
> """
>   If negotiated, this knowledge allows devices to notify the use of a
> batch of buffers to the driver by only writing out a single used
> descriptor with the Buffer ID corresponding to the last descriptor in
> the batch.
> """
> 
> If the device writes the last descriptor ID instead of the buffer ID,
> the driver can skip that number of descriptors in the used ring. For
> the split virtqueue, the avail ring is then not needed anymore: the
> device learns the availability of buffers via avail_idx. In this way,
> we completely eliminate accesses to the available ring. This reduces
> memory accesses, which are expensive for both:
> 
> 1) kernel vhost-net where small user space memory access is expensive
> 2) hardware PCI transactions
> 
> Does this make sense?
> 
>>
>>
>>
>>> Note that the patch that introduced the packed virtqueue didn't even
>>> include any numbers:
>>>
>>> commit 1ce9e6055fa0a9043405c5604cf19169ec5379ff
>>> Author: Tiwei Bie <tiwei.bie@...el.com>
>>> Date:   Wed Nov 21 18:03:27 2018 +0800
>>>
>>>      virtio_ring: introduce packed ring support
>>>
>>>      Introduce the packed ring support. Packed ring can only be
>>>      created by vring_create_virtqueue() and each chunk of packed
>>>      ring will be allocated individually. Packed ring can not be
>>>      created on preallocated memory by vring_new_virtqueue() or
>>>      the likes currently.
>>>
>>>      Signed-off-by: Tiwei Bie <tiwei.bie@...el.com>
>>>      Signed-off-by: David S. Miller <davem@...emloft.net>
>>
>>
>> I think the assumption there was that Intel has hardware that
>> requires packed. That's why Dave merged this.
> 
> I think it should, according to this QEMU patch:
> 
> commit c03213fdc9d7b680cc575cd1e725750702a10b09
> Author: Jonah Palmer <jonah.palmer@...cle.com>
> Date:   Wed Jul 10 08:55:18 2024 -0400
> 
>      vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
> 
>      Add support for the VIRTIO_F_IN_ORDER feature across a variety of vhost
>      devices.
> 
>      The inclusion of VIRTIO_F_IN_ORDER in the feature bits arrays for these
>      devices ensures that the backend is capable of offering and providing
>      support for this feature, and that it can be disabled if the backend
>      does not support it.
> 
>      Acked-by: Eugenio Pérez <eperezma@...hat.com>
>      Signed-off-by: Jonah Palmer <jonah.palmer@...cle.com>
>      Message-Id: <20240710125522.4168043-6-jonah.palmer@...cle.com>
>      Reviewed-by: Michael S. Tsirkin <mst@...hat.com>
>      Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
> 
> Adding Jonah for more thought here.
> 

Hi. By "it should", are you referring to Intel having hardware requiring 
this feature? Sorry, just having a bit of trouble following.

In any case, this looks like a good first implementation that can serve
as a foundation for future work to further improve performance.

This was the thought process I had when I implemented this feature in
QEMU. That is, get a solid framework down that supports this feature
(which showed modest performance improvements for some devices) and
then add future patches (perhaps device-specific and/or general
implementations) that take advantage of the fact that data follows this
FIFO model.
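
As a concrete example of the driver-side gain from that FIFO guarantee,
here is a rough sketch (invented names, not the code from this patch) of
how a single used entry, whose id is the head of the chain describing
the batch's last buffer, can complete a whole run of outstanding
buffers:

#include <stdint.h>

#define VQ_NUM 256			/* ring size, as in the benchmarks */

/* Hypothetical per-buffer state, loosely mirroring the patch's
 * vring_desc_state_* with its new total_len field. */
struct toy_buf_state {
	void *data;
	uint32_t total_len;	/* length recorded when the buffer was added */
	uint16_t next_head;	/* head of the next buffer made available */
};

struct toy_vq {
	struct toy_buf_state state[VQ_NUM];
	uint16_t first_outstanding;	/* oldest in-flight head */
};

/* Stand-in for handing a finished buffer to the upper layer. */
static void toy_complete(struct toy_vq *vq, uint16_t head, uint32_t len)
{
	(void)vq; (void)head; (void)len;
}

/* Everything older than used_id must already be used, by IN_ORDER.
 * Only the last buffer's used length is reported, so earlier buffers
 * fall back to the length recorded at add time. */
static void toy_complete_batch(struct toy_vq *vq, uint16_t used_id,
			       uint32_t used_len)
{
	uint16_t head = vq->first_outstanding;

	for (;;) {
		uint32_t len = (head == used_id) ? used_len
						 : vq->state[head].total_len;

		toy_complete(vq, head, len);
		if (head == used_id)
			break;
		head = vq->state[head].next_head;
	}
	vq->first_outstanding = vq->state[used_id].next_head;
}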

As Jason mentioned, one such future implementation could remove the need
for the avail ring in the split VQ case, since the device wouldn't need
to read it to learn which descriptor comes next.
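
To illustrate that: here is a hypothetical device-side sketch (invented
names, not actual vhost code; endianness and indirect descriptors
elided) where avail->idx alone says how many buffers are pending and
the avail ring entries are never loaded:

#include <stdint.h>
#include <linux/virtio_ring.h>

/* Hypothetical device-side view of an IN_ORDER split virtqueue. */
struct toy_dev_vq {
	struct vring_desc *desc;	/* descriptor table */
	uint16_t num;			/* ring size (power of two) */
	uint16_t next_head;		/* next head expected, in ring order */
	uint16_t last_avail_idx;	/* avail->idx already accounted for */
};

/* The index alone tells us how many new buffers exist. */
static uint16_t toy_pending(struct toy_dev_vq *vq, uint16_t avail_idx)
{
	uint16_t n = avail_idx - vq->last_avail_idx;

	vq->last_avail_idx = avail_idx;
	return n;
}

/* Heads arrive in ring order under IN_ORDER, so the next head is found
 * by skipping past the previous chain in the descriptor table itself;
 * no avail->ring[] access at all. */
static uint16_t toy_next_head(struct toy_dev_vq *vq)
{
	uint16_t head = vq->next_head, i = head;

	while (vq->desc[i].flags & VRING_DESC_F_NEXT)
		i = vq->desc[i].next;
	vq->next_head = (i + 1) & (vq->num - 1);
	return head;
}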

Another example could be with vhost-net / vhost-vdpa. Currently each 
queue tracks 3 separate indices and keeps a per-descriptor bookkeeping 
table to handle buffers completing out of order. If the backend knows 
data is FIFO, we might be able to drop these trackers and just use a 
head and tail counter with a single contiguous iovec ring. This could 
result in a smaller cache footprint and fewer DMAs.
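
Very roughly, and again purely hypothetical rather than the current
vhost-net code, the bookkeeping could collapse to something like:

#include <stdint.h>
#include <sys/uio.h>

/* Free-running head/tail counters over one contiguous iovec ring;
 * no per-descriptor lookup table, because completion order is FIFO. */
struct toy_fifo {
	struct iovec *iov;	/* num entries, allocated contiguously */
	uint32_t num;		/* ring size (power of two) */
	uint32_t head;		/* producer: next slot to fill */
	uint32_t tail;		/* consumer: next slot to complete */
};

/* Stash an in-flight buffer. */
static int toy_push(struct toy_fifo *r, struct iovec iov)
{
	if (r->head - r->tail == r->num)
		return -1;			/* ring full */
	r->iov[r->head++ & (r->num - 1)] = iov;
	return 0;
}

/* Complete the oldest buffer: always the tail, by the FIFO guarantee. */
static struct iovec *toy_pop(struct toy_fifo *r)
{
	if (r->head == r->tail)
		return NULL;			/* ring empty */
	return &r->iov[r->tail++ & (r->num - 1)];
}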

Jonah

>>
>>>>
>>>>
>>>>> Signed-off-by: Jason Wang <jasowang@...hat.com>
>>>>> ---
>>>>>   drivers/virtio/virtio_ring.c | 423 +++++++++++++++++++++++++++++++++--
>>>>>   1 file changed, 402 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>> index 27a9459a0555..21d456392ba0 100644
>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>> @@ -70,11 +70,14 @@
>>>>>   enum vq_layout {
>>>>>        SPLIT = 0,
>>>>>        PACKED,
>>>>> +     SPLIT_IN_ORDER,
>>>>> +     PACKED_IN_ORDER,
>>>>>        VQ_TYPE_MAX,
>>>>>   };
>>>>>
>>>>>   struct vring_desc_state_split {
>>>>>        void *data;                     /* Data for callback. */
>>>>> +     u32 total_len;                  /* Buffer Length */
>>>>>
>>>>>        /* Indirect desc table and extra table, if any. These two will be
>>>>>         * allocated together. So we won't stress more to the memory allocator.
>>>>> @@ -84,6 +87,7 @@ struct vring_desc_state_split {
>>>>>
>>>>>   struct vring_desc_state_packed {
>>>>>        void *data;                     /* Data for callback. */
>>>>> +     u32 total_len;                  /* Buffer Length */
>>>>>
>>>>>        /* Indirect desc table and extra table, if any. These two will be
>>>>>         * allocated together. So we won't stress more to the memory allocator.
>>>>
>>>> We are bloating up the cache footprint for everyone,
>>>> so there's a chance of regressions.
>>>> Pls include benchmark for in order off, to make sure we
>>>> are not regressing.
>>>
>>> Ok.
>>>
>>>> How big was the ring?
>>>
>>> 256.
>>
>> That is very modest; you want to fill at least one cache way,
>> preferably more.
> 
> I can test larger queue sizes.
> 
>>
>>>> Worth trying with a biggish one, where there is more cache
>>>> pressure.
>>>
>>> Ok.
>>>
>>>>
>>>>
>>>> Why not have a separate state for in-order?
>>>
>>> It can work.
>>>
>>>>
>>>>
>>>>
>>>>> @@ -206,6 +210,12 @@ struct vring_virtqueue {
>>>>>
>>>>>        /* Head of free buffer list. */
>>>>>        unsigned int free_head;
>>>>> +
>>>>> +     /* Head of the batched used buffers, vq->num means no batching */
>>>>> +     unsigned int batch_head;
>>>>> +
>>>>> +     unsigned int batch_len;
>>>>> +
>>>>
>>>> Are these two only used for in-order? Please document that.
>>>
>>> Yes, I will do that.
>>>
>>>> I also want some documentation about the batching trickery
>>>> used please.
>>>> What is batched, when, how is batching flushed, why are we
>>>> only batching in-order ...
>>>
>>> I'm not sure I follow; what you want seems to be device behaviour
>>> that is already stated by the spec, or I may be missing something
>>> here.
>>
>> "a single used ring entry with the id corresponding to the
>>   head entry of the descriptor chain describing the last buffer in the
>>   batch"
>> ?
> 
> Exactly.
> 
>>
>> so together they form this used ring entry describing the last buffer?
>> "head" is the id and "len" the length?
> 
> Yes.
> 
>>
>> maybe
>>
>>          /*
>>           * With IN_ORDER, devices write a single used ring entry with
>>           * the id corresponding to the head entry of the descriptor chain
>>           * describing the last buffer in the batch
>>           */
>>          struct used_entry {
>>                  u32 id;
>>                  u32 len;
>>          } batch_last;
>>
>> ?
> 
> This should be fine.
> 
>>
>>
>>
>>
>>>>
>>>>
>>>>
>>>>
>>>>>        /* Number we've added since last sync. */
>>>>>        unsigned int num_added;
>>>>>
>>>>> @@ -256,10 +266,14 @@ static void vring_free(struct virtqueue *_vq);
>>>>>
>>>>>   #define to_vvq(_vq) container_of_const(_vq, struct vring_virtqueue, vq)
>>>>>
>>>>> -
>>>>>   static inline bool virtqueue_is_packed(const struct vring_virtqueue *vq)
>>>>>   {
>>>>> -     return vq->layout == PACKED;
>>>>> +     return vq->layout == PACKED || vq->layout == PACKED_IN_ORDER;
>>>>> +}
>>>>> +
>>>>> +static inline bool virtqueue_is_in_order(const struct vring_virtqueue *vq)
>>>>> +{
>>>>> +     return vq->layout == SPLIT_IN_ORDER || vq->layout == PACKED_IN_ORDER;
>>>>>   }
>>>>>
>>>>>   static bool virtqueue_use_indirect(const struct vring_virtqueue *vq,
>>>>> @@ -570,7 +584,7 @@ static inline int virtqueue_add_split(struct vring_virtqueue *vq,
>>>>>        struct vring_desc_extra *extra;
>>>>>        struct scatterlist *sg;
>>>>>        struct vring_desc *desc;
>>>>> -     unsigned int i, n, c, avail, descs_used, err_idx;
>>>>> +     unsigned int i, n, c, avail, descs_used, err_idx, total_len = 0;
>>>>
>>>>
>>>> I would add a comment here:
>>>>
>>>> /* Total length for in-order */
>>>> unsigned int total_len = 0;
>>>
>>> Ok.
>>>
>>> Thanks
>>
> 

