Message-ID: <CAJaqyWcf3tz17q6G=123Xb+warf8Ckg=PLaPkzLU9hYHiUy9Zg@mail.gmail.com>
Date: Wed, 15 Oct 2025 12:36:47 +0200
From: Eugenio Perez Martin <eperezma@...hat.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Maxime Coquelin <mcoqueli@...hat.com>, Yongji Xie <xieyongji@...edance.com>,
virtualization@...ts.linux.dev, linux-kernel@...r.kernel.org,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, Dragos Tatulea DE <dtatulea@...dia.com>, jasowang@...hat.com
Subject: Re: [RFC 1/2] virtio_net: timeout control virtqueue commands
On Wed, Oct 15, 2025 at 10:09 AM Michael S. Tsirkin <mst@...hat.com> wrote:
>
> On Wed, Oct 15, 2025 at 10:03:49AM +0200, Maxime Coquelin wrote:
> > On Wed, Oct 15, 2025 at 9:45 AM Eugenio Perez Martin
> > <eperezma@...hat.com> wrote:
> > >
> > > On Wed, Oct 15, 2025 at 9:05 AM Michael S. Tsirkin <mst@...hat.com> wrote:
> > > >
> > > > On Wed, Oct 15, 2025 at 08:52:50AM +0200, Eugenio Perez Martin wrote:
> > > > > On Wed, Oct 15, 2025 at 8:33 AM Michael S. Tsirkin <mst@...hat.com> wrote:
> > > > > >
> > > > > > On Wed, Oct 15, 2025 at 08:08:31AM +0200, Eugenio Perez Martin wrote:
> > > > > > > On Tue, Oct 14, 2025 at 11:25 AM Michael S. Tsirkin <mst@...hat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 14, 2025 at 11:14:40AM +0200, Maxime Coquelin wrote:
> > > > > > > > > On Tue, Oct 14, 2025 at 10:29 AM Michael S. Tsirkin <mst@...hat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 07, 2025 at 03:06:21PM +0200, Eugenio Pérez wrote:
> > > > > > > > > > > An userland device implemented through VDUSE could take rtnl forever if
> > > > > > > > > > > the virtio-net driver is running on top of virtio_vdpa. Let's break the
> > > > > > > > > > > device if it does not return the buffer in a longer-than-assumible
> > > > > > > > > > > timeout.
> > > > > > > > > >
> > > > > > > > > > So now I can't debug qemu with gdb because guest dies :(
> > > > > > > > > > Let's not break valid use-cases please.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Instead, solve it in vduse, probably by handling cvq within
> > > > > > > > > > kernel.
> > > > > > > > >
> > > > > > > > > Would a shadow control virtqueue implementation in the VDUSE driver work?
> > > > > > > > > It would ack systematically messages sent by the Virtio-net driver,
> > > > > > > > > and so assume the userspace application will Ack them.
> > > > > > > > >
> > > > > > > > > When the userspace application handles the message, if the handling fails,
> > > > > > > > > it somehow marks the device as broken?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Maxime
> > > > > > > >
> > > > > > > > Yes but it's a bit more convoluted than just acking them.
> > > > > > > > Once you use the buffer you can get another one and so on
> > > > > > > > with no limit.
> > > > > > > > One fix is to actually maintain device state in the
> > > > > > > > kernel, update it, and then notify userspace.
> > > > > > > >
> > > > > > >
> > > > > > > I thought of implementing this approach at first, but it has two drawbacks.
> > > > > > >
> > > > > > > The first one: it's racy. Let's say the driver updates the MAC filter,
> > > > > > > VDUSE timeout occurs, the guest receives the fail, and then the device
> > > > > > > replies with an OK. There is no way for the device or VDUSE to update
> > > > > > > the driver.
> > > > > >
> > > > > > There's no timeout. Kernel can guarantee executing all requests.
> > > > > >
> > > > >
> > > > > I don't follow this. How should the VDUSE kernel module act if the
> > > > > VDUSE userland device does not use the CVQ buffer then?
> > > >
> > > > First I am not sure a VQ is the best interface for talking to userspace.
> > > > But assuming yes - just avoid sending more data, send it later after
> > > > userspace used the buffer.
> > > >
> > >
> > > Let me take a step back, I think I didn't describe the scenario well enough.
> > >
> > > We have a VDUSE device, and then the same host is interacting with the
> > > device through the virtio_net driver over virtio_vdpa.
> > >
> > > Then, the virtio_net driver sends a control command through its CVQ, so
> > > it *takes the RTNL*. That command reaches the VDUSE CVQ.
> > >
> > > It does not matter if the VDUSE device in the userland processes the
> > > commands through a CVQ, reading the vduse character device, or another
> > > system. The question is: what to do if the VDUSE device does not
> > > process that command in a timely manner? Should we just let the RTNL
> > > be taken forever?
> > >
> >
> > My understanding is that:
> > 1. Virtio-net sends a control message, waits for reply
> > 2. VDUSE driver dequeues it, adds it to the SCVQ, replies OK to the CVQ
> > 3. Userspace application dequeues the message from the SCVQ
> > a. If handling is successful it replies OK
> > b. If handling fails, replies ERROR
If that's the case, everything would be OK. In both cases, the
RTNL is held only for that long. The problem is when the VDUSE
userland device does not reply.
> > 4. VDUSE driver reads the reply
> > a. if OK, do nothing
> > b. if ERROR, mark the device as broken?
> >
> > This is simplified as it does not take into account SCVQ overflow if
> > the application is stuck.
> > If IIUC, Michael suggests to only enqueue a single message at the time
> > in the SVQ,
> > and bufferize the pending messages in the VDUSE driver.
But the RTNL is still held during that whole process, isn't it?
>
> Not exactly bufferize, record. E.g. we do not need to send
> 100 messages to enable/disable promisc mode - together they
> have no effect.
>
I still don't follow how that unlocks the RTNL. Let me lay out some workflows:

1) MAC_TABLE_SET

The driver sets a table of MAC addresses, (A, B, C). VDUSE sends this
table to the VDUSE userland device, as we don't have more information.
Now the driver sends a new table with addresses (A, B, D), but the
device still hasn't replied to the VDUSE driver.

VDUSE should track that the new state is (A, B, D), and then wait for
the device to reply to the previous request? What should we report to
the driver in the meantime? If we wait for the device to reply, we're
in the same situation regarding the RTNL.

Now we receive a new state (A, B, E). We haven't sent (A, B, D) yet,
so it is fine to just replace (A, B, D) with it, and send it when
(A, B, C) completes with either success or failure.

2) VQ_PAIRS_SET

The driver starts with 1 vq pair. Now the driver sets 3 vq pairs, and
the VDUSE CVQ forwards the command. The driver still thinks that it is
using 1 vq pair. I can record that the driver requested 3, and that the
request is still in flight. Now the timeout occurs, so the VDUSE device
returns a failure to the driver, and the driver frees the vq regions
etc. After that, the device replies OK. The memory that was sent as the
new vqs' avail ring and descriptor ring now contains garbage, and the
device could start overwriting unrelated memory.

Not even VQ_RESET protects against it, as there is still a window
between the CMD set and the VQ reset.
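
To make workflow 1 concrete, here is a minimal Python sketch of the "record, don't buffer" idea as I understand it. It is only a model, not VDUSE code: CvqRecorder, driver_sets() and device_replies() are names made up for illustration, and it assumes at most one command is outstanding towards the userland device at a time.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

MacTable = FrozenSet[str]


@dataclass
class CvqRecorder:
    """Hypothetical model: one command in flight, newest state recorded."""

    in_flight: Optional[MacTable] = None  # sent to userland, no reply yet
    pending: Optional[MacTable] = None    # newest driver state, not sent yet
    sent: int = 0                         # commands that reached userland

    def driver_sets(self, table: Iterable[str]) -> None:
        new_table = frozenset(table)
        if self.in_flight is None:
            # Nothing outstanding: forward to the userland device right away.
            self.in_flight = new_table
            self.sent += 1
        else:
            # Something is outstanding: overwrite whatever was recorded,
            # so (A, B, D) is simply replaced by (A, B, E).
            self.pending = new_table

    def device_replies(self) -> Optional[MacTable]:
        """Userland completed the in-flight command (success or failure).

        Returns the table that is forwarded next, if any.
        """
        self.in_flight, self.pending = self.pending, None
        if self.in_flight is not None:
            self.sent += 1
        return self.in_flight


rec = CvqRecorder()
rec.driver_sets("ABC")   # forwarded immediately
rec.driver_sets("ABD")   # recorded, not forwarded
rec.driver_sets("ABE")   # replaces (A, B, D)
assert rec.in_flight == frozenset("ABC") and rec.sent == 1
assert rec.device_replies() == frozenset("ABE") and rec.sent == 2
```

The sketch bounds the state kept in the kernel, but it says nothing about what the driver is told while (A, B, C) is still in flight, which is exactly the RTNL question above.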