[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACGkMEuZpk8QcrUQSOxqt6j3F9Ge-HdSs5-18FayMMQmH3Tcmg@mail.gmail.com>
Date: Sun, 23 Apr 2023 14:30:18 +0800
From: Jason Wang <jasowang@...hat.com>
To: Maxime Coquelin <maxime.coquelin@...hat.com>
Cc: xieyongji@...edance.com, mst@...hat.com, david.marchand@...hat.com,
linux-kernel@...r.kernel.org,
virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
xuanzhuo@...ux.alibaba.com, eperezma@...hat.com,
Peter Xu <peterx@...hat.com>
Subject: Re: [RFC 0/2] vduse: add support for networking devices
On Fri, Apr 21, 2023 at 10:28 PM Maxime Coquelin
<maxime.coquelin@...hat.com> wrote:
>
>
>
> On 4/21/23 07:51, Jason Wang wrote:
> > On Thu, Apr 20, 2023 at 10:16 PM Maxime Coquelin
> > <maxime.coquelin@...hat.com> wrote:
> >>
> >>
> >>
> >> On 4/20/23 06:34, Jason Wang wrote:
> >>> On Wed, Apr 19, 2023 at 9:43 PM Maxime Coquelin
> >>> <maxime.coquelin@...hat.com> wrote:
> >>>>
> >>>> This small series enables virtio-net device type in VDUSE.
> >>>> With it, basic operation have been tested, both with
> >>>> virtio-vdpa and vhost-vdpa using DPDK Vhost library series
> >>>> adding VDUSE support [0] using split rings layout.
> >>>>
> >>>> Control queue support (and so multiqueue) has also been
> >>>> tested, but require a Kernel series from Jason Wang
> >>>> relaxing control queue polling [1] to function reliably.
> >>>>
> >>>> Other than that, we have identified a few gaps:
> >>>>
> >>>> 1. Reconnection:
> >>>> a. VDUSE_VQ_GET_INFO ioctl() returns always 0 for avail
> >>>> index, even after the virtqueue has already been
> >>>> processed. Is that expected? I have tried instead to
> >>>> get the driver's avail index directly from the avail
> >>>> ring, but it does not seem reliable as I sometimes get
> >>>> "id %u is not a head!\n" warnings. Also such solution
> >>>> would not be possible with packed ring, as we need to
> >>>> know the wrap counters values.
> >>>
> >>> Looking at the codes, it only returns the value that is set via
> >>> set_vq_state(). I think it is expected to be called before the
> >>> datapath runs.
> >>>
> >>> So when bound to virtio-vdpa, it is expected to return 0. But we need
> >>> to fix the packed virtqueue case, I wonder if we need to call
> >>> set_vq_state() explicitly in virtio-vdpa before starting the device.
> >>>
> >>> When bound to vhost-vdpa, Qemu will call VHOST_SET_VRING_BASE which
> >>> will end up a call to set_vq_state(). Unfortunately, it doesn't
> >>> support packed ring which needs some extension.
> >>>
> >>>>
> >>>> b. Missing IOCTLs: it would be handy to have new IOCTLs to
> >>>> query Virtio device status,
> >>>
> >>> What's the use case of this ioctl? It looks to me userspace is
> >>> notified on each status change now:
> >>>
> >>> static int vduse_dev_set_status(struct vduse_dev *dev, u8 status)
> >>> {
> >>> struct vduse_dev_msg msg = { 0 };
> >>>
> >>> msg.req.type = VDUSE_SET_STATUS;
> >>> msg.req.s.status = status;
> >>>
> >>> return vduse_dev_msg_sync(dev, &msg);
> >>> }
> >>
> >> The idea was to be able to query the status at reconnect time, and
> >> neither having to assume its value nor having to store its value in a
> >> file (the status could change while the VDUSE application is stopped,
> >> but maybe it would receive the notification at reconnect).
> >
> > I see.
> >
> >>
> >> I will prototype using a tmpfs file to save needed information, and see
> >> if it works.
> >
> > It might work but then the API is not self contained. Maybe it's
> > better to have a dedicated ioctl.
> >
> >>
> >>>> and retrieve the config
> >>>> space set at VDUSE_CREATE_DEV time.
> >>>
> >>> In order to be safe, VDUSE avoids writable config space. Otherwise
> >>> drivers could block on config writing forever. That's why we don't do
> >>> it now.
> >>
> >> The idea was not to make the config space writable, but just to be able
> >> to fetch what was filled at VDUSE_CREATE_DEV time.
> >>
> >> With the tmpfs file, we can avoid doing that and just save the config
> >> space there.
> >
> > Same as the case for status.
>
> I have cooked a DPDK patch to support reconnect with a tmpfs file as
> suggested by Yongji:
>
> https://gitlab.com/mcoquelin/dpdk-next-virtio/-/commit/53913f2b1155b02c44d5d3d298aafd357e7a8c48
This seems tricky, for example for status:
dev->log->status = dev->status;
What if we crash here?
>
> That's still rough around the edges, but it seems to work reliably
> for the testing I have done so far. We'll certainly want to use the
> tmpfs memory to directly store available indexes and wrap counters to
> avoid introducing overhead in the datapath.
That's fine, we probably need a chapter in the kernel doc to describe
the reliable reconnection but it is not limited to tmpfs.
> The tricky part will be to
> manage NUMA affinity.
This part is not clear to me, what affinity should we care about?
There's a sysfs that was invented by YongJi for virtqueue affinity
management recently.
Thanks
>
> Regards,
> Maxime
>
> >
> > Thanks
> >
> >>
> >>> We need to harden the config write before we can proceed to this I think.
> >>>
> >>>>
> >>>> 2. VDUSE application as non-root:
> >>>> We need to run the VDUSE application as non-root. There
> >>>> is some race between the time the UDEV rule is applied
> >>>> and the time the device starts being used. Discussing
> >>>> with Jason, he suggested we may have a VDUSE daemon run
> >>>> as root that would create the VDUSE device, manages its
> >>>> rights and then pass its file descriptor to the VDUSE
> >>>> app. However, with current IOCTLs, it means the VDUSE
> >>>> daemon would need to know several information that
> >>>> belongs to the VDUSE app implementing the device such
> >>>> as supported Virtio features, config space, etc...
> >>>> If we go that route, maybe we should have a control
> >>>> IOCTL to create the device which would just pass the
> >>>> device type. Then another device IOCTL to perform the
> >>>> initialization. Would that make sense?
> >>>
> >>> I think so. We can hear from others.
> >>>
> >>>>
> >>>> 3. Coredump:
> >>>> In order to be able to perform post-mortem analysis, DPDK
> >>>> Vhost library marks pages used for vrings and descriptors
> >>>> buffers as MADV_DODUMP using madvise(). However with
> >>>> VDUSE it fails with -EINVAL. My understanding is that we
> >>>> set VM_DONTEXPAND flag to the VMAs and madvise's
> >>>> MADV_DODUMP fails if it is present. I'm not sure to
> >>>> understand why madvise would prevent MADV_DODUMP if
> >>>> VM_DONTEXPAND is set. Any thoughts?
> >>>
> >>> Adding Peter who may know the answer.
> >>
> >> Thanks!
> >> Maxime
> >>
> >>> Thanks
> >>>
> >>>>
> >>>> [0]: https://patchwork.dpdk.org/project/dpdk/list/?series=27594&state=%2A&archive=both
> >>>> [1]: https://lore.kernel.org/lkml/CACGkMEtgrxN3PPwsDo4oOsnsSLJfEmBEZ0WvjGRr3whU+QasUg@mail.gmail.com/T/
> >>>>
> >>>> Maxime Coquelin (2):
> >>>> vduse: validate block features only with block devices
> >>>> vduse: enable Virtio-net device type
> >>>>
> >>>> drivers/vdpa/vdpa_user/vduse_dev.c | 11 +++++++----
> >>>> 1 file changed, 7 insertions(+), 4 deletions(-)
> >>>>
> >>>> --
> >>>> 2.39.2
> >>>>
> >>>
> >>
> >
>
Powered by blists - more mailing lists