netdev - Re: vdpa legacy guest support (was Re: [PATCH] vdpa/mlx5: set

Message-ID: <20211216205958-mutt-send-email-mst@kernel.org>
Date:   Thu, 16 Dec 2021 21:00:59 -0500
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Jason Wang <jasowang@...hat.com>
Cc:     Si-Wei Liu <si-wei.liu@...cle.com>, Eli Cohen <elic@...dia.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        virtualization <virtualization@...ts.linux-foundation.org>,
        netdev <netdev@...r.kernel.org>
Subject: Re: vdpa legacy guest support (was Re: [PATCH] vdpa/mlx5:
 set_features should allow reset to zero)

On Fri, Dec 17, 2021 at 09:57:38AM +0800, Jason Wang wrote:
> On Fri, Dec 17, 2021 at 6:32 AM Si-Wei Liu <si-wei.liu@...cle.com> wrote:
> >
> >
> >
> > On 12/15/2021 6:53 PM, Jason Wang wrote:
> > > On Thu, Dec 16, 2021 at 10:02 AM Si-Wei Liu <si-wei.liu@...cle.com> wrote:
> > >>
> > >>
> > >> On 12/15/2021 1:33 PM, Michael S. Tsirkin wrote:
> > >>> On Wed, Dec 15, 2021 at 12:52:20PM -0800, Si-Wei Liu wrote:
> > >>>> On 12/14/2021 6:06 PM, Jason Wang wrote:
> > >>>>> On Wed, Dec 15, 2021 at 9:05 AM Si-Wei Liu <si-wei.liu@...cle.com> wrote:
> > >>>>>> On 12/13/2021 9:06 PM, Michael S. Tsirkin wrote:
> > >>>>>>> On Mon, Dec 13, 2021 at 05:59:45PM -0800, Si-Wei Liu wrote:
> > >>>>>>>> On 12/12/2021 1:26 AM, Michael S. Tsirkin wrote:
> > >>>>>>>>> On Fri, Dec 10, 2021 at 05:44:15PM -0800, Si-Wei Liu wrote:
> > >>>>>>>>>> Sorry for reviving this ancient thread. I was kinda lost for the conclusion
> > >>>>>>>>>> it ended up with. I have the following questions,
> > >>>>>>>>>>
> > >>>>>>>>>> 1. legacy guest support: from the past conversations it doesn't seem the
> > >>>>>>>>>> support will be completely dropped from the table, is my understanding
> > >>>>>>>>>> correct? Actually we're interested in supporting virtio v0.95 guest for x86,
> > >>>>>>>>>> which is backed by the spec at
> > >>>>>>>>>> https://ozlabs.org/~rusty/virtio-spec/virtio-0.9.5.pdf . Though I'm not sure
> > >>>>>>>>>> if there's request/need to support wilder legacy virtio versions earlier
> > >>>>>>>>>> beyond.
> > >>>>>>>>> I personally feel it's less work to add in kernel than try to
> > >>>>>>>>> work around it in userspace. Jason feels differently.
> > >>>>>>>>> Maybe post the patches and this will prove to Jason it's not
> > >>>>>>>>> too terrible?
> > >>>>>>>> I suppose if the vdpa vendor does support 0.95 in the datapath and ring
> > >>>>>>>> layout level and is limited to x86 only, there should be easy way out.
> > >>>>>>> Note a subtle difference: what matters is that guest, not host is x86.
> > >>>>>>> Matters for emulators which might reorder memory accesses.
> > >>>>>>> I guess this enforcement belongs in QEMU then?
> > >>>>>> Right, I mean to get started, the initial guest driver support and the
> > >>>>>> corresponding QEMU support for transitional vdpa backend can be limited
> > >>>>>> to x86 guest/host only. Since the config space is emulated in QEMU, I
> > >>>>>> suppose it's not hard to enforce in QEMU.
> > >>>>> It's more than just config space, most devices have headers before the buffer.
> > >>>> The ordering in datapath (data VQs) would have to rely on vendor's support.
> > >>>> Since ORDER_PLATFORM is pretty new (v1.1), I guess vdpa h/w vendor nowadays
> > >>>> can/should well support the case when ORDER_PLATFORM is not acked by the
> > >>>> driver (actually this feature is filtered out by the QEMU vhost-vdpa driver
> > >>>> today), even with v1.0 spec conforming and modern only vDPA device. The
> > >>>> control VQ is implemented in software in the kernel, which can be easily
> > >>>> accommodated/fixed when needed.
> > >>>>
> > >>>>>> QEMU can drive GET_LEGACY,
> > >>>>>> GET_ENDIAN et al ioctls in advance to get the capability from the
> > >>>>>> individual vendor driver. For that, we need another negotiation protocol
> > >>>>>> similar to vhost_user's protocol_features between the vdpa kernel and
> > >>>>>> QEMU, way before the guest driver is ever probed and its feature
> > >>>>>> negotiation kicks in. Not sure we need a GET_MEMORY_ORDER ioctl call
> > >>>>>> from the device, but we can assume weak ordering for legacy at this
> > >>>>>> point (x86 only)?
> > >>>>> I'm lost here, we have get_features() so:
> > >>>> I assume here you refer to get_device_features() that Eli just changed the
> > >>>> name.
> > >>>>> 1) VERSION_1 means the device uses LE if provided, otherwise native
> > >>>>> 2) ORDER_PLATFORM means device requires platform ordering
> > >>>>>
> > >>>>> Any reason for having a new API for this?
> > >>>> Are you going to enforce all vDPA hardware vendors to support the
> > >>>> transitional model for legacy guest?
> > > Do we really have other choices?
> > >
> > > I suspect the legacy device is never implemented by any vendor:
> > >
> > > 1) no virtio way to detect host endian
> > This is even true for transitional device that is conforming to the
> > spec, right?
> 
> For hardware, yes.
> 
> > The transport specific way to detect host endian is still
> > being discussed and the spec revision is not finalized yet so far as I
> > see. Why this suddenly becomes a requirement/blocker for h/w vendors to
> > implement the transitional model?
> 
> It's not a sudden blocker, the problem has existed since day 0 if I
> was not wrong. That's why the problem looks a little bit complicated
> and why it would be much simpler if we stick to modern devices.
> 
> > Even if the spec is out, this is
> > pretty new and I suspect not all vendors would follow right away. I hope
> > the software framework can be tolerant with h/w vendors not supporting
> > host endianness (BE specifically) or not detecting it if they would like
> > to support a transitional device for legacy.
> 
> Well, if we know we don't want to support the BE host it would be fine.

I think you guys mean guest not host here. Same for memory ordering etc.
What matters is whether guest has barriers etc.
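
To make the guest-vs-host point concrete, here is a minimal sketch of how
ring byte order and ordering requirements follow from the negotiated
feature bits. The bit numbers are the ones the virtio spec assigns; the
helper names (has_feature, ring_is_little_endian, needs_platform_ordering)
are made up for illustration and are not an existing kernel or QEMU API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as assigned by the virtio specification. */
#define VIRTIO_F_VERSION_1      32
#define VIRTIO_F_ORDER_PLATFORM 36

/* Hypothetical helper: true if the negotiated feature set contains the bit. */
static bool has_feature(uint64_t features, unsigned int bit)
{
    return (features & (1ULL << bit)) != 0;
}

/*
 * Hypothetical helper: byte order used for the rings and device headers.
 * With VERSION_1 negotiated the layout is always little-endian; without
 * it (legacy interface) the layout follows the guest's native byte order,
 * not the host's.
 */
static bool ring_is_little_endian(uint64_t features, bool guest_is_le)
{
    return has_feature(features, VIRTIO_F_VERSION_1) ? true : guest_is_le;
}

/*
 * Hypothetical helper: true if the driver acked ORDER_PLATFORM, i.e. it
 * promised to use barriers suitable for the platform rather than the
 * weaker ones that suffice for a software device.
 */
static bool needs_platform_ordering(uint64_t features)
{
    return has_feature(features, VIRTIO_F_ORDER_PLATFORM);
}
```

With no features acked (a legacy driver), ring_is_little_endian() simply
returns whatever the guest is, which is why an x86-only restriction would
have to be expressed in terms of the guest.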

> >
> > > 2) bypass IOMMU with translated requests
> > > 3) PIO port
> > >
> > > Yes we have enp_vdpa, but it's more like a "transitional device" for
> > > legacy only guests.
> > >
> > >> meaning guest not acknowledging
> > >>>> VERSION_1 would use the legacy interfaces captured in the spec section 7.4
> > >>>> (regarding ring layout, native endianness, message framing, vq alignment of
> > >>>> 4096, 32bit feature, no features_ok bit in status, IO port interface i.e.
> > >>>> all the things) instead?
> > > Note that we only care about the datapath, control path is mediated anyhow.
> > >
> > > So feature_ok and IO port isn't an issue. The rest looks like a must
> > > for the hardware.
> > H/W vendors can opt out of implementing transitional interfaces at all,
> > which limits them to a modern-only device. Setting endianness detection (via
> > transport specific means) aside, for vendors that wish to support
> > transitional device with legacy interface, is it a hard stop to drop
> > supporting BE host if everything else is there? The spec today doesn't
> > define virtio specific means to detect host memory ordering or device
> > memory coherency,
> 
> Any reason that we need to care about memory coherency at the virtio
> level? I'd expect it's the task of the transport.
> 
> > will it yet become a stopper another day for h/w
> > vendor to support more platforms?
> 
> Let's differentiate virtio from vdpa here. For virtio, there's no way
> to add any feature for legacy devices. We can only add memory feature
> detection for modern devices.
> 
> But for vDPA, we can introduce any API that can help vendors to
> present a transitional device. But we can't force those APIs since it's
> too late to do that. So transitional devices support is optional for
> sure.
> 
> >
> > >
> > >> Noted we don't yet have a set_device_features()
> > >>>> that allows the vdpa device to tell whether it is operating in transitional
> > >>>> or modern-only mode.
> > > So the device feature should be provisioned via the netlink protocol.
> > Such netlink interface will only be used to limit feature exposure,
> > right? i.e. you can limit a transitional supporting vendor driver to
> > offering modern-only interface,
> 
> There's no way for the management to force a feature, like VERSION_1
> via the current protocol.
> 
> > but you never want to make a modern-only
> > vendor driver to support transitional (I'm not sure if it's a good idea
> > to support all the translation in software, esp. for datapath).
> 
> You may hit this problem for sure, you can't force all vendors to
> support transitional devices especially considering spec said legacy
> is optional. We don't want to end up with a userspace code that can
> only work for some specific vendors.
> 
> > > And what we want is not "set_device_feature()" but
> > > "set_device_mandatory_feature()", then the parent can choose to fail
> > > the negotiation when VERSION_1 is not negotiated.
> > This assumes the transport specific detection of BE host is in place,
> > right?
> 
> Again, the point is, we can not assume such detection works for all of
> the vendors. And assuming BE detection is ready, we still need this for
> modern devices, don't we?
> 
> > I am not clear who initiates the set_device_mandatory_feature()
> > call, QEMU during guest feature negotiation, or admin user setting it
> > ahead via netlink?
> 
> Netlink, actually, the spec needs to be extended as well, we saw
> similar requests in the past. E.g there could be a device that works
> in a packed layout only.
> 
> Thanks
> 
> >
> > Thanks,
> > -Siwei
> >
> > >   Qemu then knows for
> > > sure it talks to a transitional device or modern only device.
> > >
> > > Thanks
> > >
> > >> For software virtio, all support for the legacy part in
> > >>>> a transitional model has been built up there already, however, it's not easy
> > >>>> for vDPA vendors to implement all the requirements for an all-or-nothing
> > >>>> legacy guest support (big endian guest for example). To these vendors, the
> > >>>> legacy support within a transitional model is more of feature to them and
> > >>>> it's best to leave some flexibility for them to implement partial support
> > >>>> for legacy. That in turn calls out the need for a vhost-user protocol
> > >>>> feature like negotiation API that can prohibit those unsupported guest
> > >>>> setups to as early as backend_init before launching the VM.
> > >>> Right. Of note is the fact that it's a spec bug which I
> > >>> hope yet to fix, though due to existing guest code the
> > >>> fix won't be complete.
> > >> I thought at one point you pointed out to me that the spec does allow
> > >> config space read before claiming features_ok, and only config write
> > >> before features_ok is prohibited. I haven't read up the full thread of
> > >> Halil's VERSION_1 for transitional big endian device yet, but what is
> > >> the spec bug you hope to fix?
> > >>
> > >>> WRT ioctls, One thing we can do though is abuse set_features
> > >>> where it's called by QEMU early on with just the VERSION_1
> > >>> bit set, to distinguish between legacy and modern
> > >>> interface. This before config space accesses and FEATURES_OK.
> > >>>
> > >>> Halil has been working on this, pls take a look and maybe help him out.
> > >> Interesting thread, am reading now and see how I may leverage or help there.
> > >>
> > >>>>>>>> I
> > >>>>>>>> checked with Eli and other Mellanox/NVIDIA folks for hardware/firmware level
> > >>>>>>>> 0.95 support, it seems all the ingredient had been there already dated back
> > >>>>>>>> to the DPDK days. The only major thing limiting is in the vDPA software that
> > >>>>>>>> the current vdpa core has the assumption around VIRTIO_F_ACCESS_PLATFORM for
> > >>>>>>>> a few DMA setup ops, which is virtio 1.0 only.
> > >>>>>>>>
> > >>>>>>>>>> 2. suppose some form of legacy guest support needs to be there, how do we
> > >>>>>>>>>> deal with the bogus assumption below in vdpa_get_config() in the short term?
> > >>>>>>>>>> It looks one of the intuitive fix is to move the vdpa_set_features call out
> > >>>>>>>>>> of vdpa_get_config() to vdpa_set_config().
> > >>>>>>>>>>
> > >>>>>>>>>>              /*
> > >>>>>>>>>>               * Config accesses aren't supposed to trigger before features
> > >>>>>>>>>>               * are set.
> > >>>>>>>>>>               * If it does happen we assume a legacy guest.
> > >>>>>>>>>>               */
> > >>>>>>>>>>              if (!vdev->features_valid)
> > >>>>>>>>>>                      vdpa_set_features(vdev, 0);
> > >>>>>>>>>>              ops->get_config(vdev, offset, buf, len);
> > >>>>>>>>>>
> > >>>>>>>>>> I can post a patch to fix 2) if there's consensus already reached.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> -Siwei
> > >>>>>>>>> I'm not sure how important it is to change that.
> > >>>>>>>>> In any case it only affects transitional devices, right?
> > >>>>>>>>> Legacy only should not care ...
> > >>>>>>>> Yes I'd like to distinguish legacy driver (suppose it is 0.95) against the
> > >>>>>>>> modern one in a transitional device model rather than being legacy only.
> > >>>>>>>> That way a v0.95 and v1.0 supporting vdpa parent can support both types of
> > >>>>>>>> guests without having to reconfigure. Or are you suggesting limit to legacy
> > >>>>>>>> only at the time of vdpa creation would simplify the implementation a lot?
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> -Siwei
> > >>>>>>> I don't know for sure. Take a look at the work Halil was doing
> > >>>>>>> to try and support transitional devices with BE guests.
> > >>>>>> Hmmm, we can have those endianness ioctls defined but the initial QEMU
> > >>>>>> implementation can be started to support x86 guest/host with little
> > >>>>>> endian and weak memory ordering first. The real trick is to detect
> > >>>>>> legacy guest - I am not sure if it's feasible to shift all the legacy
> > >>>>>> detection work to QEMU, or the kernel has to be part of the detection
> > >>>>>> (e.g. the kick before DRIVER_OK thing we have to duplicate the tracking
> > >>>>>> effort in QEMU) as well. Let me take a further look and get back.
> > >>>>> Michael may think differently but I think doing this in Qemu is much easier.
> > >>>> I think the key is whether we position emulating legacy interfaces in QEMU
> > >>>> doing translation on top of a v1.0 modern-only device in the kernel, or we
> > >>>> allow vdpa core (or you can say vhost-vdpa) and vendor driver to support a
> > >>>> transitional model in the kernel that is able to work for both v0.95 and
> > >>>> v1.0 drivers, with some slight aid from QEMU for
> > >>>> detecting/emulation/shadowing (for e.g CVQ, I/O port relay). I guess for the
> > >>>> former we still rely on vendor for a performant data vqs implementation,
> > >>>> leaving the question to what it may end up eventually in the kernel is
> > >>>> effectively the latter).
> > >>>>
> > >>>> Thanks,
> > >>>> -Siwei
> > >>> My suggestion is post the kernel patches, and we can evaluate
> > >>> how much work they are.
> > >> Thanks for the feedback. I will take some read then get back, probably
> > >> after the winter break. Stay tuned.
> > >>
> > >> Thanks,
> > >> -Siwei
> > >>
> > >>>>> Thanks
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> Meanwhile, I'll check internally to see if a legacy only model would
> > >>>>>> work. Thanks.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> -Siwei
> > >>>>>>
> > >>>>>>
> > >>>>>>>>>> On 3/2/2021 2:53 AM, Jason Wang wrote:
> > >>>>>>>>>>> On 2021/3/2 5:47 PM, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>> On Mon, Mar 01, 2021 at 11:56:50AM +0800, Jason Wang wrote:
> > >>>>>>>>>>>>> On 2021/3/1 5:34 AM, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>>> On Wed, Feb 24, 2021 at 10:24:41AM -0800, Si-Wei Liu wrote:
> > >>>>>>>>>>>>>>>> Detecting it isn't enough though, we will need a new ioctl to notify
> > >>>>>>>>>>>>>>>> the kernel that it's a legacy guest. Ugh :(
> > >>>>>>>>>>>>>>> Well, although I think adding an ioctl is doable, may I
> > >>>>>>>>>>>>>>> know what the use
> > >>>>>>>>>>>>>>> case there will be for kernel to leverage such info
> > >>>>>>>>>>>>>>> directly? Is there a
> > >>>>>>>>>>>>>>> case QEMU can't do with dedicate ioctls later if there's indeed
> > >>>>>>>>>>>>>>> differentiation (legacy v.s. modern) needed?
> > >>>>>>>>>>>>>> BTW a good API could be
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> #define VHOST_SET_ENDIAN _IOW(VHOST_VIRTIO, ?, int)
> > >>>>>>>>>>>>>> #define VHOST_GET_ENDIAN _IOW(VHOST_VIRTIO, ?, int)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> we did it per vring but maybe that was a mistake ...
> > >>>>>>>>>>>>> Actually, I wonder whether it's good time to just not support
> > >>>>>>>>>>>>> legacy driver
> > >>>>>>>>>>>>> for vDPA. Consider:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1) Its definition is non-normative
> > >>>>>>>>>>>>> 2) A lot of code burden
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> So qemu can still present the legacy device since the config
> > >>>>>>>>>>>>> space or other
> > >>>>>>>>>>>>> stuff that is presented by vhost-vDPA is not expected to be
> > >>>>>>>>>>>>> accessed by
> > >>>>>>>>>>>>> guest directly. Qemu can do the endian conversion when necessary
> > >>>>>>>>>>>>> in this
> > >>>>>>>>>>>>> case?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> Overall I would be fine with this approach but we need to avoid breaking
> > >>>>>>>>>>>> working userspace, qemu releases with vdpa support are out there and
> > >>>>>>>>>>>> seem to work for people. Any changes need to take that into account
> > >>>>>>>>>>>> and document compatibility concerns.
> > >>>>>>>>>>> Agree, let me check.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>>        I note that any hardware
> > >>>>>>>>>>>> implementation is already broken for legacy except on platforms with
> > >>>>>>>>>>>> strong ordering which might be helpful in reducing the scope.
> > >>>>>>>>>>> Yes.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> >
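
For reference, the config-read heuristic quoted earlier in the thread (a
config access before features are set is treated as a legacy guest) can be
modelled in a few lines. This is only a userspace sketch under stated
assumptions: struct mock_vdpa and the mock_* functions are hypothetical
stand-ins, not the kernel's struct vdpa_device or its config ops, and the
8-byte config space is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_F_VERSION_1 32

/* Hypothetical stand-in for a vdpa device; not the kernel structure. */
struct mock_vdpa {
    bool features_valid;   /* set once features have been written */
    uint64_t features;     /* last feature set written */
    uint8_t config[8];     /* arbitrary-sized device config space */
};

static void mock_set_features(struct mock_vdpa *dev, uint64_t features)
{
    dev->features = features;
    dev->features_valid = true;
}

/*
 * Mirrors the quoted logic: if config space is read before any features
 * were set, latch an empty feature set, i.e. assume a legacy guest.
 * Returns 0 on success, -1 if the access is out of range.
 */
static int mock_get_config(struct mock_vdpa *dev, unsigned int offset,
                           void *buf, unsigned int len)
{
    if (offset > sizeof(dev->config) || len > sizeof(dev->config) - offset)
        return -1;
    if (!dev->features_valid)
        mock_set_features(dev, 0);
    memcpy(buf, dev->config + offset, len);
    return 0;
}

/* A device counts as legacy once features are set without VERSION_1. */
static bool mock_is_legacy(const struct mock_vdpa *dev)
{
    return dev->features_valid &&
           !(dev->features & (1ULL << VIRTIO_F_VERSION_1));
}
```

In this model, an early set_features call carrying only VERSION_1, made
before the first config read, keeps a modern guest from being latched as
legacy, which is the ordering the set_features suggestion above relies on.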