Message-ID: <4ABB1D44.5000007@redhat.com>
Date: Thu, 24 Sep 2009 10:18:28 +0300
From: Avi Kivity <avi@...hat.com>
To: Gregory Haskins <gregory.haskins@...il.com>
CC: "Ira W. Snyder" <iws@...o.caltech.edu>,
"Michael S. Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org,
virtualization@...ts.linux-foundation.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, mingo@...e.hu, linux-mm@...ck.org,
akpm@...ux-foundation.org, hpa@...or.com,
Rusty Russell <rusty@...tcorp.com.au>, s.hetze@...ux-ag.com,
alacrityvm-devel@...ts.sourceforge.net
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 09/24/2009 12:15 AM, Gregory Haskins wrote:
>
>>> There are various aspects to designing high-performance virtual
>>> devices, such as providing the shortest possible paths between the
>>> physical resources and their consumers. Conversely, we also need to
>>> ensure that we meet proper isolation/protection guarantees at the same
>>> time. What this means is that there are various aspects of any
>>> high-performance PV design that need to be placed in-kernel to
>>> maximize performance yet still properly isolate the guest.
>>>
>>> For instance, you are required to have your signal-path (interrupts and
>>> hypercalls), your memory-path (gpa translation), and your
>>> addressing/isolation model in-kernel to maximize performance.
>>>
>>>
>> Exactly. That's what vhost puts into the kernel and nothing more.
>>
> Actually, no. Generally, _KVM_ puts those things into the kernel, and
> vhost consumes them. Without KVM (or something equivalent), vhost is
> incomplete. One of my goals with vbus is to generalize the "something
> equivalent" part here.
>
I don't really see how vhost and vbus are different here. vhost expects
signalling to happen through a couple of eventfds and requires someone
to supply them and implement kernel support (if needed). vbus requires
someone to write a connector to provide the signalling implementation.
Neither will work out-of-the-box when implementing virtio-net over
falling dominos, for example.
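
(For concreteness, a rough sketch of the vhost side of that interface,
assuming the VHOST_* ioctls and struct vhost_vring_file from the patch
series under discussion; error handling omitted.  Whoever owns the
transport still has to wire the two eventfds into its own signalling
path:)

#include <fcntl.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static int setup_vhost_signalling(int queue_index)
{
        int vhost_fd = open("/dev/vhost-net", O_RDWR);
        int kick_fd  = eventfd(0, 0);   /* guest -> host doorbell */
        int call_fd  = eventfd(0, 0);   /* host -> guest interrupt */
        struct vhost_vring_file kick = { .index = queue_index, .fd = kick_fd };
        struct vhost_vring_file call = { .index = queue_index, .fd = call_fd };

        ioctl(vhost_fd, VHOST_SET_OWNER);
        ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);
        ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);

        /* kick_fd/call_fd are all vhost ever sees; kvm, lguest or
         * "falling dominos" must connect them to the guest. */
        return vhost_fd;
}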
>>> Vbus accomplishes its in-kernel isolation model by providing a
>>> "container" concept, where objects are placed into this container by
>>> userspace. The host kernel enforces isolation/protection by using a
>>> namespace to identify objects that is only relevant within a specific
>>> container's context (namely, a "u32 dev-id"). The guest addresses each
>>> object by its dev-id, and the kernel ensures that the guest can't
>>> access objects outside of its dev-id namespace.
>>>
>>>
>> vhost manages to accomplish this without any kernel support.
>>
> No, vhost manages to accomplish this because of KVM's kernel support
> (ioeventfd, etc). Without KVM-like in-kernel support, vhost is
> merely a kind of "tuntap"-like clone signalled by eventfds.
>
Without a vbus-connector-falling-dominos, vbus-venet can't do anything
either. Both vhost and vbus need an interface; vhost's is just narrower
since it doesn't do configuration or enumeration.
> This goes directly to my rebuttal of your claim that vbus places too
> much in the kernel. I state that, one way or the other, address decode
> and isolation _must_ be in the kernel for performance. Vbus does this
> with a devid/container scheme. vhost+virtio-pci+kvm does it with
> pci+pio+ioeventfd.
>
vbus doesn't do kvm guest address decoding for the fast path. It's
still done by ioeventfd.
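
(Again for concreteness, a rough sketch of how that fast-path decode is
registered with the kvm VM fd, assuming the KVM_IOEVENTFD interface from
the ioeventfd patches.  The guest's PIO write to the doorbell is matched
and turned into a signal on the eventfd handed to vhost above; vhost
itself never sees guest addresses:)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int register_doorbell(int vm_fd, int kick_fd,
                             __u64 pio_addr, __u64 queue)
{
        struct kvm_ioeventfd ioev = {
                .datamatch = queue,       /* which virtqueue was kicked */
                .addr      = pio_addr,    /* PIO address assigned by qemu */
                .len       = 2,
                .fd        = kick_fd,
                .flags     = KVM_IOEVENTFD_FLAG_PIO |
                             KVM_IOEVENTFD_FLAG_DATAMATCH,
        };

        return ioctl(vm_fd, KVM_IOEVENTFD, &ioev);
}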
>> The guest
>> simply has no access to any vhost resources other than the guest->host
>> doorbell, which is handed to the guest outside vhost (so it's somebody
>> else's problem, in userspace).
>>
> You mean _controlled_ by userspace, right? Obviously, the other side of
> the kernel still needs to be programmed (ioeventfd, etc). Otherwise,
> vhost would be pointless: e.g. just use vanilla tuntap if you don't need
> fast in-kernel decoding.
>
Yes (though for something like level-triggered interrupts we're probably
keeping it in userspace, enjoying the benefits of the vhost data path while
paying more for signalling).
>>> All that is required is a way to transport a message with a "devid"
>>> attribute as an address (such as DEVCALL(devid)) and the framework
>>> provides the rest of the decode+execute function.
>>>
>>>
>> vhost avoids that.
>>
> No, it doesn't avoid it. It just doesn't specify how it's done, and
> relies on something else to do it on its behalf.
>
That someone else can be in userspace, apart from the actual fast path.
> Conversely, vbus specifies how it's done, but not how to transport the
> verb "across the wire". That is the role of the vbus-connector abstraction.
>
So again, vbus does everything in the kernel (since it's so easy and
cheap) but expects a vbus-connector. vhost does configuration in
userspace (since it's so clunky and fragile) but expects a couple of
eventfds.
>>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>>>
>>>
>> It's the wrong name. vhost implements only the data path.
>>
> Understood, but vhost+virtio-pci is what I am contrasting, and I use
> "vhost" for short from that point on because I am too lazy to type the
> whole name over and over ;)
>
If you #define A A+B+C, don't expect intelligent conversation afterwards.
>>> It is not immune to requiring in-kernel addressing support either, but
>>> rather it just does it differently (and it's not, as you might expect,
>>> via qemu).
>>>
>>> Vhost relies on QEMU to render PCI objects to the guest, to which the
>>> guest assigns resources (such as BARs, interrupts, etc).
>>>
>> vhost does not rely on qemu. It relies on its user to handle
>> configuration. In one important case it's qemu+pci. It could just as
>> well be the lguest launcher.
>>
> I meant vhost=vhost+virtio-pci here. Sorry for the confusion.
>
> The point I am making specifically is that vhost in general relies on
> other in-kernel components to function. I.e. it cannot function without
> having something like the PCI model to build an IO namespace. That
> namespace (in this case, pio address+data tuples) is used for the
> in-kernel addressing function under KVM + virtio-pci.
>
> The case of the lguest launcher is a good one to highlight. Yes, you
> can presumably also use lguest with vhost, if the requisite facilities
> are exposed to lguest-bus, and some eventfd-based thing like ioeventfd
> is written for the host (if it doesn't exist already).
>
> And when the next virt design "foo" comes out, it can make a "foo-bus"
> model, and implement foo-eventfd on the backend, etc, etc.
>
It's exactly the same with vbus needing additional connectors for
additional transports.
> Ira can make ira-bus, and ira-eventfd, etc, etc.
>
> Each iteration will invariably introduce duplicated parts of the stack.
>
Invariably? Use libraries (virtio-shmem.ko, libvhost.so).
>> For the N+1th time, no. vhost is perfectly usable without pci. Can we
>> stop raising and debunking this point?
>>
> Again, I understand vhost is decoupled from PCI, and I don't mean to
> imply anything different. I use PCI as an example here because a) it's
> the only working example of vhost today (to my knowledge), and b) you
> have stated in the past that PCI is the only "right" way here, to
> paraphrase. Perhaps you no longer feel that way, so I apologize if you
> feel you already recanted your position on PCI and I missed it.
>
For kvm/x86 pci definitely remains king. I was talking about the two
lguest users and Ira.
> I digress. My point here isn't PCI. The point here is the missing
> component for when PCI is not present. The component that is partially
> satisfied by vbus's devid addressing scheme. If you are going to use
> vhost, and you don't have PCI, you've gotta build something to replace it.
>
Yes, that's why people have keyboards. They'll write that glue code if
they need it. If it turns out to be a hit and people start having virtio
transport module writing parties, they'll figure out a way to share code.
>>> All you really need is a simple decode+execute mechanism, and a way to
>>> program it from userspace. vbus tries to do just that:
>>> commoditize it so all you need is the transport of the control messages
>>> (like DEVCALL()), but the decode+execute itself is reusable, even
>>> across various environments (like KVM or Ira's rig).
>>>
>>>
>> If you think it should be "commoditized", write libvhostconfig.so.
>>
> I know you are probably being facetious here, but what do you propose
> for the parts that must be in-kernel?
>
On the guest side, virtio-shmem.ko can unify the ring access. It
probably makes sense even today. On the host side, eventfd is the
kernel interface, and libvhostconfig.so can provide the configuration
when an existing ABI is not imposed.
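
(Nothing below exists today; it's a purely hypothetical sketch of what a
libvhostconfig.so interface could look like, only to illustrate the
split between eventfd as the kernel interface and configuration living
in a userspace library.  Every name here is made up:)

struct vhostcfg;                         /* opaque per-device handle */

struct vhostcfg *vhostcfg_open(const char *dev);           /* e.g. "net" */
int  vhostcfg_set_features(struct vhostcfg *c, unsigned long long features);
int  vhostcfg_set_ring(struct vhostcfg *c, int index,
                       unsigned long long desc, unsigned long long avail,
                       unsigned long long used);
int  vhostcfg_kick_fd(struct vhostcfg *c, int index);       /* doorbell eventfd */
int  vhostcfg_call_fd(struct vhostcfg *c, int index);       /* interrupt eventfd */
void vhostcfg_close(struct vhostcfg *c);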
>>> And your argument, I believe, is that vbus allows both to be implemented
>>> in the kernel (though to reiterate, it's optional) and is therefore a bad
>>> design, so let's discuss that.
>>>
>>> I believe the assertion is that things like config-space are best left
>>> to userspace, and we should only relegate fast-path duties to the
>>> kernel. The problem is that, in my experience, a good deal of
>>> config-space actually influences the fast-path and thus needs to
>>> interact with the fast-path mechanism eventually anyway.
>>> What's left
>>> over that doesn't fall into this category may cheaply ride on existing
>>> plumbing, so it's not like we created something new or unnatural just to
>>> support this subclass of config-space.
>>>
>>>
>> Flexibility is reduced, because changing code in the kernel is more
>> expensive than in userspace, and kernel/user interfaces aren't typically
>> as wide as pure userspace interfaces. Security is reduced, since a bug
>> in the kernel affects the host, while a bug in userspace affects just one
>> guest.
>>
> For a mac-address attribute? That's all we are really talking about
> here. These points you raise, while true of any kernel code I suppose,
> are a bit of a stretch in this context.
>
Look at the virtio-net feature negotiation. There's a lot more there
than the MAC address, and it's going to grow.
>> Example: feature negotiation. If it happens in userspace, it's easy to
>> limit what features we expose to the guest.
>>
> It's not any harder in the kernel. I do this today.
>
> And when you are done negotiating said features, you will generally have
> to turn around and program the feature into the backend anyway (e.g.
> ioctl() to vhost module). Now you have to maintain some knowledge of
> that particular feature and how to program it in two places.
>
No, you can leave it enabled unconditionally in vhost (the guest won't
use what it doesn't know about).
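
(A hedged sketch of that split, assuming the VHOST_GET_FEATURES /
VHOST_SET_FEATURES ioctls from the patch series: the kernel advertises
everything it implements, userspace applies policy on what the guest
ever gets to see, and only the guest-acked subset is programmed back:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static int negotiate_features(int vhost_fd, uint64_t allowed_by_user,
                              uint64_t acked_by_guest)
{
        uint64_t kernel_features;

        ioctl(vhost_fd, VHOST_GET_FEATURES, &kernel_features);

        /* What the guest was ever shown: kernel capabilities masked by
         * userspace policy (e.g. for migration to an older host). */
        uint64_t exposed = kernel_features & allowed_by_user;

        /* Only features both exposed and acked become active. */
        uint64_t active = exposed & acked_by_guest;
        return ioctl(vhost_fd, VHOST_SET_FEATURES, &active);
}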
> Conversely, I am eliminating the (unnecessary) middleman by letting the
> feature negotiation take place directly between the two entities that
> will consume it.
>
The middleman is necessary if you want to support live migration, or to
restrict a guest to a subset of your features.
>> If it happens in the
>> kernel, we need to add an interface to let the kernel know which
>> features it should expose to the guest.
>>
> You need this either way, for both models. As an added
> bonus, vbus has generalized that interface using sysfs attributes, so
> all models are handled in a similar and community-accepted way.
>
vhost doesn't need it since userspace takes care of it.
>> We also need to add an
>> interface to let userspace know which features were negotiated, if we
>> want to implement live migration. Something fairly trivial bloats rapidly.
>>
> Can you elaborate on the requirements for live-migration? Wouldn't an
> opaque save/restore model work here? (e.g. why does userspace need to be
> able to interpret the in-kernel state, just pass it along as a blob to
> the new instance).
>
A blob would work, if you commit to forward and backward compatibility
on the kernel side (i.e. an older kernel must be able to accept a blob
from a newer one). I don't like blobs, though; they tie you to the
implementation.
>> As you can see above, userspace needs to be involved in this, and the
>> number of interfaces required is smaller if it's in userspace:
>>
> Actually, no. My experience has been the opposite. Anytime I sat down
> and tried to satisfy your request to move things to userspace,
> things got ugly and duplicative really quickly. I suspect part of the
> reason you may think it's easier is that you already have part of
> virtio-net in userspace and its surrounding support, but that is not the
> case moving forward for new device types.
>
I can't comment on your experience, but we'll definitely build on
existing code for new device types.
>> you only
>> need to know which features the kernel supports (they can be enabled
>> unconditionally, just not exposed).
>>
>> Further, some devices are perfectly happy to be implemented in
>> userspace, so we need userspace configuration support anyway. Why
>> reimplement it in the kernel?
>>
> That's fine. vbus is targeted for high-performance IO. So if you have
> a robust userspace (like KVM+QEMU) and low-performance constraints (say,
> for a console or something), put it in userspace and vbus is not
> involved. I don't care.
>
So now the hypothetical non-pci hypervisor needs to support two busses.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.