netdev - Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

Date:	Fri, 25 Sep 2009 17:32:19 -0400
From:	Gregory Haskins <gregory.haskins@...il.com>
To:	Avi Kivity <avi@...hat.com>
CC:	"Ira W. Snyder" <iws@...o.caltech.edu>,
	"Michael S. Tsirkin" <mst@...hat.com>, netdev@...r.kernel.org,
	virtualization@...ts.linux-foundation.org, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org, mingo@...e.hu, linux-mm@...ck.org,
	akpm@...ux-foundation.org, hpa@...or.com,
	Rusty Russell <rusty@...tcorp.com.au>, s.hetze@...ux-ag.com,
	alacrityvm-devel@...ts.sourceforge.net
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

Avi Kivity wrote:
> On 09/24/2009 09:03 PM, Gregory Haskins wrote:
>>
>>> I don't really see how vhost and vbus are different here.  vhost expects
>>> signalling to happen through a couple of eventfds and requires someone
>>> to supply them and implement kernel support (if needed).  vbus requires
>>> someone to write a connector to provide the signalling implementation.
>>> Neither will work out-of-the-box when implementing virtio-net over
>>> falling dominos, for example.
>>>      
>> I realize in retrospect that my choice of words above implies vbus _is_
>> complete, but this is not what I was saying.  What I was trying to
>> convey is that vbus is _more_ complete.  Yes, in either case some kind
>> of glue needs to be written.  The difference is that vbus implements
>> more of the glue generally, and leaves less required to be customized
>> for each iteration.
>>    
> 
> 
> No argument there.  Since you care about non-virt scenarios and virtio
> doesn't, naturally vbus is a better fit for them as the code stands.

Thanks for finally starting to acknowledge there's a benefit, at least.

To be more precise, IMO virtio is designed to be a performance oriented
ring-based driver interface that supports all types of hypervisors (e.g.
shmem based kvm, and non-shmem based Xen).  vbus is designed to be a
high-performance generic shared-memory interconnect (for rings or
otherwise) framework for environments where linux is the underpinning
"host" (physical or virtual).  They are distinctly different, but
complementary (the former addresses the part of the front-end, and
latter addresses the back-end, and a different part of the front-end).

In addition, the kvm-connector used in AlacrityVM's design strives to
add value and improve performance via other mechanisms, such as dynamic
allocation, interrupt coalescing (thus reducing exit-ratio, which is a
serious issue in KVM) and prioritizable/nestable signals.

Today there is a large performance disparity between what a KVM guest
sees and what a native linux application sees on that same host.  Just
take a look at some of my graphs between "virtio", and "native", for
example:

http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png

A dominant vbus design principle is to try to achieve the same IO
performance for all "linux applications" whether they be literally
userspace applications, or things like KVM vcpus or Ira's physical
boards.  It also aims to solve problems not previously expressible with
current technologies (even virtio), like nested real-time.

And even though you repeatedly insist otherwise, the neat thing here is
that the two technologies mesh (at least under certain circumstances,
like when virtio is deployed on a shared-memory friendly linux backend
like KVM).  I hope that my stack diagram below depicts that clearly.


> But that's not a strong argument for vbus; instead of adding vbus you
> could make virtio more friendly to non-virt

Actually, it _is_ a strong argument then because adding vbus is what
helps make virtio friendly to non-virt, at least for when performance
matters.

> (there's a limit how far you
> can take this, not imposed by the code, but by virtio's charter as a
> virtual device driver framework).
> 
>> Going back to our stack diagrams, you could think of a vhost solution
>> like this:
>>
>> --------------------------
>> | virtio-net
>> --------------------------
>> | virtio-ring
>> --------------------------
>> | virtio-bus
>> --------------------------
>> | ? undefined-1 ?
>> --------------------------
>> | vhost
>> --------------------------
>>
>> and you could think of a vbus solution like this
>>
>> --------------------------
>> | virtio-net
>> --------------------------
>> | virtio-ring
>> --------------------------
>> | virtio-bus
>> --------------------------
>> | bus-interface
>> --------------------------
>> | ? undefined-2 ?
>> --------------------------
>> | bus-model
>> --------------------------
>> | virtio-net-device (vhost ported to vbus model? :)
>> --------------------------
>>
>>
>> So the difference between vhost and vbus in this particular context is
>> that you need to have "undefined-1" do device discovery/hotswap,
>> config-space, address-decode/isolation, signal-path routing, memory-path
>> routing, etc.  Today this function is filled by things like virtio-pci,
>> pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
>> lguest, but presumably it is filled there by components like
>> virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
>> more contemporary examples, we might have virtio-domino, domino-bus,
>> domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
>> and ira-launcher.
>>
>> Contrast this to the vbus stack:  The bus-X components (when optionally
>> employed by the connector designer) do device-discovery, hotswap,
>> config-space, address-decode/isolation, signal-path and memory-path
>> routing, etc in a general (and pv-centric) way. The "undefined-2"
>> portion is the "connector", and just needs to convey messages like
>> "DEVCALL" and "SHMSIGNAL".  The rest is handled in other parts of the
>> stack.
>>
>>    
> 
> Right.  virtio assumes that it's in a virt scenario and that the guest
> architecture already has enumeration and hotplug mechanisms which it
> would prefer to use.  That happens to be the case for kvm/x86.

No, virtio doesn't assume that.  Its stack provides the "virtio-bus"
abstraction and what it does assume is that it will be wired up to
something underneath. Kvm/x86 conveniently has pci, so the virtio-pci
adapter was created to reuse much of that facility.  For other things
like lguest and s390, something new had to be created underneath to make
up for the lack of pci-like support.

vbus, in conjunction with the kvm-connector, tries to unify that process
a little more by creating a PV-optimized bus.  The idea is that it can
be reused in that situation instead of creating a new hypervisor
specific bus each time.  It's also designed for high-performance, so you
get that important trait for free simply by tying into it.

> 
>> So to answer your question, the difference is that the part that has to
>> be customized in vbus should be a fraction of what needs to be
>> customized with vhost because it defines more of the stack.
> 
> But if you want to use the native mechanisms, vbus doesn't have any
> added value.

First of all, that's incorrect.  If you want to use the "native"
mechanisms (via the way the vbus-connector is implemented, for instance)
you at least still have the benefit that the backend design is more
broadly re-useable in more environments (like non-virt, for instance),
because vbus does a proper job of defining the requisite
layers/abstractions compared to vhost.  So it adds value even in that
situation.

Second of all, with PV there is no such thing as "native".  It's
software so it can be whatever we want.  Sure, you could argue that the
guest may have built-in support for something like PCI protocol.
However, PCI protocol itself isn't suitable for high-performance PV out
of the can.  So you will therefore invariably require new software
layers on top anyway, even if part of the support is already included.

And lastly, why would you _need_ to use the so called "native"
mechanism?  The short answer is, "you don't".  Any given system (guest
or bare-metal) already has a wide range of buses (try running "tree
/sys/bus" in Linux).  More importantly, the concept of adding new buses
is widely supported in both the Windows and Linux driver model (and
probably any other guest-type that matters).  Therefore, despite claims
to the contrary, it's not hard or even unusual to add a new bus to the mix.
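
For anyone who would rather script the "tree /sys/bus" check, here is a
minimal sketch.  It assumes a Linux system with sysfs mounted at /sys;
the helper name list_buses is made up for illustration.

```python
import os


def list_buses(sysfs_bus_root="/sys/bus"):
    """Return the sorted names of the buses registered under sysfs_bus_root."""
    if not os.path.isdir(sysfs_bus_root):
        return []  # not Linux, or sysfs is not mounted here
    return sorted(
        name
        for name in os.listdir(sysfs_bus_root)
        if os.path.isdir(os.path.join(sysfs_bus_root, name))
    )


if __name__ == "__main__":
    # Each entry printed here is a bus the running kernel already supports.
    for bus in list_buses():
        print(bus)
```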

In summary, vbus is simply one more bus of many, purpose built to
support high-end IO in a virt-like model, giving controlled access to
the linux-host underneath it.  You can write a high-performance layer
below the OS bus-model (vbus), or above it (virtio-pci) but either way
you are modifying the stack to add these capabilities, so we might as
well try to get this right.

With all due respect, you are making a big deal out of a minor issue.

> 
>> And, as
>> alluded to in my diagram, both virtio-net and vhost (with some
>> modifications to fit into the vbus framework) are potentially
>> complementary, not competitors.
>>    
> 
> Only theoretically.  The existing installed base would have to be thrown
> away

"Thrown away" is pure hyperbole.  The installed base, worst case, needs
to load a new driver for a missing device.  This is pretty much how
every machine works today, anyway.  And if loading a driver was actually
some insurmountable hurdle, as it's sometimes implied (but it's not in
reality), you can alternatively make vbus look like a legacy bus if you
are willing to sacrifice some of the features, like exit-ratio reduction and
priority.

FWIW: AlacrityVM isn't willing to sacrifice those features, so we will
provide a Linux and Windows driver for explicit bus support, as well as
open-specs and community development assistance to any other guest that
wants to add support in the future.

> or we'd need to support both.
> 
>

No matter what model we talk about, there's always going to be a "both"
since the userspace virtio models are probably not going to go away (nor
should they).

> 
> 
>>> Without a vbus-connector-falling-dominos, vbus-venet can't do anything
>>> either.
>>>      
>> Mostly covered above...
>>
>> However, I was addressing your assertion that vhost somehow magically
>> accomplishes this "container/addressing" function without any specific
>> kernel support.  This is incorrect.  I contend that this kernel support
>> is required and present.  The difference is that it's defined elsewhere
>> (and typically in a transport/arch specific way).
>>
>> IOW: You can basically think of the programmed PIO addresses as forming
>> its "container".  Only addresses explicitly added are visible, and
>> everything else is inaccessible.  This whole discussion is merely a
>> question of what's been generalized versus what needs to be
>> re-implemented each time.
>>    
> 
> Sorry, this is too abstract for me.

With all due respect, understanding my point above is required to have
any kind of meaningful discussion here.

> 
> 
> 
>>> vbus doesn't do kvm guest address decoding for the fast path.  It's
>>> still done by ioeventfd.
>>>      
>> That is not correct.  vbus does its own native address decoding in the
>> fast path, such as here:
>>
>> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/client.c;h=e85b2d92d629734866496b67455dd307486e394a;hb=e6cbd4d1decca8e829db3b2b9b6ec65330b379e9#l331
>>
>>
>>    
> 
> All this is after kvm has decoded that vbus is addressed.  It can't work
> without someone outside vbus deciding that.

How the connector message is delivered is really not relevant.  Some
architectures will simply deliver the message point-to-point (like the
original hypercall design for KVM, or something like Ira's rig), and
some will need additional demuxing (like pci-bridge/pio based KVM).
It's an implementation detail of the connector.

However, the real point here is that something needs to establish a
scoped namespace mechanism, add items to that namespace, and advertise
the presence of the items to the guest.  vbus has this facility built in
to its stack.  vhost doesn't, so it must come from elsewhere.
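
As a rough illustration of what "scoped namespace" means here, the
sketch below models one container per guest.  It is not vbus code: the
Container class, its methods, and the callback standing in for guest
notification are all hypothetical.

```python
class Container:
    """Hypothetical per-guest namespace; not the vbus implementation."""

    def __init__(self):
        self._devices = {}    # devid -> device object
        self._next_id = 0
        self._listeners = []  # callbacks standing in for guest notification

    def subscribe(self, callback):
        """Register a callback invoked as callback(event, devid)."""
        self._listeners.append(callback)

    def add(self, device):
        """Add a device to this namespace and advertise its presence."""
        devid = self._next_id
        self._next_id += 1
        self._devices[devid] = device
        for notify in self._listeners:
            notify("add", devid)
        return devid

    def remove(self, devid):
        """Drop a device from this namespace and advertise its removal."""
        del self._devices[devid]
        for notify in self._listeners:
            notify("remove", devid)

    def lookup(self, devid):
        """Resolve an id; anything not explicitly added is inaccessible."""
        try:
            return self._devices[devid]
        except KeyError:
            raise LookupError(
                f"device {devid} is not visible in this container"
            ) from None
```

A device added to one guest's container resolves there and nowhere
else; a second Container instance raises LookupError for the same id.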


> 
>> In fact, it's actually a simpler design to unify things this way because
>> you avoid splitting the device model up. Consider how painful the vhost
>> implementation would be if it didn't already have the userspace
>> virtio-net to fall-back on.  This is effectively what we face for new
>> devices going forward if that model is to persist.
>>    
> 
> 
> It doesn't have just virtio-net, it has userspace-based hotplug

vbus has hotplug too: mkdir and rmdir

As an added bonus, its device-model is modular.  A developer can write a
new device model, compile it, insmod it to the host kernel, hotplug it
to the running guest with mkdir/ln, and then come back out again
(hotunplug with rmdir, rmmod, etc).  They may do this all without taking
the guest down, and while eating QEMU based IO solutions for breakfast
performance wise.

Afaict, qemu can't do either of those things.

> and a bunch of other devices implemented in userspace.

That's fine.  I am primarily interested in the high-performance
components, so most of those other items can stay there in userspace if
that is their ideal location.

>  Currently qemu has
> virtio bindings for pci and syborg (whatever that is), and device models
> for balloon, block, net, and console, so it seems implementing device
> support in userspace is not as disastrous as you make it to be.

I intentionally qualified "device" with "new" in my statement.  And in
that context I was talking about ultimately developing/supporting
in-kernel models, not pure legacy userspace ones.  I have no doubt the
implementation of the original userspace devices was not a difficult or
horrific endeavor.

Requiring new models to be implemented (at least) twice is a poor design
IMO, however.  Requiring them to split such a minor portion of their
functionality (like read-only attributes) is a poor design, too.  I have
already demonstrated there are other ways to achieve the same
high-performance goals without requiring two models developed/tested
each time and for each manager.  For the times I went and tried to
satisfy your request in this manner, developing the code and managing
the resources in two places, for lack of a better description, made me
want to retch. So I gave up, resolved that my original design was
better, and hoped that I could convince you and the community of the same.

> 
>>> Invariably?
>>>      
>> As in "always"
>>    
> 
> Refactor instead of duplicating.

There is no duplicating.  vbus has no equivalent today as virtio doesn't
define these layers.

> 
>>   
>>>   Use libraries (virtio-shmem.ko, libvhost.so).
>>>      
>> What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
>> don't need libvhost.so per se since you can just use standard kernel
>> interfaces (like configfs/sysfs).  I could create an .so going forward
>> for the new ioctl-based interface, I suppose.
>>    
> 
> Refactor instead of rewriting.

There is no rewriting.  vbus has no equivalent today as virtio doesn't
define these layers.

By your own admission, you said if you wanted that capability, use a
library.  What I think you are not understanding is vbus _is_ that
library.  So what is the problem, exactly?

> 
> 
> 
>>> For kvm/x86 pci definitely remains king.
>>>      
>> For full virtualization, sure.  I agree.  However, we are talking about
>> PV here.  For PV, PCI is not a requirement and is a technical dead-end
>> IMO.
>>
>> KVM seems to be the only virt solution that thinks otherwise (*), but I
>> believe that is primarily a condition of its maturity.  I aim to help
>> advance things here.
>>
>> (*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
>> vmi-esque thing (I forget what it's called) to name a few.  Love 'em or
>> hate 'em, most other hypervisors do something along these lines.  I'd
>> like to try to create one for KVM, but to unify them all (at least for
>> the Linux-based host designs).
>>    
> 
> VMware are throwing VMI away (won't be supported in their new product,
> and they've sent a patch to rip it off from Linux);

vmware only cares about x86 iiuc, so probably not a good example.

> Xen has to tunnel
> xenbus in pci for full virtualization (which is where Windows is, and
> where Linux will be too once people realize it's faster).  lguest is
> meant as an example hypervisor, not an attempt to take over the world.

So pick any other hypervisor, and the situation is often similar.

> 
> "PCI is a dead end" could not be more wrong, it's what guests support.

It's what _some_ guests support.  Even for the guests that support it,
it's not well designed for PV.  Therefore, you have to do a bunch of
dancing and waste resources on top to squeeze every last drop of
performance out of your platform.  In addition, it has a bunch of
baggage that goes with it that is not necessary to do the job in a
software environment.  It is therefore burdensome to recreate if you
don't already have something to leverage, like QEMU, just for the sake
of creating the illusion that it's there.

Sounds pretty dead to me, sorry.  We don't need it.

Alternatively, you can just try to set a stake in the ground for looking
forward and fixing those PV-specific problems hopefully once and for
all, like vbus and the kvm-connector try to do.  Sure, there will be
some degree of pain first as we roll out the subsystem and deploy
support, but that's true for lots of things.  It's simply a platform
investment.


> And right now you can have a guest using pci to access a mix of
> userspace-emulated devices, userspace-emulated-but-kernel-accelerated
> virtio devices, and real host devices.  All on one dead-end bus.  Try
> that with vbus.

vbus is not interested in userspace devices.  The charter is to provide
facilities for utilizing the host linux kernel's IO capabilities in the
most efficient, yet safe, manner possible.  Those devices that fit
outside that charter can ride on legacy mechanisms if that suits them best.

> 
> 
>>>> I digress.  My point here isn't PCI.  The point here is the missing
>>>> component for when PCI is not present.  The component that is partially
>>>> satisfied by vbus's devid addressing scheme.  If you are going to use
>>>> vhost, and you don't have PCI, you've gotta build something to replace
>>>> it.
>>>>
>>>>        
>>> Yes, that's why people have keyboards.  They'll write that glue code if
>>> they need it.  If it turns out to be a hit and people start having virtio
>>> transport module writing parties, they'll figure out a way to share
>>> code.
>>>      
>> Sigh...  The party has already started.  I tried to invite you months
>> ago...
>>    
> 
> I've been voting virtio since 2007.

That doesn't have much to do with what's underneath it, since it doesn't
define these layers.  See my stack diagrams for details.

> 
>>> On the guest side, virtio-shmem.ko can unify the ring access.  It
>>> probably makes sense even today.  On the host side eventfd is the
>>> kernel interface and libvhostconfig.so can provide the configuration
>>> when an existing ABI is not imposed.
>>>      
>> That won't cut it.  For one, creating an eventfd is only part of the
>> equation.  I.e. you need to have originate/terminate somewhere
>> interesting (and in-kernel, otherwise use tuntap).
>>    
> 
> vbus needs the same thing so it cancels out.

No, it does not.  vbus just needs a relatively simple single message
pipe between the guest and host (think "hypercall tunnel", if you will).
Per queue/device addressing is handled by the same conceptual namespace
as the one that would trigger eventfds in the model you mention.  And
that namespace is built in to the vbus stack, and objects are registered
automatically as they are created.
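
To make the "single message pipe" idea concrete, here is a minimal
sketch.  Only the message names DEVCALL and SHMSIGNAL come from this
thread; the Connector and CountingDevice classes, and the arguments
each message carries, are invented for illustration and do not reflect
the real connector ABI.

```python
DEVCALL = "DEVCALL"      # synchronous call into a device
SHMSIGNAL = "SHMSIGNAL"  # notification that shared memory changed


class CountingDevice:
    """Toy device used only to make the sketch runnable."""

    def __init__(self):
        self.calls = []
        self.signals = 0

    def call(self, func, payload):
        self.calls.append((func, payload))
        return len(payload)

    def signal(self):
        self.signals += 1


class Connector:
    """Hypothetical single message pipe between guest and host."""

    def __init__(self):
        self._devices = {}
        self._next_id = 0

    def register(self, device):
        """Objects enter the namespace as they are created."""
        devid = self._next_id
        self._next_id += 1
        self._devices[devid] = device
        return devid

    def deliver(self, msg_type, devid, *args):
        """One entry point; per-device addressing is resolved in here."""
        device = self._devices.get(devid)
        if device is None:
            raise LookupError(f"unknown devid {devid}")
        if msg_type == DEVCALL:
            return device.call(*args)
        if msg_type == SHMSIGNAL:
            device.signal()
            return None
        raise ValueError(f"unknown message type {msg_type!r}")
```

The point of the sketch is that the transport only has to carry
deliver() calls; it never needs a per-queue or per-device hook of its
own.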

Contrast that to vhost, which requires some other kernel interface to
exist, and to be managed manually for each object that is created.  Your
libvhostconfig would need to somehow know how to perform this
registration operation, and there would have to be something in the
kernel to receive it, presumably on a per platform basis.  Solving this
problem generally would probably end up looking eerily like vbus,
because that's what vbus does.

> 
>>> Look at the virtio-net feature negotiation.  There's a lot more there
>>> than the MAC address, and it's going to grow.
>>>      
>> Agreed, but note that makes my point.  That feature negotiation almost
>> invariably influences the device-model, not some config-space shim.
>> IOW: terminating config-space at some userspace shim is pointless.  The
>> model ultimately needs the result of whatever transpires during that
>> negotiation anyway.
>>    
> 
> Well, let's see.  Can vbus today:
> 
> - let userspace know which features are available (so it can decide if
> live migration is possible)

yes, it's in sysfs.

> - let userspace limit which features are exposed to the guest (so it can
> make live migration possible among hosts of different capabilities)

yes, it's in sysfs.

> - let userspace know which features were negotiated (so it can transfer
> them to the other host during live migration)

no, but we can easily add ->save()/->restore() to the model going
forward, and the negotiated features are just a subcomponent of its
serialized stream.

> - let userspace tell the kernel which features were negotiated (when
> live migration completes, to avoid requiring the guest to re-negotiate)

that would be the function of the ->restore() deserializer.
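
A rough sketch of how such a ->save()/->restore() pair could carry the
negotiated features across a migration.  Nothing below exists in vbus
today: the DeviceModel class, the JSON stream format, and the feature
names used in the usage note are assumptions made purely for
illustration.

```python
import json


class DeviceModel:
    """Hypothetical device model with save/restore hooks."""

    def __init__(self, offered):
        self.offered = frozenset(offered)  # what this host can expose
        self.negotiated = frozenset()      # what the guest accepted

    def negotiate(self, guest_features):
        """Keep only the features both sides support."""
        self.negotiated = self.offered & frozenset(guest_features)
        return self.negotiated

    def save(self):
        """Serialize state; negotiated features are one part of the stream."""
        state = {"negotiated": sorted(self.negotiated)}
        return json.dumps(state).encode("utf-8")

    def restore(self, blob):
        """Deserialize on the destination without guest re-negotiation."""
        wanted = frozenset(json.loads(blob.decode("utf-8"))["negotiated"])
        missing = wanted - self.offered
        if missing:
            raise ValueError(f"destination lacks features: {sorted(missing)}")
        self.negotiated = wanted
```

On the source host, negotiate() runs once with the guest and save()
produces the stream; on the destination, restore() either installs the
same feature set or refuses the migration if the host cannot offer it.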

> - do all that from an unprivileged process

yes, in the upcoming alacrityvm v0.3 with the ioctl based control plane.

> - securely wrt other unprivileged processes

yes, same mechanism plus it has a fork-inheritance model.

Bottom line: vbus isn't done, especially w.r.t. live-migration, but that
is not a valid argument against the idea if you believe in
release-early/release-often. kvm wasn't (isn't) done either when it was
proposed/merged.

Kind Regards,
-Greg

