Message-ID: <49AB9336.7010103@goop.org>
Date:	Mon, 02 Mar 2009 00:05:10 -0800
From:	Jeremy Fitzhardinge <jeremy@...p.org>
To:	Nick Piggin <nickpiggin@...oo.com.au>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	the arch/x86 maintainers <x86@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Xen-devel <xen-devel@...ts.xensource.com>
Subject: Re: [PATCH] xen: core dom0 support

Nick Piggin wrote:
>> Those would be pertinent questions if I were suddenly popping up and
>> saying "hey, let's add Xen support to the kernel!"  But Xen support has
>> been in the kernel for well over a year now, and is widely used, enabled
>> in distros, etc.  The patches I'm proposing here are not a whole new
>> thing, they're part of the last 10% to fill out the kernel's support to
>> make it actually useful.
>>     
>
> As a guest, I guess it has been agreed that guest support for all
> different hypervisors is "a good thing". dom0 is more like a piece
> of the hypervisor itself, right?
>   

Hm, I wouldn't put it like that.  dom0 is no more part of the hypervisor 
than the hypervisor is part of dom0.  The hypervisor provides one set of 
services (domain isolation and multiplexing).  Domains with direct 
hardware access and drivers provide arbitration for virtualized device 
access.  They provide orthogonal sets of functionality, both of which are 
required to get a working system.

Also, the machinery needed to allow a kernel to operate as dom0 goes beyond 
dom0 itself: it allows direct access to hardware in general.  An otherwise 
unprivileged domU can be given access to a specific PCI device via 
PCI-passthrough so that it can drive it directly.  This is often used 
for direct access to 3D hardware, or high-performance networking (especially 
with multi-context hardware that's designed for virtualization use).
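
(As a concrete illustration, not part of this patch series: with the 
xm/xend toolstack the guest side of that is a single line in the domain 
config file.  The PCI address below is just a placeholder, and the 
device first has to be hidden from dom0 and bound to pciback, which 
depends on the dom0 kernel in use.)

    pci = [ '0000:03:00.0' ]    # hand this device to the guest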

>> Because Xen is dedicated to just running virtual machines, its internal
>> architecture can be more heavily oriented towards that task, which
>> affects everything from how its scheduler works to its use and multiplexing of
>> physical memory.  For example, Xen manages to use new hardware
>> virtualization features pretty quickly, partly because it doesn't need
>> to trade-off against normal kernel functions.  The clear distinction
>> between the privileged hypervisor and the rest of the domains makes the
>> security people happy as well.  Also, because Xen is small and fairly
>> self-contained, there's quite a few hardware vendors shipping it burned
>> into the firmware so that it really is the first thing to boot (many of
>> the instant-on features that laptops have are based on Xen).  Both HP and
>> Dell, at least, are selling servers with Xen pre-installed in the firmware.
>>     
>
> That would kind of seem like Xen has a better design to me, OTOH if it
> needs this dom0 for most device drivers and things, then how much
> difference is it really? Is KVM really disadvantaged by being a part of
> the kernel?
>   

Well, you can lump everything together in dom0 if you want, and that is 
a common way to run a Xen system.  But there's no reason you can't 
disaggregate drivers into their own domains, each with the 
responsibility for a particular device or set of devices (or indeed, any 
other service you want provided).  Xen can use hardware features like 
VT-d to really enforce the partitioning so that the domains can't 
program their hardware to touch anything except what they're allowed to 
touch, so nothing is trusted beyond its actual area of responsibility.  
It also means that killing off and restarting a driver domain is a 
fairly lightweight and straightforward operation because the state is 
isolated and self-contained; guests using a device have to be able to 
deal with a disconnect/reconnect anyway (for migration), so it doesn't 
affect them much.  Part of the reason there's a lot of academic interest 
in Xen is because it has the architectural flexibility to try out lots 
of different configurations.
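
As a rough illustration of the disconnect/reconnect handling mentioned 
above (a sketch, not anything from this patch series, and the function 
itself is made up): a frontend just watches the backend's xenbus state 
and renegotiates when the backend comes back.  The XenbusState values, 
struct xenbus_device and xenbus_switch_state() are the real interfaces 
from the xenbus headers; the body is grossly simplified.

#include <xen/xenbus.h>

/* Called whenever the backend's xenbus state changes. */
static void frontend_backend_changed(struct xenbus_device *dev,
                                     enum xenbus_state backend_state)
{
        switch (backend_state) {
        case XenbusStateInitWait:
        case XenbusStateInitialised:
                /* Backend (re)appeared, e.g. after a driver-domain
                 * restart or a migration: this is where the frontend
                 * would re-grant its shared rings and event channel,
                 * then advertise itself as connected again. */
                xenbus_switch_state(dev, XenbusStateConnected);
                break;
        case XenbusStateClosing:
        case XenbusStateClosed:
                /* Backend going away: quiesce our end and wait for it
                 * to come back rather than failing the whole device. */
                break;
        default:
                break;
        }
}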

I wouldn't say that KVM is necessarily disadvantaged by its design; it's 
just a particular set of tradeoffs made up-front.  It loses Xen's 
flexibility, but the result is very familiar to Linux people.  A guest 
domain just looks like a qemu process that happens to run in a strange 
processor mode a lot of the time.  The qemu process provides virtual 
device access to its domain, and accesses the normal device drivers like 
any other usermode process would.  The domains are as isolated from each 
other as much as processes normally are, but they're all floating around 
in the same kernel; whether that provides enough isolation for whatever 
technical, billing, security, compliance/regulatory or other 
requirements you have is up to the user to judge.

>> One important area of paravirtualization is that Xen guests directly
>> use the processor's pagetables; there is no shadow pagetable or use of
>> hardware pagetable nesting.  This means that a tlb miss is just a tlb
>> miss, and happens at full processor performance.  This is possible
>> because 1) pagetables are always read-only to the guest, and 2) the
>> guest is responsible for looking up in a table to map guest-local pfns
>> into machine-wide mfns before installing them in a pte.  Xen will check
>> that any new mapping or pagetable satisfies all the rules, by checking
>> that the writable reference count is 0, and that the domain owns (or has
>> been allowed access to) any mfn it tries to install in a pagetable.
>>     
>
> Xen's memory virtualization is pretty neat, I'll give it that. Is it
> faster than KVM on a modern CPU?

It really depends on the workload.  There are three cases to consider: 
software shadow pagetables, hardware nested pagetables, and Xen direct 
pagetables.  Even now, Xen's (highly optimised) shadow pagetable code 
generally out-performs modern nested pagetables, at least when running 
Windows (for which that code was most heavily tuned).  Shadow pagetables 
and nested pagetables will generally outperform direct pagetables when 
the workload does lots of pagetable updates compared to accesses.  (I 
don't know what the current state of kvm's shadow pagetable performance 
is, but it seems OK.)

But if the workload mostly reads through the pagetables rather than 
updating them, direct pagetables still win.  A direct-pagetable tlb 
miss takes 4 memory accesses, whereas a nested-pagetable tlb miss 
needs 24: each of the 4 guest levels sits at a guest-physical address 
that needs a 4-level host walk plus the read of the guest entry 
(4 x 5 = 20), and translating the final guest-physical address costs 
another 4.  And a nested tlb hit means that you have 24 tlb entries 
being tied up to service the hit, vs 4.  (Though the chip vendors are 
fairly secretive about exactly how they structure their tlbs to deal 
with nested lookups, so I may be off here.)  (It also depends on 
whether you arrange to put the guest, host or both memory into large 
pages; doing so helps a lot.)
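
To make the pfn/mfn rules quoted above concrete, here's a rough sketch 
of a pte update as seen from inside a PV guest.  It's illustrative only 
and not taken from this patch series; it uses the pvops helpers 
pfn_to_mfn() and mfn_pte() and the update_va_mapping hypercall, and 
leaves out all error handling.

#include <asm/xen/page.h>       /* pfn_to_mfn(), mfn_pte() */
#include <asm/xen/hypercall.h>  /* HYPERVISOR_update_va_mapping() */
#include <xen/interface/xen.h>  /* UVMF_INVLPG */

static int map_one_page(unsigned long va, unsigned long pfn, pgprot_t prot)
{
        /* Translate the guest-local pfn into the machine-wide mfn
         * using the lookup table mentioned above. */
        unsigned long mfn = pfn_to_mfn(pfn);

        /* Build a pte containing the machine address... */
        pte_t pte = mfn_pte(mfn, prot);

        /* ...and ask Xen to install it (flushing the old tlb entry),
         * since the pagetables themselves are read-only to us.  Xen
         * checks that the mfn belongs to, or has been granted to, this
         * domain and that the writable-reference rules still hold. */
        return HYPERVISOR_update_va_mapping(va, pte, UVMF_INVLPG);
}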

>  Would it be possible I wonder to make
> a MMU virtualization layer for CPUs without support, using Xen's page
> table protection methods, and have KVM use that? Or does that amount
> to putting a significant amount of Xen hypervisor into the kernel..?
>   

At one point Avi was considering doing it, but I don't think he ever 
made any real effort in that direction.  KVM is pretty wedded to having 
hardware support anyway, so there's not much point in removing that 
dependency in just this one area.

The Xen technique gets its performance from collapsing a level of 
indirection, but that has a cost in terms of flexibility; the hypervisor 
can't do as much mucking around behind the guest's back (for example, 
the guest sees real hardware memory addresses in the form of mfns, so 
Xen can't move pages around, at least not without some form of explicit 
synchronisation).

    J