[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45EF6077.9090302@vmware.com>
Date: Wed, 07 Mar 2007 17:01:43 -0800
From: Daniel Arai <arai@...are.com>
To: tglx@...utronix.de
Cc: Zachary Amsden <zach@...are.com>,
Virtualization Mailing List <virtualization@...ts.osdl.org>,
john stultz <johnstul@...ibm.com>, Ingo Molnar <mingo@...e.hu>,
akpm@...ux-foundation.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Thomas Gleixner wrote:
> You managed to avoid the usage of other code (i.e. PIT / HPET) already,
> so why is it sooo desireable to emulate apics instead of substituting it
> by a small and sane replacement ? Just because you happen to have an
> LAPIC emulator ? That's no reason to wire yourself into the kernel code
> and make it harder to change and maintain.
There are several reasons why it's desirable to emulate the APIC. As you
mentioned, we already have APIC emulation, and APIC emulation isn't a huge
bottleneck on most workloads. Our code works, the Linux code works, and
replacing both pieces of code with something "small and sane" isn't going to
improve performance very much, so why bother? Any hypervisor implementation is
going to be a tradeoff between what's easy to implement in the hypervisor,
what's easy to implement in the guest operating system, and what's performance
critical.
Secondly, not all (para-)virtualized operating systems will want to use
abstracted devices. Some virtual operating systems will be given direct access
to hardware devices, and will need to run the actual driver for that device and
not some abstracted device driver. So I don't buy your argument that every
piece of the kernel that interacts with a paravirtualized driver should have a
"small and sane replacement."
But more importantly, we want a kernel that can run both on native hardware and
in a paravirtualized environment. Linux doesn't really provide abstractions for
replacing the appropriate code. We tried to hook into the source code at a
level that seemed possible.
For example, take smp_call_function(). What this essentially does is call
send_IPI_allbutself().
void fastcall send_IPI_self(int vector)
{
__send_IPI_shortcut(APIC_DEST_SELF, vector);
}
void __send_IPI_shortcut(unsigned int shortcut, int vector)
{
/*
* Subtle. In the case of the 'never do double writes' workaround
* we have to lock out interrupts to be safe. As we don't care
* of the value read we use an atomic rmw access to avoid costly
* cli/sti. Otherwise we use an even cheaper single atomic write
* to the APIC.
*/
unsigned int cfg;
/*
* Wait for idle.
*/
apic_wait_icr_idle();
/*
* No need to touch the target chip field
*/
cfg = __prepare_ICR(shortcut, vector);
/*
* Send the IPI. The write to APIC_ICR fires this off.
*/
apic_write_around(APIC_ICR, cfg);
}
There's no good way to override __send_IPI_shortcut. I suppose we could add
paravirt ops for __send_IPI_shortcut and every other op that touches the APIC.
But there are dozens of functions in apic.c that would need to be included in
paravirt ops. And for our implementation, we really just want to override
apic_read and apic_write, since we can make these faster when done through
hypercalls than through memory accesses. If we were to make these paravirt ops,
their implementations would be the same, except with a different apic_read and
apic_write. This is a whole lot of useless code duplication.
Most of the interrupt system is not written in such a way that multiple APICs
implementations can be selected from at boot time. This is an absolute
requirement so that the same kernel can boot on native and in a paravirtualized
environment. While this could be implemented, it seems like a waste of time,
since we can just emulate something similar to a real interrupt system and not
change things very much.
Dan Arai
VMware, Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists