[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45EDD82F.90204@vmware.com>
Date: Tue, 06 Mar 2007 13:07:59 -0800
From: Dan Hecht <dhecht@...are.com>
To: tglx@...utronix.de
CC: Zachary Amsden <zach@...are.com>, Ingo Molnar <mingo@...e.hu>,
akpm@...ux-foundation.org, ak@...e.de,
Virtualization Mailing List <virtualization@...ts.osdl.org>,
Jeremy Fitzhardinge <jeremy@...p.org>,
Rusty Russell <rusty@...tcorp.com.au>,
LKML <linux-kernel@...r.kernel.org>,
john stultz <johnstul@...ibm.com>,
Dan Hecht <dhecht@...are.com>
Subject: Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
On 03/06/2007 02:59 AM, Thomas Gleixner wrote:
> On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
>>> a proper CE device also has the added bonus of making high-res timers
>>> guests work automatically. It should be simple: just pass it through to
>>> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has
>>> essentially no guest-side complexity.
>>>
>> It is not so simple. In theory it works great. In reality, the i386
>> implementation is completely hardwired to work the way hardware works,
>> and breaking the clockevent code out of the deep ties to the APIC is
>> extremely non-trivial. We tried, and could not accomplish it for 2.6.21
>> because the hrtimers integration was complex, and introduced many bugs
>> for us.
>
> Why is this so non-trivial ? All you have to do is _NOT_ register
> PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
> which uses the hypervisor timer emulation instead of real hardware.
>
> clockevents breaks the hardwired assumptions of the old timer code and
> allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
> stuff like
>
> /* Disable PIT. */
> outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
>
Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and
we still need to stop it.
In any case, clockevents doesn't really make it easier nor harder as far
as init goes. In the pre-clockevent days, we replace setup_pit_timer,
setup_boot_clock, setup_secondary_clock. With clockevents, I think the
hook points are the same. Mostly just need to allow the per-cpu
lapic_event to be generalized to local_clock_events that can be set to
whatever device we want. The other thing on i386 is just some minor
annoyances due initially setting up only the PIT on cpu0 on irq 0 and
then later setting up per-cpu timer on lvtt, and making this all place
nice with paravirt timers. But these are just details and just require
some minor changes and will be working, but it just takes some massaging.
So, that is not the real reason to move over the clockevents. The real
reason is to use the generic interrupt handlers. We understand that,
and will get to that point. In the mean time, we are harming no one.
Our code has zero effect when you booting natively or on a non-VMI
hypervisor.
>> We worked around this by keeping NO_IDLE_HZ support, which now
>> you deprecated. So now we are using NO_HZ without a hyper-CE device,
>> and it is working fine. We understand the benefits of moving to the CE
>> model - but it cannot be done overnight.
>
> This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
> irq_enter() and irq_exit() so the clockevents code is actually invoked.
> I have not looked close enough why this does work at all.
>
I believe this was just a quick fix in response to Ingo breaking the VMI
build yesterday by disabling NO_IDLE_HZ on us. There is no technical
reason why NO_IDLE_HZ=y can't coexist with NO_HZ.
(The two work okay together because when using NO_IDLE_HZ, the hooks are
deeper in a custom safe_halt routine which isn't registered when using
nohz mode at runtime, and conversely, the nohz code is guarded at
runtime by the ts->nohz_mode. So, the two really can co-exist at
compile time).
Again, no one is arguing that we shouldn't move to clockevents, it's
just a matter of time (sorry, no pun intended).
The vmi-time code was introduced to solve some shortcomings of the old
(pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was
especially painful for virtualization. Certainly,
clocksource/clockevents/NO_HZ solves many of the problems (basically,
moving away from counting interrupts to using time sources). e.g. xtime
updating is no longer a worry with the new timeofday/clocksource stuff.
But there are some that may not quite be solved, listed below. (I
know I'm not telling you anything new, but I might as well flesh it out
for the other paravirt folks while the code is fresh in my mind):
1) Stolen time (virtual cpu is ready to run but not running): this is
handled inconsistently between the various clockevent handlers /
CLOCK_EVT_MODE_ONESHOT combinations:
a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you
define "periodic" timer in a paravirtual world. If you do something
like Xen-style where you send periodic events only to running vcpus,
then this handler suffers from some of the same problems as the old i386
timer handler:
- jiffies updated according to the number of interrupts you get, so
falls behind monotonic time. generally, counting timer interrupts is
bad for paravirt.
- process time updated according to the number of interrupts, so
falls behind monotonic time. This is probably okay though, since it is
essentially tracking (mono - stolen) time. I.e. the missing time is stolen.
- jiffies updated only by boot cpu, which is a problem for paravirt
since the boot vcpu can be descheduled while the other vcpus are scheduled.
- can probably just avoid this mode by not advertising PERIODIC
capability by your clockevent device.
b) tick_handle_periodic / CLOCK_EVT_MODE_ONESHOT:
- jiffies updated according to monotonic time since we'll loop to
catch up the oneshot timer.
- process time accounted in monotonic time, for the same reason.
this is probably not what we'd want since it will charge time to
whatever process happened to be scheduled in the guest during periods of
stolen time.
- Same problem about jiffies only updated by one vcpu.
c) tick_nohz_handler (implies CLOCK_EVT_MODE_ONESHOT):
- jiffies updated according to monotonic time. this is good.
- Process time effectively does not count stolen time since
update_process_times is called once per callback but oneshot expiry is
advanced into the future. This is probably the right thing for
paravirt, but, inconsistent with (b).
- jiffies updated by all vcpus, which is good.
d) hrtimer_interrupt (really tick_sched_timer):
- w.r.t. stolen time, will be similar to (c). We'll advance the
sched_timer to the next period in the future, skipping stolen time for
process accounting.
- jiffies will be kept up to date with monotonic time.
- jiffies updated by all vcpus, which is good.
Summary: Cases (c) and (d) should be relatively well behaved for
paravirt. So, if we can depend on NO_HZ=y and not being disabled at
runtime, we should be okay. We may want to have some knowledge of
stolen time passed from the hypervisor (if we wanted more accurate
process time accounting), but this can be put back in later, and isn't
too important with sample based accounting system like Linux. But, we
still need to QA all four cases, and all four cases will likely expose
different bugs due to second order effects.
2) Virtual interrupts have a relatively high overhead as compared with
native interrupts. So, in vmitime, we wanted to be able to lower the
timer interrupt rate at runtime, even if HZ is a compile time constant
(and set to something high, like 1000hz). While we could hack this in
by using evt->min_delta_ns, it wouldn't really work since process time
accounting would be wrong. Instead, we should allow the
tick_sched_timer in cases (c) and (d) to have runtime configurable
period, and then scale the time value accordingly before passing to
account_system_time. This is probably something the Xen folks will want
also, since I think Xen itself only gets 100hz hard timer, and so it can
implement at best a oneshot virtual timer with 100hz resolution. Any
objections to us doing something like this?
3) clockevent set_next_event interface is suboptimal for paravirt (and
probably realtime-ish uses). The problem is that the expiry is passed
as a relative time. On paravirt, an arbitrary amount of (stolen) time
may have passed since the delta was computed and when the timer device
is programmed, causing that next interrupt to be too far out in the
future. It seems a better interface for set_next_event would be to pass
the current time and the absolute expiry. Actually, I sent email to
Thomas and Ingo about this (and some other clockevents/hrtimer feedback)
in July 2006, but never heard back. Thoughts?
I think all the other important points that the vmitime code addressed
are also addressed by clocksource/clockevents/NO_HZ.
Finally, while I agree that writing the clockevent callback code is
trivial, we will hit bugs when moving over. It is something we need to
do, but just takes some time for us to test and shake out the bugs. For
example, we are seeing this bug. It seems to me that tick_sched_timer
should not be running in softirq context, but only from the
hrtimer_interrupt. Is that right? I'm not sure how we get into this case.
Switched to high resolution mode on CPU 0
------------[ cut here ]------------
kernel BUG at
/dhecht/linux/testing/torvalds/linux-2.6.21/kernel/posix-cpu-timers.c:1295!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0062:[<c0137801>] Not tainted VLI
EFLAGS: 00010202 (2.6.21-rc2-smp #29)
EIP is at run_posix_cpu_timers+0x24/0x6a7
eax: 00000200 ebx: c1405be0 ecx: c1405102 edx: c14051d4
esi: c1405ca0 edi: 77ad5c64 ebp: dfed5a50 esp: dfe81db4
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 006a
Process swapper (pid: 1, ti=dfe80000 task=dfed5a50 task.ti=dfe80000)
Stack: 00000078 c039e760 0000007d c010a885 28f8c4ec 00000000 dfed5a50
c14051c0
dfe81df8 c0122ac3 c14051cc 00000003 00000008 c14051d4 dfed5a50
00000000
dfe81df4 dfe81df4 c1405be0 c1405ca0 77ad5c64 00000000 c013cd3a
c1405c18
Call Trace:
[<c010a885>] sched_clock+0x3d/0x69
[<c0122ac3>] scheduler_tick+0x52/0xc9
[<c013cd3a>] tick_sched_timer+0x57/0x9d
[<c013cce3>] tick_sched_timer+0x0/0x9d
[<c013936a>] hrtimer_run_queues+0x1c3/0x21d
[<c012d69b>] run_timer_softirq+0x21/0x177
[<c012a0bd>] __do_softirq+0x66/0xca
[<c012a164>] do_softirq+0x43/0x51
[<c012a26b>] irq_exit+0x38/0x6b
[<c0107ab5>] do_IRQ+0x82/0x99
[<c025cf44>] serial8250_console_write+0x129/0x136
[<c025ec10>] serial8250_console_putchar+0x0/0x78
[<c0105d13>] common_interrupt+0x23/0x30
[<c01267f9>] vprintk+0x29b/0x2ca
[<c02056a8>] idr_get_new_above_int+0x13c/0x216
[<c0126843>] printk+0x1b/0x1f
[<c03ebf31>] audit_init+0x26/0x101
[<c03ebe7e>] ikconfig_init+0x14/0x37
[<c03da92d>] init+0x145/0x23e
[<c0104c06>] ret_from_fork+0x6/0x20
[<c0104d69>] syscall_exit+0x5/0x18
[<c03da7e8>] init+0x0/0x23e
[<c03da7e8>] init+0x0/0x23e
[<c0105fe7>] kernel_thread_helper+0x7/0x10
=======================
Code: e9 b0 fe ff ff 5b c3 55 57 56 53 83 ec 48 89 c5 8d 44 24 40 89 44
24 40 89 44 24 44 e8 19 9a ec 3b 90 8d 74 26 00 f6 c4 02 74 04 <0f> 0b
eb fe 8b 95 24 01 00 00 85 d2 74 10 8b 85 08 01 00 00 03
EIP: [<c0137801>] run_posix_cpu_timers+0x24/0x6a7 SS:ESP 006a:dfe81db4
Kernel panic - not syncing: Fatal exception in interrupt
thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists