[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aO2LV-ipewL59LC6@google.com>
Date: Mon, 13 Oct 2025 16:29:27 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: fuqiang wang <fuqiang.wng@...il.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>, Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, yu chen <chen.yu@...ystack.com>,
dongxu zhang <dongxu.zhang@...ystack.com>
Subject: Re: [PATCH RESEND] avoid hv timer fallback to sw timer if delay
exceeds period
On Wed, Oct 01, 2025, fuqiang wang wrote:
> When the guest uses the APIC periodic timer, if the delay exceeds the
> period, the delta will be negative.
IIUC, by "delay" you mean the time it takes for KVM to get (back) to
advance_periodic_target_expiration(). If that's correct, I think it would be
clearer to word this as:
When the guest uses the APIC periodic timer, if the next period has already
expired, e.g. due to the period being smaller than the delay in processing
the timer, the delta will be negative.
> nsec_to_cycles() may then convert this
> delta into an absolute value larger than guest_l1_tsc, resulting in a
> negative tscdeadline. Since the hv timer supports a maximum bit width of
> cpu_preemption_timer_multi + 32, this causes the hv timer setup to fail and
> switch to the sw timer.
>
> Moreover, due to the commit 98c25ead5eda ("KVM: VMX: Move preemption timer
> <=> hrtimer dance to common x86"), if the guest is using the sw timer
> before blocking, it will continue to use the sw timer after being woken up,
> and will not switch back to the hv timer until the relevant APIC timer
> register is reprogrammed. Since the periodic timer does not require
> frequent APIC timer register programming, the guest may continue to use the
> software timer for an extended period.
>
> The reproduction steps and patch verification results at link [1].
>
> [1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test
>
> Fixes: 98c25ead5eda ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86")
I'm pretty sure this should be:
Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
because that's where the bug with tsdeadline (incorrectly) wrapping was introduced.
The aforementioned commit exacerbated (and likely exposed?) the bug, but AFAICT
that commit itself didn't introduce any bugs (related to this issue).
> Signed-off-by: fuqiang wang <fuqiang.wng@...il.com>
> ---
> arch/x86/kvm/lapic.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5fc437341e03..afd349f4d933 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2036,6 +2036,9 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
> u64 tscl = rdtsc();
> ktime_t delta;
>
> + u64 delta_cycles_u;
> + u64 delta_cycles_s;
> +
> /*
> * Synchronize both deadlines to the same time source or
> * differences in the periods (caused by differences in the
> @@ -2047,8 +2050,11 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
> ktime_add_ns(apic->lapic_timer.target_expiration,
> apic->lapic_timer.period);
> delta = ktime_sub(apic->lapic_timer.target_expiration, now);
> + delta_cycles_u = nsec_to_cycles(apic->vcpu, abs(delta));
> + delta_cycles_s = delta > 0 ? delta_cycles_u : -delta_cycles_u;
> +
> apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> - nsec_to_cycles(apic->vcpu, delta);
> + delta_cycles_s;
This isn't quite correct either. E.g. if delta is negative while L1 TSC is tiny,
then subtracting the delta will incorrectly result the deadline wrapping too.
Very, very, theoretically, L1 TSC could even be '0', e.g. due to a weird offset
for L1, so I don't think subtracting is ever safe. Heh, of course we're hosed
in that case no matter what since KVM treats tscdeadline==0 as "not programmed".
Anyways, can't we just skip adding negative value? Whether or not the TSC deadline
has expired is mostly a boolean value; for the vast majority of code it doesn't
matter exactly when the timer expired.
The only code that cares is __kvm_wait_lapic_expire(), and the only downside to
setting tscdeadline=L1.TSC is that adjust_lapic_timer_advance() won't adjust as
aggressively as it probably should.
Ha! That's essentially what update_target_expiration() already does:
now = ktime_get();
remaining = ktime_sub(apic->lapic_timer.target_expiration, now);
if (ktime_to_ns(remaining) < 0)
remaining = 0;
E.g.
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0ae7f913d782..2fb03a8a9ae9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2131,18 +2131,26 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
ktime_t delta;
/*
- * Synchronize both deadlines to the same time source or
- * differences in the periods (caused by differences in the
- * underlying clocks or numerical approximation errors) will
- * cause the two to drift apart over time as the errors
- * accumulate.
+ * Use kernel time as the time source for both deadlines so that they
+ * stay synchronized. Computing each deadline independently will cause
+ * the two deadlines to drift apart over time as differences in the
+ * periods accumulate, e.g. due to differences in the underlying clocks
+ * or numerical approximation errors.
*/
apic->lapic_timer.target_expiration =
ktime_add_ns(apic->lapic_timer.target_expiration,
apic->lapic_timer.period);
delta = ktime_sub(apic->lapic_timer.target_expiration, now);
- apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
- nsec_to_cycles(apic->vcpu, delta);
+
+ /*
+ * Don't adjust the TSC deadline if the next period has already expired,
+ * e.g. due to software overhead resulting in delays larger than the
+ * period. Blindly adding a negative delta could cause the deadline to
+ * become excessively large due to the deadline being an unsigned value.
+ */
+ apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
+ if (delta > 0)
+ apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
}
static void start_sw_period(struct kvm_lapic *apic)
Powered by blists - more mailing lists