lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c87d11a7-b4dd-463e-b40a-188fd2219b3b@gmail.com>
Date: Fri, 17 Oct 2025 20:21:10 +0800
From: fuqiang wang <fuqiang.wng@...il.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Thomas Gleixner
 <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>,
 Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org,
 linux-kernel@...r.kernel.org, yu chen <chen.yu@...ystack.com>,
 dongxu zhang <dongxu.zhang@...ystack.com>
Subject: Re: [PATCH RESEND] avoid hv timer fallback to sw timer if delay
 exceeds period



On 10/14/25 7:29 AM, Sean Christopherson wrote:
> On Wed, Oct 01, 2025, fuqiang wang wrote:
>> When the guest uses the APIC periodic timer, if the delay exceeds the
>> period, the delta will be negative.
> 
> IIUC, by "delay" you mean the time it takes for KVM to get (back) to
> advance_periodic_target_expiration().  If that's correct, I think it would be
> clearer to word this as:
> 
>    When the guest uses the APIC periodic timer, if the next period has already
>    expired, e.g. due to the period being smaller than the delay in processing
>    the timer, the delta will be negative.
> 

Indeed, this explanation is much clearer. I will adopt it in the next version
patch.

>> nsec_to_cycles() may then convert this
>> delta into an absolute value larger than guest_l1_tsc, resulting in a
>> negative tscdeadline. Since the hv timer supports a maximum bit width of
>> cpu_preemption_timer_multi + 32, this causes the hv timer setup to fail and
>> switch to the sw timer.
>>
>> Moreover, due to the commit 98c25ead5eda ("KVM: VMX: Move preemption timer
>> <=> hrtimer dance to common x86"), if the guest is using the sw timer
>> before blocking, it will continue to use the sw timer after being woken up,
>> and will not switch back to the hv timer until the relevant APIC timer
>> register is reprogrammed.  Since the periodic timer does not require
>> frequent APIC timer register programming, the guest may continue to use the
>> software timer for an extended period.
>>
>> The reproduction steps and patch verification results at link [1].
>>
>> [1]: https://github.com/cai-fuqiang/kernel_test/tree/master/period_timer_test
>>
>> Fixes: 98c25ead5eda ("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86")

Yes, it's better, I will fix it.

> 
> I'm pretty sure this should be:
> 
>    Fixes: d8f2f498d9ed ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
> 
> because that's where the bug with tsdeadline (incorrectly) wrapping was introduced.
> The aforementioned commit exacerbated (and likely exposed?) the bug, but AFAICT
> that commit itself didn't introduce any bugs (related to this issue).
> 
>> Signed-off-by: fuqiang wang <fuqiang.wng@...il.com>
>> ---
>>   arch/x86/kvm/lapic.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 5fc437341e03..afd349f4d933 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -2036,6 +2036,9 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>>   	u64 tscl = rdtsc();
>>   	ktime_t delta;
>>   
>> +	u64 delta_cycles_u;
>> +	u64 delta_cycles_s;
>> +
>>   	/*
>>   	 * Synchronize both deadlines to the same time source or
>>   	 * differences in the periods (caused by differences in the
>> @@ -2047,8 +2050,11 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>>   		ktime_add_ns(apic->lapic_timer.target_expiration,
>>   				apic->lapic_timer.period);
>>   	delta = ktime_sub(apic->lapic_timer.target_expiration, now);
>> +	delta_cycles_u = nsec_to_cycles(apic->vcpu, abs(delta));
>> +	delta_cycles_s = delta > 0 ? delta_cycles_u : -delta_cycles_u;
>> +
>>   	apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
>> -		nsec_to_cycles(apic->vcpu, delta);
>> +		delta_cycles_s;
> 
> This isn't quite correct either.  E.g. if delta is negative while L1 TSC is tiny,
> then subtracting the delta will incorrectly result the deadline wrapping too.
> Very, very, theoretically, L1 TSC could even be '0', e.g. due to a weird offset
> for L1, so I don't think subtracting is ever safe.  Heh, of course we're hosed
> in that case no matter what since KVM treats tscdeadline==0 as "not programmed".
> 
> Anyways, can't we just skip adding negative value?  Whether or not the TSC deadline
> has expired is mostly a boolean value; for the vast majority of code it doesn't
> matter exactly when the timer expired.

I don’t think this patch will cause tscdeadline to wrap around to zero, because
the system is unlikely to start a timer when the guest tsc is zero and have it
expire immediately. However, keeping the logic to skip adding negative values is
a good idea, seems like there’s no downside.

> 
> The only code that cares is __kvm_wait_lapic_expire(), and the only downside to
> setting tscdeadline=L1.TSC is that adjust_lapic_timer_advance() won't adjust as
> aggressively as it probably should.

I am not sure which type of timers should use the "advanced tscdeadline hrtimer
expiration feature".

I list the history of this feature.

1. Marcelo first introduce this feature, only support the tscdeadline sw timer.
2. Yunhong introduce vmx preemption timer(hv), only support for tscdeadline.
3. Liwanpeng extend the hv timer to oneshot and period timers.
4. Liwanpeng extend this feature to hv timer.
5. Sean and liwanpeng fix some BUG extend this feature to hv period/oneshot timer.

[1] d0659d946be0("KVM: x86: add option to advance tscdeadline hrtimer expiration")
     Marcelo Tosatti     Dec 16 2014
[2] ce7a058a2117("KVM: x86: support using the vmx preemption timer for tsc deadline timer")
     Yunhong Jiang       Jun 13 2016
[3] 8003c9ae204e("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
     liwanpeng           Oct 24 2016
[4] c5ce8235cffa("KVM: VMX: Optimize tscdeadline timer latency")
     liwanpeng           May 29 2018
[5] ee66e453db13("KVM: lapic: Busy wait for timer to expire when using hv_timer")
     Sean Christopherson Apr 16 2019

     d981dd15498b("KVM: LAPIC: Accurately guarantee busy wait for timer to expire when using hv_timer")
     liwanpeng           Apr 28 2021

Now, timers supported for this feature includes:
- sw: tscdeadline
- hv: all (tscdeadline, oneshot, period)

====
IMHO
====

1. for period timer
===================

I think for periodic timers emulation, the expiration time is already adjusted
to compensate for the delays introduced by timer emulation, so don't need this
feature to adjust again. But after use the feature, the first timer expiration
may be relatively accurate.

E.g., At time 0, start a periodic task (period: 10,000 ns) with a simulated
delay of 100 ns.

With this feature enabled and reasonably accurate prediction, the expiration
time set seen by the guest are: 10000, 20000, 30000...

With this feature not enabled, the expiration times set: 10100, 20100, 30100...

But IMHO, for periodic timers, accuracy of the period seems to be the main
concern, because it does not frequently start and stop. The incorrect period
caused by the first timer expiration can be ignored.

2. for oneshot timer
====================

In [1], Marcelo treated oneshot and tscdeadline differently. Shouldn’t the
behavior of these two timers be similar? Unlike periodic timers, both oneshot
and tscdeadline timers set a specific expiration time, and what the guest cares
about is whether the expiration time is accurate. Moreover, this feature is
mainly intended to mitigate the latency introduced by timer virtualization.
Since software timers have higher latency compared to hardware virtual timers,
the need for this feature is actually more urgent for software timers. However,
in the current implementation, the feature is applied to hv oneshot/period
timers, but not to sw oneshot/period timers.

===============
Summary of IMHO
===============

The feature should be applied to the following timer types:
sw/hv tscdeadline and sw/hv oneshot

should not be applied to:
sw/hv period

However, so far I haven’t found any discussion on the mailing lists regarding
the commits summarized above. Please let me know if I’ve overlooked something.

> 
> Ha!  That's essentially what update_target_expiration() already does:
> 
> 	now = ktime_get();
> 	remaining = ktime_sub(apic->lapic_timer.target_expiration, now);
> 	if (ktime_to_ns(remaining) < 0)
> 		remaining = 0;
> 
> E.g.
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0ae7f913d782..2fb03a8a9ae9 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2131,18 +2131,26 @@ static void advance_periodic_target_expiration(struct kvm_lapic *apic)
>          ktime_t delta;
>   
>          /*
> -        * Synchronize both deadlines to the same time source or
> -        * differences in the periods (caused by differences in the
> -        * underlying clocks or numerical approximation errors) will
> -        * cause the two to drift apart over time as the errors
> -        * accumulate.
> +        * Use kernel time as the time source for both deadlines so that they
> +        * stay synchronized.  Computing each deadline independently will cause
> +        * the two deadlines to drift apart over time as differences in the
> +        * periods accumulate, e.g. due to differences in the underlying clocks
> +        * or numerical approximation errors.
>           */
>          apic->lapic_timer.target_expiration =
>                  ktime_add_ns(apic->lapic_timer.target_expiration,
>                                  apic->lapic_timer.period);
>          delta = ktime_sub(apic->lapic_timer.target_expiration, now);
> -       apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
> -               nsec_to_cycles(apic->vcpu, delta);
> +
> +       /*
> +        * Don't adjust the TSC deadline if the next period has already expired,
> +        * e.g. due to software overhead resulting in delays larger than the
> +        * period.  Blindly adding a negative delta could cause the deadline to
> +        * become excessively large due to the deadline being an unsigned value.
> +        */
> +       apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl);
> +       if (delta > 0)
> +               apic->lapic_timer.tscdeadline += nsec_to_cycles(apic->vcpu, delta);
>   }
>   
>   static void start_sw_period(struct kvm_lapic *apic)

Thank you for your patient explanations and for helping me make the revisions. I
will update this in the next patch version.

Regards,
fuqiang

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ