Message-ID: <4C9A583D.7010806@redhat.com>
Date: Wed, 22 Sep 2010 09:25:49 -1000
From: Zachary Amsden <zamsden@...hat.com>
To: Marcelo Tosatti <mtosatti@...hat.com>
CC: kvm@...r.kernel.org, Avi Kivity <avi@...hat.com>,
Glauber Costa <glommer@...hat.com>,
linux-kernel@...r.kernel.org
Subject: Re: [KVM timekeeping fixes 4/4] TSC catchup mode
On 09/21/2010 08:18 AM, Marcelo Tosatti wrote:
> On Mon, Sep 20, 2010 at 03:11:30PM -1000, Zachary Amsden wrote:
>
>> On 09/20/2010 05:38 AM, Marcelo Tosatti wrote:
>>
>>> On Sat, Sep 18, 2010 at 02:38:15PM -1000, Zachary Amsden wrote:
>>>
>>>> Negate the effects of AN TYM spell while the kvm thread is preempted by
>>>> tracking the conversion factor to the highest TSC rate and catching the
>>>> TSC up when it has fallen behind the kernel's view of time. Note that
>>>> once triggered, we don't turn off catchup mode.
>>>>
>>>> A slightly more clever version of this is possible, which only does
>>>> catchup when the TSC rate drops, and which specifically targets only
>>>> CPUs with a broken TSC, but since these are all considered
>>>> unstable_tsc(), this patch covers all necessary cases.
>>>>
>>>> Signed-off-by: Zachary Amsden <zamsden@...hat.com>
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |    6 +++
>>>>  arch/x86/kvm/x86.c              |   87 +++++++++++++++++++++++++++++---------
>>>>  2 files changed, 72 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index 8c5779d..e209078 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -384,6 +384,9 @@ struct kvm_vcpu_arch {
>>>>  	u64 last_host_tsc;
>>>>  	u64 last_guest_tsc;
>>>>  	u64 last_kernel_ns;
>>>> +	u64 last_tsc_nsec;
>>>> +	u64 last_tsc_write;
>>>> +	bool tsc_catchup;
>>>>
>>>>  	bool nmi_pending;
>>>>  	bool nmi_injected;
>>>> @@ -444,6 +447,9 @@ struct kvm_arch {
>>>>  	u64 last_tsc_nsec;
>>>>  	u64 last_tsc_offset;
>>>>  	u64 last_tsc_write;
>>>> +	u32 virtual_tsc_khz;
>>>> +	u32 virtual_tsc_mult;
>>>> +	s8 virtual_tsc_shift;
>>>>
>>>>  	struct kvm_xen_hvm_config xen_hvm_config;
>>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 09f468a..9152156 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -962,6 +962,7 @@ static inline u64 get_kernel_ns(void)
>>>>  }
>>>>
>>>>  static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
>>>> +unsigned long max_tsc_khz;
>>>>
>>>>  static inline int kvm_tsc_changes_freq(void)
>>>>  {
>>>> @@ -985,6 +986,24 @@ static inline u64 nsec_to_cycles(u64 nsec)
>>>>  	return ret;
>>>>  }
>>>>
>>>> +static void kvm_arch_set_tsc_khz(struct kvm *kvm, u32 this_tsc_khz)
>>>> +{
>>>> +	/* Compute a scale to convert nanoseconds in TSC cycles */
>>>> +	kvm_get_time_scale(this_tsc_khz, NSEC_PER_SEC / 1000,
>>>> +			   &kvm->arch.virtual_tsc_shift,
>>>> +			   &kvm->arch.virtual_tsc_mult);
>>>> +	kvm->arch.virtual_tsc_khz = this_tsc_khz;
>>>> +}
>>>> +
>>>> +static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
>>>> +{
>>>> +	u64 tsc = pvclock_scale_delta(kernel_ns - vcpu->arch.last_tsc_nsec,
>>>> +				      vcpu->kvm->arch.virtual_tsc_mult,
>>>> +				      vcpu->kvm->arch.virtual_tsc_shift);
>>>> +	tsc += vcpu->arch.last_tsc_write;
>>>> +	return tsc;
>>>> +}
>>>> +
>>>> +
>>>> void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>>>> {
>>>> struct kvm *kvm = vcpu->kvm;
>>>> @@ -1029,6 +1048,8 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>>>>
>>>>  	/* Reset of TSC must disable overshoot protection below */
>>>>  	vcpu->arch.hv_clock.tsc_timestamp = 0;
>>>> +	vcpu->arch.last_tsc_write = data;
>>>> +	vcpu->arch.last_tsc_nsec = ns;
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(kvm_write_tsc);
>>>>
>>>> @@ -1041,22 +1062,42 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>>>>  	s64 kernel_ns, max_kernel_ns;
>>>>  	u64 tsc_timestamp;
>>>>
>>>> -	if ((!vcpu->time_page))
>>>> -		return 0;
>>>> -
>>>>  	/* Keep irq disabled to prevent changes to the clock */
>>>>  	local_irq_save(flags);
>>>>  	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
>>>>  	kernel_ns = get_kernel_ns();
>>>>  	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>>>> -	local_irq_restore(flags);
>>>>
>>>>  	if (unlikely(this_tsc_khz == 0)) {
>>>> +		local_irq_restore(flags);
>>>>  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
>>>>  		return 1;
>>>>  	}
>>>>
>>>>  	/*
>>>> +	 * We may have to catch up the TSC to match elapsed wall clock
>>>> +	 * time for two reasons, even if kvmclock is used.
>>>> +	 * 1) CPU could have been running below the maximum TSC rate
>>>>
>>> kvmclock handles frequency changes?
>>>
>>>
>>>> +	 * 2) Broken TSC compensation resets the base at each VCPU
>>>> +	 *    entry to avoid unknown leaps of TSC even when running
>>>> +	 *    again on the same CPU.  This may cause apparent elapsed
>>>> +	 *    time to disappear, and the guest to stand still or run
>>>> +	 *    very slowly.
>>>>
>>> I don't get this. Please explain.
>>>
>> This compensation in kvm_arch_vcpu_load, for the unstable-TSC case,
>> causes time that passed while preempted to disappear from the TSC, by
>> adjusting the TSC back to match the last observed value.
>>
>> 	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>> 		/* Make sure TSC doesn't go backwards */
>> 		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>> 				native_read_tsc() -
>> 				vcpu->arch.last_host_tsc;
>> 		if (tsc_delta < 0)
>> 			mark_tsc_unstable("KVM discovered backwards TSC");
>> 		if (check_tsc_unstable())
>> 			kvm_x86_ops->adjust_tsc_offset(vcpu,
>> 						       -tsc_delta);   <<<<<
>>
>> Note that this is the correct thing to do if there are cross-CPU
>> deltas, when switching CPUs, or if the TSC becomes inherently
>> unpredictable while preempted (CPU bugs, kernel resets TSC).
>>
>> However, all the time that elapsed while not running disappears from
>> the TSC (and thus also from kvmclock, without recalibration, as it is
>> based off the TSC). Since we have to recalibrate kvmclock anyway, we
>> might as well catch the TSC up to the proper value.
>>
> Updating kvmclock's tsc_timestamp and system_time should be enough then,
> to fix this particular issue?
>
Yes, it is, for kvmclock guests. For TSC-based kernels (RHEL < 5.5,
FreeBSD, Darwin?), and for guests whose userspace reads the TSC directly,
we still need this.
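To make concrete what "catching up" means here, a minimal standalone
sketch of the logic (illustrative only: the struct, the function names,
and the plain 64-bit multiply/divide below are simplifications; the
patch itself uses the kvm_get_time_scale() / pvclock_scale_delta()
fixed-point math shown above, and runs under local_irq_save()):

	#include <stdint.h>
	#include <stdio.h>

	struct vcpu_tsc {
		uint64_t last_tsc_write;   /* guest TSC at last TSC MSR write */
		uint64_t last_tsc_nsec;    /* kernel ns timestamp of that write */
		uint64_t tsc_offset;       /* host-to-guest TSC offset */
		uint32_t virtual_tsc_khz;  /* rate the guest was promised */
	};

	/* Guest TSC that should have elapsed by now at the virtual rate. */
	static uint64_t compute_guest_tsc(const struct vcpu_tsc *v,
					  uint64_t kernel_ns)
	{
		uint64_t delta_ns = kernel_ns - v->last_tsc_nsec;

		/* ns * kHz / 10^6 = cycles; this naive form overflows for
		 * long deltas, which is why the kernel uses mult/shift. */
		return v->last_tsc_write +
		       delta_ns * v->virtual_tsc_khz / 1000000;
	}

	/* If the guest TSC fell behind wall clock time, push it forward. */
	static void tsc_catchup(struct vcpu_tsc *v, uint64_t host_tsc,
				uint64_t kernel_ns)
	{
		uint64_t guest_tsc = host_tsc + v->tsc_offset;
		uint64_t target = compute_guest_tsc(v, kernel_ns);

		if (guest_tsc < target)
			v->tsc_offset += target - guest_tsc;  /* forwards only */
	}

	int main(void)
	{
		struct vcpu_tsc v = { .virtual_tsc_khz = 2000000 };  /* 2 GHz */

		/* Pretend vcpu_load compensation erased 10ms of preemption: */
		tsc_catchup(&v, 0, 10 * 1000 * 1000);
		printf("offset bumped by %llu cycles\n",
		       (unsigned long long)v.tsc_offset);  /* ~20,000,000 */
		return 0;
	}

So a guest promised 2 GHz that had 10ms erased by the vcpu_load
compensation gets its offset pushed forward by roughly 20 million cycles
the next time the update runs, and guest time resumes where wall clock
time says it should be.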
> The problem is you're sneaking in parts of trap mode (virtual_tsc_khz),
> without dealing with the issues raised in the past iteration. The
> interactions between catchup and trap mode are not clear, migration is not
> handled, etc.
>
Yes, I am :)
While I haven't yet resolved those issues to a successful conclusion,
the situation is at least improved, and incremental progress is better
than nothing.
I do believe that the catchup mode is at least clean and easy to
understand; it is the transition to and from trap mode that raised a lot
of problems, and that is what I'm reworking. Regardless of how that
turns out, it should integrate smoothly on top of the catchup mode. At
least, that is the design goal I'm shooting for, so I'd like to get
these pieces upstream now, as I don't expect them to change much.
Zach