linux-kernel - Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"


Message-ID: <20150922190213.GC23748@amt.cnet>
Date:	Tue, 22 Sep 2015 16:02:13 -0300
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Radim Krčmář <rkrcmar@...hat.com>
Cc:	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	Paolo Bonzini <pbonzini@...hat.com>,
	Luiz Capitulino <lcapitulino@...hat.com>,
	stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "KVM: x86: zero kvmclock_offset when vcpu0
 initializes kvmclock system MSR"

On Tue, Sep 22, 2015 at 06:33:46PM +0200, Radim Krčmář wrote:
> PVCLOCK_COUNTS_FROM_ZERO broke ABI and (at least) three things with it.
> All problems stem from repeated writes to MSR_KVM_SYSTEM_TIME(_NEW).
> The reverted patch treated the MSR write as a one-shot initializer:
> any write from VCPU 0 would reset system_time.
> 
> And this is what broke for Linux guests:
> * Onlining/hotplugging of VCPU 0
> 
>   VCPU has to assign an address to KVM clock before use, which is done
>   with MSR_KVM_SYSTEM_TIME_NEW.  Linux has an idea that time should not
>   jump backward, so any `sleep` won't return until |system_time at the
>   point of offline| has elapsed since the online.  Be sure to run ntp.
> 
> * S3 and S4 resume
> 
>   If you don't have PVCLOCK_TSC_STABLE_BIT in Linux, resume will freeze
>   for |system_time at the point of suspend|, because pvclock ensures
>   monotonicity and kvmclock did not account for that.
> 
>   If you have stable clock, execution will resume immediately, but
>   restoring KVM clock writes to the MSR and dmesg starts to count from
>   zero.  It's better than the onlining, but not what we want either.
> 
> * Boot of SLES 10 guest
> 
>   SLES 10 has a custom implementation of kvm clock that writes
>   MSR_KVM_SYSTEM_TIME before every read to enhance precision ...
>   Two things are happening at the same time:
>   1) The guest periodically receives an interrupt that is handled by
>      main_timer_handler():
>      a) get time using the kvm clock:
>         1) write the address to MSR_KVM_SYSTEM_TIME
>         2) read tsc and pvclock (tsc_offset, system_time)
>         3) time = tsc - tsc_offset + system_time
>      b) compute time since the last main_timer_handler()
>      c) bump jiffies if enough time has elapsed
>   2) the guest wants to calibrate loops per jiffy [1]:
>      a) read tsc
>      b) loop till jiffies increase
>      c) compute lpj
> 
>   Because (1a1) always resets the system_time to 0, we read the same
>   value over and over so the condition for (1c) is never true and
>   jiffies remain constant.  A hang happens in (2b) as it is the first
>   place that depends on jiffies.
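
[Editor's sketch] The SLES 10 hang can be reproduced in a toy model of
steps (1a) to (1c). Everything here is an assumption made for
illustration: the `struct sim` fields, the 4 ms tick, and the
`reset_on_write` switch that stands in for the reverted behaviour. It is
not the SLES 10 or KVM code.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS 4000000ULL /* assumed 250 Hz tick, illustration only */

struct sim {
    uint64_t host_ns;      /* host time since the guest started */
    uint64_t offset_ns;    /* host time at which system_time was zero */
    uint64_t last_tick_ns; /* guest time accounted for by jiffies so far */
    uint64_t jiffies;
    bool reset_on_write;   /* true models the reverted patch */
};

/* Step (1a1): the guest writes the MSR before every read. */
static void write_system_time_msr(struct sim *s)
{
    if (s->reset_on_write)
        s->offset_ns = s->host_ns; /* system_time restarts from zero */
}

/* Steps (1a2) and (1a3): time as seen by the guest. */
static uint64_t read_kvmclock(const struct sim *s)
{
    return s->host_ns - s->offset_ns;
}

/* Steps (1a) to (1c): one timer interrupt. */
static void main_timer_handler(struct sim *s)
{
    write_system_time_msr(s);
    uint64_t now = read_kvmclock(s);

    /* Bump jiffies once per full tick that has elapsed. */
    while (now - s->last_tick_ns >= TICK_NS) {
        s->last_tick_ns += TICK_NS;
        s->jiffies++;
    }
}

/* Deliver `irqs` timer interrupts, TICK_NS apart; return final jiffies. */
static uint64_t run(bool reset_on_write, unsigned irqs)
{
    struct sim s = { .reset_on_write = reset_on_write };

    for (unsigned i = 0; i < irqs; i++) {
        s.host_ns += TICK_NS;
        main_timer_handler(&s);
    }
    return s.jiffies;
}
```

In this model `run(false, 10)` ends with jiffies at 10, while
`run(true, 10)` leaves jiffies at 0: every read sees time 0, so a loop
waiting for jiffies to increase, as in (2b), would never finish.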
> 
> 
> We could add a hypervisor workaround for this, but that is just asking
> for more trouble.  Luckily, reverting does not break guests that
> learned about PVCLOCK_COUNTS_FROM_ZERO in any new way.
> Only 4.2+ guests with NOHZ_FULL wanted PVCLOCK_COUNTS_FROM_ZERO, so
> losing it is a good trade-off for not regressing.
> 
> This reverts commit b7e60c5aedd2b63f16ef06fde4f81ca032211bc5.
> And adds a note to the definition of PVCLOCK_COUNTS_FROM_ZERO.
> 
> Cc: stable@...r.kernel.org
> Signed-off-by: Radim Krčmář <rkrcmar@...hat.com>
> ---
>  v1: Extended commit message based on a discussion with Marcelo
> 
>  arch/x86/include/asm/pvclock-abi.h | 1 +
>  arch/x86/kvm/x86.c                 | 4 ----
>  2 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
> index 655e07a48f6c..67f08230103a 100644
> --- a/arch/x86/include/asm/pvclock-abi.h
> +++ b/arch/x86/include/asm/pvclock-abi.h
> @@ -41,6 +41,7 @@ struct pvclock_wall_clock {
>  
>  #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
>  #define PVCLOCK_GUEST_STOPPED	(1 << 1)
> +/* PVCLOCK_COUNTS_FROM_ZERO broke ABI and can't be used anymore. */
>  #define PVCLOCK_COUNTS_FROM_ZERO (1 << 2)
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_X86_PVCLOCK_ABI_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4bca39f0fdb3..71731994d897 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1711,8 +1711,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		vcpu->pvclock_set_guest_stopped_request = false;
>  	}
>  
> -	pvclock_flags |= PVCLOCK_COUNTS_FROM_ZERO;
> -
>  	/* If the host uses TSC clocksource, then it is stable */
>  	if (use_master_clock)
>  		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;
> @@ -2010,8 +2008,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  					&vcpu->requests);
>  
>  			ka->boot_vcpu_runs_old_kvmclock = tmp;
> -
> -			ka->kvmclock_offset = -get_kernel_ns();
>  		}
>  
>  		vcpu->arch.time = data;
> -- 
> 2.5.3

NACK, please use original patchset.