linux-kernel - Re: [PATCH v5] KVM: x86/tsc: Don't sync TSC on the first write in state restoration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <461B7217-7AA7-479E-9060-772E243CB03D@infradead.org>
Date:   Wed, 13 Sep 2023 11:30:06 +0200
From:   David Woodhouse <dwmw2@...radead.org>
To:     Like Xu <like.xu.linux@...il.com>,
        Sean Christopherson <seanjc@...gle.com>
CC:     Paolo Bonzini <pbonzini@...hat.com>,
        Oliver Upton <oliver.upton@...ux.dev>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5] KVM: x86/tsc: Don't sync TSC on the first write in state restoration



On 13 September 2023 11:17:49 CEST, Like Xu <like.xu.linux@...il.com> wrote:
>> I think you also need to set kvm->arch.user_set_tsc() in
>> kvm_arch_tsc_set_attr(), don't you?
>
>How about:
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index c55cc60769db..374965f66137 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -5545,6 +5545,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
> 		tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
> 		ns = get_kvmclock_base_ns();
>
>+		kvm->arch.user_set_tsc = true;
> 		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
> 		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);

Yep, that looks good.


>> This comment isn't quite right; it wants to use some excerpt of the
>> commit message I've suggested above.
>
>How about:
>
>@@ -2735,20 +2735,34 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> 			 * kvm_clock stable after CPU hotplug
> 			 */
> 			synchronizing = true;
>-		} else {
>+		} else if (kvm->arch.user_set_tsc) {
> 			u64 tsc_exp = kvm->arch.last_tsc_write +
> 						nsec_to_cycles(vcpu, elapsed);
> 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> 			/*
>-			 * Special case: TSC write with a small delta (1 second)
>-			 * of virtual cycle time against real time is
>-			 * interpreted as an attempt to synchronize the CPU.
>+			 * Here lies UAPI baggage: user-initiated TSC write with
>+			 * a small delta (1 second) of virtual cycle time
>+			 * against real time is interpreted as an attempt to
>+			 * synchronize the CPU.

Much better, thanks. But I don't much like "an attempt to synchronize the CPU". 

In my response to Sean I objected to that classification. Userspace is just *setting* the TSC. There is no dedicated intent to "synchronize" it. It just sets it, and the value just *might* happen to be in sync with another vCPU. 

It's just that our API is so fundamentally broken that it *can't* be in sync, so we "help" it a bit if it looks close.

So maybe...

Here lies UAPI baggage: when a user-initiated TSC write has a small delta (1 second) of virtual cycle time against the previously set vCPU, we assume that they were intended to be in sync and the delta was only due to the racy nature of the legacy API.

>+			 * This trick falls down when restoring a guest which genuinely
>+			 * has been running for less time than the 1 second of imprecision
>+			 * which we allow for in the legacy API. In this case, the first
>+			 * value written by userspace (on any vCPU) should not be subject
>+			 * to this 'correction' to make it sync up with values that only
>+			 * from from the kernel's default vCPU creation. Make the 1-second
>+			 * slop hack only trigger if flag is already set.
>+			 *
>+			 * The correct answer is for the VMM not to use the legacy API.




>> Userspace used to be able to write zero to force a sync. You've removed
>> that ability from the ABI, and haven't even mentioned it. Must we?
>
>Will continue to use "bool user_initiated" for lack of a better move.

Why? Can't we treat an explicit zero write just the same as when the kernel does it?