Message-ID: <b6661934-53d1-f7ca-d3d6-31b32a2ebb05@gmail.com>
Date: Mon, 25 Sep 2023 15:36:47 +0800
From: Like Xu <like.xu.linux@...il.com>
To: David Woodhouse <dwmw2@...radead.org>,
Sean Christopherson <seanjc@...gle.com>
Cc: Oliver Upton <oliver.upton@...ux.dev>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [PATCH v6] KVM: x86/tsc: Don't sync user-written TSC against
startup values
Ping for further comments and confirmation from Sean.
Could we move forward with a new version for this issue? Thanks.
On 19/9/2023 7:29 pm, Like Xu wrote:
> On 14/9/2023 3:31 pm, David Woodhouse wrote:
>> On Thu, 2023-09-14 at 11:50 +0800, Like Xu wrote:
>>> On 13/9/2023 10:47 pm, Sean Christopherson wrote:
>>>> On Wed, Sep 13, 2023, Like Xu wrote:
>>>>> I'll wait for a cooling-off period to see if the maintainers need me to
>>>>> post v7.
>>>>
>>>> You should have waited to post v5, let alone v6. Resurrecting a thread
>>>> after a month and not waiting even 7 hours for others to respond is
>>>> extremely frustrating.
>>>
>>> You are right. I don't seem to be keeping up with many other issues; sorry
>>> for that. I wish there were 48 hours in a day.
>>>
>>> Back to this issue: for the commit message, I'd be more inclined toward
>>> David's understanding.
>>
>> The discussion that Sean and I had should probably be reflected in the
>> commit message too. To the end of the commit log you used for v6, after
>> the final 'To that end:…' paragraph, let's add:
>>
>> Note that userspace can explicitly request a *synchronization* of the
>> TSC by writing zero. For the purpose of this patch, this counts as
>> "setting" the TSC. If userspace then subsequently writes an explicit
>> non-zero value which happens to be within 1 second of the previous
>> value, it will be 'corrected'. For that case, this preserves the prior
>> behaviour of KVM (which always applied the 1-second 'correction'
>> regardless of user vs. kernel).
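
As an aside for anyone reading along (not part of the patch): the "user
write" above means a host-initiated write of MSR_IA32_TSC, e.g. via the
KVM_SET_MSRS vCPU ioctl. A minimal sketch of how a VMM might issue one,
assuming vcpu_fd is an open vCPU file descriptor and vmm_write_tsc() is
a hypothetical helper name:

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define MSR_IA32_TSC	0x10	/* from asm/msr-index.h */

	/*
	 * Host-initiated TSC write. Per the semantics described above,
	 * value == 0 asks KVM to synchronize this vCPU with the others;
	 * any non-zero value is an explicit "set".
	 */
	static int vmm_write_tsc(int vcpu_fd, __u64 value)
	{
		/* struct kvm_msrs ends in a flexible array of entries. */
		struct {
			struct kvm_msrs hdr;
			struct kvm_msr_entry entry;
		} msrs;

		memset(&msrs, 0, sizeof(msrs));
		msrs.hdr.nmsrs = 1;
		msrs.entry.index = MSR_IA32_TSC;
		msrs.entry.data = value;

		/* KVM_SET_MSRS returns the number of MSRs written. */
		return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs.hdr);
	}
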
>>
>>> @@ -2728,27 +2729,45 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>>> elapsed = ns - kvm->arch.last_tsc_nsec;
>>>
>>> if (vcpu->arch.virtual_tsc_khz) {
>>> + /*
>>> + * Force synchronization when creating or hotplugging a vCPU,
>>> + * i.e. when the TSC value is '0', to help keep clocks stable.
>>> + * If this is NOT a hotplug/creation case, skip synchronization
>>> + * on the first write from userspace so as not to misconstrue
>>> + * state restoration after live migration as an attempt from
>>> + * userspace to synchronize.
>>> + */
>>
>> You cannot *misconstrue* an attempt from userspace to synchronize. If
>> userspace writes a zero, it's a sync attempt. If it's non-zero it's a
>> TSC value to be set. It's not very subtle :)
>>
>> I think the 1-second slop thing is sufficiently documented in the 'else
>> if' clause below, so I started writing an alternative 'overall' comment
>> to go here and found it a bit redundant. So maybe let's just drop this
>> comment and add one back in the if (data == 0) case...
>>
>>> if (data == 0) {
>>> - /*
>>> - * detection of vcpu initialization -- need to sync
>>> - * with other vCPUs. This particularly helps to keep
>>> - * kvm_clock stable after CPU hotplug
>>> - */
>>
>>
>> /*
>> * Force synchronization when creating a vCPU, or when
>> * userspace explicitly writes a zero value.
>> */
>>
>>> synchronizing = true;
>>> - } else {
>>> + } else if (kvm->arch.user_set_tsc) {
>>> u64 tsc_exp = kvm->arch.last_tsc_write +
>>> nsec_to_cycles(vcpu, elapsed);
>>> u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
>>> /*
>>> - * Special case: TSC write with a small delta (1 second)
>>> - * of virtual cycle time against real time is
>>> - * interpreted as an attempt to synchronize the CPU.
>>> + * Here lies UAPI baggage: when a user-initiated TSC write has
>>> + * a small delta (1 second) of virtual cycle time against the
>>> + * previously set vCPU, we assume that they were intended to be
>>> + * in sync and the delta was only due to the racy nature of the
>>> + * legacy API.
>>> + *
>>> + * This trick falls down when restoring a guest which genuinely
>>> + * has been running for less time than the 1 second of imprecision
>>> + * which we allow for in the legacy API. In this case, the first
>>> + * value written by userspace (on any vCPU) should not be subject
>>> + * to this 'correction' to make it sync up with values that only
>>
>> Missing the word 'come' here too, in '…that only *come* from…'.
>>
>>> + * from the kernel's default vCPU creation. Make the 1-second slop
>>> + * hack only trigger if the user_set_tsc flag is already set.
>>> + *
>>> + * The correct answer is for the VMM not to use the legacy API.
>>
>> Maybe we should drop this line, as we don't actually have a sane API
>> yet that VMMs can use instead.
>>
>
> Thanks for your comments, but I'm not sure whether Sean has any remaining
> concerns before we move forward:
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4def36d5bb..9a7dfef9d32d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1324,6 +1324,7 @@ struct kvm_arch {
> int nr_vcpus_matched_tsc;
>
> u32 default_tsc_khz;
> + bool user_set_tsc;
>
> seqcount_raw_spinlock_t pvclock_sc;
> bool use_master_clock;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..11fbd2a4a370 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2714,8 +2714,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
> kvm_track_tsc_matching(vcpu);
> }
>
> -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
> {
> + u64 data = user_value ? *user_value : 0;
> struct kvm *kvm = vcpu->kvm;
> u64 offset, ns, elapsed;
> unsigned long flags;
> @@ -2730,25 +2731,37 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> if (vcpu->arch.virtual_tsc_khz) {
> if (data == 0) {
> /*
> - * detection of vcpu initialization -- need to sync
> - * with other vCPUs. This particularly helps to keep
> - * kvm_clock stable after CPU hotplug
> + * Force synchronization when creating a vCPU, or when
> + * userspace explicitly writes a zero value.
> */
> synchronizing = true;
> - } else {
> + } else if (kvm->arch.user_set_tsc) {
> u64 tsc_exp = kvm->arch.last_tsc_write +
> nsec_to_cycles(vcpu, elapsed);
> u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
> /*
> - * Special case: TSC write with a small delta (1 second)
> - * of virtual cycle time against real time is
> - * interpreted as an attempt to synchronize the CPU.
> + * Here lies UAPI baggage: when a user-initiated TSC write has
> + * a small delta (1 second) of virtual cycle time against the
> + * previously set vCPU, we assume that they were intended to be
> + * in sync and the delta was only due to the racy nature of the
> + * legacy API.
> + *
> + * This trick falls down when restoring a guest which genuinely
> + * has been running for less time than the 1 second of imprecision
> + * which we allow for in the legacy API. In this case, the first
> + * value written by userspace (on any vCPU) should not be subject
> + * to this 'correction' to make it sync up with values that only
> + * come from the kernel's default vCPU creation. Make the 1-second
> + * slop hack only trigger if the user_set_tsc flag is already set.
> */
> synchronizing = data < tsc_exp + tsc_hz &&
> data + tsc_hz > tsc_exp;
> }
> }
>
> + if (user_value)
> + kvm->arch.user_set_tsc = true;
> +
> /*
> * For a reliable TSC, we can match TSC offsets, and for an unstable
> * TSC, we add elapsed time in this computation. We could let the
> @@ -3777,7 +3790,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> break;
> case MSR_IA32_TSC:
> if (msr_info->host_initiated) {
> - kvm_synchronize_tsc(vcpu, data);
> + kvm_synchronize_tsc(vcpu, &data);
> } else {
> u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) -
> vcpu->arch.l1_tsc_offset;
> adjust_tsc_offset_guest(vcpu, adj);
> @@ -5536,6 +5549,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
> tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
> ns = get_kvmclock_base_ns();
>
> + kvm->arch.user_set_tsc = true;
> __kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
> raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>
> @@ -11959,7 +11973,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
> if (mutex_lock_killable(&vcpu->mutex))
> return;
> vcpu_load(vcpu);
> - kvm_synchronize_tsc(vcpu, 0);
> + kvm_synchronize_tsc(vcpu, NULL);
> vcpu_put(vcpu);
>
> /* poll control enabled by default */
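
To spell out the 1-second slop check quoted above, for anyone skimming
the thread: the condition is an overflow-safe way of testing that the
written value lies within one second's worth of virtual TSC cycles of
the expected value, i.e. |data - tsc_exp| < tsc_hz. A self-contained
sketch of the same logic with a worked example (this mirrors the hunk
above; it is an illustration, not new kernel code):

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	typedef uint64_t u64;

	/*
	 * Treat the write as a synchronization attempt iff it lands
	 * within tsc_hz cycles (one second of virtual TSC time) of the
	 * expected value. Written with additions on both sides so that
	 * data < tsc_exp doesn't underflow an unsigned subtraction.
	 */
	static bool within_one_second(u64 data, u64 tsc_exp, u64 tsc_hz)
	{
		return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
	}

	int main(void)
	{
		/* A 1 GHz guest TSC: virtual_tsc_khz = 1000000, so
		 * tsc_hz = 10^9 cycles per second. */
		u64 tsc_hz = 1000000ULL * 1000;
		u64 tsc_exp = 5000000000ULL;

		assert(within_one_second(tsc_exp + tsc_hz - 1, tsc_exp, tsc_hz));
		assert(!within_one_second(tsc_exp + tsc_hz, tsc_exp, tsc_hz));
		assert(within_one_second(tsc_exp - tsc_hz + 1, tsc_exp, tsc_hz));
		return 0;
	}

Note also that with this patch the correction only applies once
kvm->arch.user_set_tsc has been set, which happens on any userspace
write (including an explicit zero, and including the
kvm_arch_tsc_set_attr() path), preserving the prior behaviour for
subsequent user writes, as David describes above.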