linux-kernel - Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <874kcz33g5.wl-maz@kernel.org>
Date:   Mon, 12 Jul 2021 17:24:42 +0100
From:   Marc Zyngier <maz@...nel.org>
To:     Sergey Senozhatsky <senozhatsky@...omium.org>
Cc:     Will Deacon <will@...nel.org>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.cs.columbia.edu,
        linux-kernel@...r.kernel.org,
        virtualization@...ts.linux-foundation.org
Subject: Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support

On Fri, 09 Jul 2021 05:37:13 +0100,
Sergey Senozhatsky <senozhatsky@...omium.org> wrote:
> 
> Add PV-vcpu-state support bits to the host. Host uses the guest
> PV-state per-CPU pointers to update the VCPU state each time
> it kvm_arch_vcpu_load() or kvm_arch_vcpu_put() the VCPU, so
> that guest scheduler can become aware of the fact that not
> all VCPUs are always available. Currently guest scheduler
> on amr64 always assumes that all CPUs are available because

arm64

> vcpu_is_preempted() is not implemented on arm64.
> 
> - schbench -t 3 -m 3 -p 4096
> 
>   Latency percentiles (usec)
>         BASE
> ================================================
>         50.0th: 1 (3556427 samples)
>         75.0th: 13 (879210 samples)
>         90.0th: 15 (893311 samples)
>         95.0th: 18 (159594 samples)
>         *99.0th: 118 (224187 samples)
>         99.5th: 691 (28555 samples)
>         99.9th: 7384 (23076 samples)
>         min=1, max=104218
> avg worker transfer: 25192.00 ops/sec 98.41MB/s
> 
>         PATCHED
> ================================================
>         50.0th: 1 (3507010 samples)
>         75.0th: 13 (1635775 samples)
>         90.0th: 16 (901271 samples)
>         95.0th: 24 (281051 samples)
>         *99.0th: 114 (255581 samples)
>         99.5th: 382 (33051 samples)
>         99.9th: 6392 (26592 samples)
>         min=1, max=83877
> avg worker transfer: 28613.39 ops/sec 111.77MB/s
> 
> - perf bench sched all
> 
>   ops/sec
>         BASE                 PATCHED
> ================================================
>         33452		     36485
>         33541		     39405
>         33365		     36858
>         33455		     38047
>         33449		     37866
>         33616		     34922
>         33479		     34388
>         33594		     37203
>         33458		     35363
>         33704		     35180
> 
> Student's T-test
> 
>          N           Min           Max        Median           Avg        Stddev
> base     10         33365         33704         33479       33511.3     100.92467
> patched  10         34388         39405         36858       36571.7      1607.454
> Difference at 95.0% confidence
> 	3060.4 +/- 1070.09
> 	9.13244% +/- 3.19321%
> 	(Student's t, pooled s = 1138.88)
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@...omium.org>
> ---
>  arch/arm64/include/asm/kvm_host.h | 23 +++++++++++
>  arch/arm64/kvm/Makefile           |  3 +-
>  arch/arm64/kvm/arm.c              |  3 ++
>  arch/arm64/kvm/hypercalls.c       | 11 ++++++
>  arch/arm64/kvm/pv-vcpu-state.c    | 64 +++++++++++++++++++++++++++++++
>  5 files changed, 103 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/pv-vcpu-state.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 41911585ae0c..e782f4d0c916 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -381,6 +381,12 @@ struct kvm_vcpu_arch {
>  		u64 last_steal;
>  		gpa_t base;
>  	} steal;
> +
> +	/* PV state of the VCPU */
> +	struct {
> +		gpa_t base;
> +		struct gfn_to_hva_cache ghc;

Please stay consistent with what we use of steal time accounting
(i.e. don't use gfn_to_hva_cache, but directly use a gpa with
kvm_put_guest()). If you can demonstrate that there is an unacceptable
overhead in doing so, then both steal time and preemption will be
updated at the same time.


> +	} vcpu_state;
>  };
>  
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> @@ -695,6 +701,23 @@ static inline bool kvm_arm_is_pvtime_enabled(struct kvm_vcpu_arch *vcpu_arch)
>  	return (vcpu_arch->steal.base != GPA_INVALID);
>  }
>  
> +int kvm_init_vcpu_state(struct kvm_vcpu *vcpu, gfn_t addr);
> +int kvm_release_vcpu_state(struct kvm_vcpu *vcpu);
> +
> +static inline void kvm_arm_vcpu_state_init(struct kvm_vcpu_arch *vcpu_arch)
> +{
> +	vcpu_arch->vcpu_state.base = GPA_INVALID;
> +	memset(&vcpu_arch->vcpu_state.ghc, 0, sizeof(struct gfn_to_hva_cache));

Does it really need to be inlined?

> +}
> +
> +static inline bool
> +kvm_arm_is_vcpu_state_enabled(struct kvm_vcpu_arch *vcpu_arch)
> +{
> +	return (vcpu_arch->vcpu_state.base != GPA_INVALID);
> +}
> +
> +void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted);
> +
>  void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
>  
>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..2a3ee82c6d90 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -12,7 +12,8 @@ obj-$(CONFIG_KVM) += hyp/
>  
>  kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>  	 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
> -	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
> +	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o \
> +	 pvtime.o pv-vcpu-state.o \
>  	 inject_fault.o va_layout.o handle_exit.o \
>  	 guest.o debug.o reset.o sys_regs.o \
>  	 vgic-sys-reg-v3.o fpsimd.o pmu.o \
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e9a2b8f27792..43e995c9fddb 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -332,6 +332,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	kvm_arm_reset_debug_ptr(vcpu);
>  
>  	kvm_arm_pvtime_vcpu_init(&vcpu->arch);
> +	kvm_arm_vcpu_state_init(&vcpu->arch);
>  
>  	vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
>  
> @@ -429,10 +430,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (vcpu_has_ptrauth(vcpu))
>  		vcpu_ptrauth_disable(vcpu);
>  	kvm_arch_vcpu_load_debug_state_flags(vcpu);
> +	kvm_update_vcpu_preempted(vcpu, false);
>  }
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +	kvm_update_vcpu_preempted(vcpu, true);

This doesn't look right. With this, you are now telling the guest that
a vcpu that is blocked on WFI is preempted. This really isn't the
case, as it has voluntarily entered a low-power mode while waiting for
an interrupt. Indeed, the vcpu isn't running. A physical CPU wouldn't
be running either.

>  	kvm_arch_vcpu_put_debug_state_flags(vcpu);
>  	kvm_arch_vcpu_put_fp(vcpu);
>  	if (has_vhe())
> diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> index 30da78f72b3b..95bcf86e0b6f 100644
> --- a/arch/arm64/kvm/hypercalls.c
> +++ b/arch/arm64/kvm/hypercalls.c
> @@ -110,6 +110,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  		case ARM_SMCCC_HV_PV_TIME_FEATURES:
>  			val[0] = SMCCC_RET_SUCCESS;
>  			break;
> +		case ARM_SMCCC_HV_PV_VCPU_STATE_FEATURES:
> +			val[0] = SMCCC_RET_SUCCESS;
> +			break;
>  		}
>  		break;
>  	case ARM_SMCCC_HV_PV_TIME_FEATURES:
> @@ -139,6 +142,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  	case ARM_SMCCC_TRNG_RND32:
>  	case ARM_SMCCC_TRNG_RND64:
>  		return kvm_trng_call(vcpu);
> +	case ARM_SMCCC_HV_PV_VCPU_STATE_INIT:
> +		if (kvm_init_vcpu_state(vcpu, smccc_get_arg1(vcpu)) == 0)
> +			val[0] = SMCCC_RET_SUCCESS;
> +		break;
> +	case ARM_SMCCC_HV_PV_VCPU_STATE_RELEASE:
> +		if (kvm_release_vcpu_state(vcpu) == 0)
> +			val[0] = SMCCC_RET_SUCCESS;
> +		break;
>  	default:
>  		return kvm_psci_call(vcpu);
>  	}
> diff --git a/arch/arm64/kvm/pv-vcpu-state.c b/arch/arm64/kvm/pv-vcpu-state.c
> new file mode 100644
> index 000000000000..8496bb2a5966
> --- /dev/null
> +++ b/arch/arm64/kvm/pv-vcpu-state.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/arm-smccc.h>
> +#include <linux/kvm_host.h>
> +
> +#include <asm/kvm_mmu.h>
> +#include <asm/paravirt.h>
> +
> +#include <kvm/arm_psci.h>

Why this include?

> +
> +int kvm_init_vcpu_state(struct kvm_vcpu *vcpu, gpa_t addr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	int ret;
> +	u64 idx;
> +
> +	if (kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
> +		return 0;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	ret = kvm_gfn_to_hva_cache_init(vcpu->kvm,
> +					&vcpu->arch.vcpu_state.ghc,
> +					addr,
> +					sizeof(struct vcpu_state));
> +	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	if (!ret)
> +		vcpu->arch.vcpu_state.base = addr;
> +	return ret;
> +}
> +
> +int kvm_release_vcpu_state(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
> +		return 0;
> +
> +	kvm_arm_vcpu_state_init(&vcpu->arch);
> +	return 0;
> +}
> +
> +void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 idx;
> +
> +	if (!kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
> +		return;
> +
> +	/*
> +	 * This function is called from atomic context, so we need to
> +	 * disable page faults. kvm_write_guest_cached() will call
> +	 * might_fault().
> +	 */
> +	pagefault_disable();
> +	/*
> +	 * Need to take the SRCU lock because kvm_write_guest_offset_cached()
> +	 * calls kvm_memslots();
> +	 */
> +	idx = srcu_read_lock(&kvm->srcu);
> +	kvm_write_guest_cached(kvm, &vcpu->arch.vcpu_state.ghc,
> +			       &preempted, sizeof(bool));
> +	srcu_read_unlock(&kvm->srcu, idx);
> +	pagefault_enable();
> +}

Before rushing into an implementation, this really could do with some
specification:

- where is the memory allocated?

- what is the lifetime of the memory?

- what is its expected memory type?

- what is the coherency model (what if the guest runs with caching
  disabled, for example)?

- is the functionality preserved across a VM reset?

- what is the strategy w.r.t live migration?

- can userspace detect that the feature is implemented?

- can userspace prevent the feature from being used?

- where is the documentation?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.