Message-ID: <963f68c8-b109-7ebb-751d-14ce46e3cdde@arm.com>
Date: Wed, 22 Sep 2021 11:11:44 +0100
From: Suzuki K Poulose <suzuki.poulose@....com>
To: Alexandru Elisei <alexandru.elisei@....com>, maz@...nel.org,
james.morse@....com, linux-arm-kernel@...ts.infradead.org,
kvmarm@...ts.cs.columbia.edu, will@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling
Extension (SPE) support
On 25/08/2021 17:17, Alexandru Elisei wrote:
> This is v4 of the SPE series posted at [1]. v2 can be found at [2], and the
> original series at [3].
>
> Statistical Profiling Extension (SPE) is an optional feature added in
> ARMv8.2. It allows sampling at regular intervals of the operations executed
> by the PE and storing a record of each operation in a memory buffer. A high
> level overview of the extension is presented in an article on arm.com [4].
>
> This is another complete rewrite of the series, and nothing is set in
> stone. If you think of a better way to do things, please suggest it.
>
>
> Features added
> ==============
>
> The rewrite enabled me to add support for several features not
> present in the previous iteration:
>
> - Support for heterogeneous systems, where only some of the CPUs support SPE.
> This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.
>
> - Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
>   VCPU ioctl (a usage sketch follows this list).
>
> - The requirement for userspace to mlock() the guest memory has been removed,
> and now userspace can make changes to memory contents after the memory is
> mapped at stage 2.
>
> - Better debugging of guest memory pinning by printing a warning when we
>   get an unexpected read or write fault. This helped me catch several bugs
>   during development, and it has already proven very useful. Many thanks to
>   James, who suggested this when reviewing v3.
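>
> A rough illustration of how a VMM might drive the STOP control (the
> ioctl and attribute names are from this series, but treat the
> device-attribute encoding below as a sketch modelled on the other
> KVM_ARM_VCPU_*_CTRL groups; the individual patches are authoritative):
>
>     #include <sys/ioctl.h>
>     #include <linux/kvm.h>
>
>     /* Ask KVM to stop profiling on this VCPU, e.g. before migration.
>      * Assumes the usual VCPU device-attribute shape. */
>     static int vcpu_spe_stop(int vcpu_fd)
>     {
>             struct kvm_device_attr attr = {
>                     .group = KVM_ARM_VCPU_SPE_CTRL,
>                     .attr  = KVM_ARM_VCPU_SPE_STOP,
>             };
>
>             return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
>     }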
>
>
> Missing features
> ================
>
> I've tried to keep the series as small as possible to make it easier to review,
> while implementing the core functionality needed for the SPE emulation. As such,
> I've chosen to not implement several features:
>
> - Host profiling a guest which has the SPE feature bit set (see open
> questions).
>
> - No errata workarounds have been implemented yet, and there are quite a few of
> them for Neoverse N1 and Neoverse V1.
>
> - Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work and I am
>   investigating other ways to get around automatic NUMA balancing, like
>   requiring userspace to disable it via set_mempolicy() (see the sketch
>   after this list). I am also going to look at how VFIO gets around it.
>   Suggestions welcome.
>
> - There's plenty of room for optimization. Off the top of my head: using
>   block mappings at stage 2, batching the pinning of pages (similar to what
>   VFIO does), optimizing the way KVM keeps track of pinned pages (using a
>   linked list triples the memory usage), context-switching the SPE registers
>   on vcpu_load/vcpu_put on VHE if the host is not profiling, locking
>   optimizations, etc.
>
> - ...and others. I'm sure I'm missing at least a few things which are
> important for someone.
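>
> For reference, the set_mempolicy() idea mentioned above would look
> roughly like the snippet below from the VMM. This is untested, and
> whether automatic NUMA balancing leaves MPOL_BIND memory alone in all
> cases is exactly what I still need to confirm:
>
>     #include <numaif.h>   /* set_mempolicy(), MPOL_BIND; link with -lnuma */
>     #include <stdio.h>
>
>     /* Bind the VMM's allocations to node 0 so the balancer has no
>      * reason to unmap and migrate guest pages behind KVM's back. */
>     static void bind_to_node0(void)
>     {
>             unsigned long nodemask = 1UL << 0;  /* node 0 only */
>
>             if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
>                     perror("set_mempolicy");
>     }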
>
>
> Known issues
> ============
>
> This is an RFC, so keep in mind that there will almost certainly be scary
> bugs. For example, below is a list of known issues which don't affect the
> correctness of the emulation, and which I'm planning to fix in a future
> iteration:
>
> - With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
>   the VCPU performs the pending dcache clean operations.
>
> - With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
> kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
> VCPUs.
>
> This BUG statement can also be triggered with mainline. To reproduce it,
> compile kvmtool from this branch [5] and follow the instructions in the
> kvmtool commit message.
>
> One workaround could be to stop trying to lock all VCPUs when locking a
> memslot, and instead document the requirement that no VCPUs are run before
> the ioctl completes; otherwise bad things might happen to the VM.
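>
> For reference, the pattern that trips lockdep looks roughly like this
> (illustrative names, not the exact code from the series). With
> CONFIG_PROVE_LOCKING=y, lockdep tracks at most MAX_LOCK_DEPTH (48) held
> locks per task, which lines up with the 48 VCPU threshold:
>
>     static void unlock_vcpus(struct kvm *kvm, int last)
>     {
>             while (last >= 0)
>                     mutex_unlock(&kvm_get_vcpu(kvm, last--)->mutex);
>     }
>
>     static bool lock_all_vcpus(struct kvm *kvm)
>     {
>             struct kvm_vcpu *vcpu;
>             int i;
>
>             kvm_for_each_vcpu(i, vcpu, kvm) {
>                     /* The 49th trylock overflows lockdep's held-lock
>                      * stack and triggers the BUG. */
>                     if (!mutex_trylock(&vcpu->mutex)) {
>                             unlock_vcpus(kvm, i - 1);
>                             return false;
>                     }
>             }
>             return true;
>     }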
>
>
> Open questions
> ==============
>
> 1. Implementing support for host profiling a guest with the SPE feature
> means setting the profiling buffer owning regime to EL2. While that is in
> effect, PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
> probes SPE during this time, the driver will fail; and the guest will be
> able to determine when it is profiled. I see two options here:
This doesn't mean that EL2 owns the SPE. It only tells you that a
higher EL owns the SPE; it could just as well be EL3 (e.g.,
MDCR_EL3.NSPB == 0 or 1). So I think this is architecturally correct,
as long as we trap the guest's accesses to the other SPE registers and
inject an UNDEF.
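
Something like the usual sys_regs.c handler shape would do; a minimal
sketch, assuming the list of trapped SPE registers comes from this
series:

    static bool trap_spe_undef(struct kvm_vcpu *vcpu,
                               struct sys_reg_params *p,
                               const struct sys_reg_desc *r)
    {
            /* A higher EL owns the profiling buffer: make the
             * guest's access UNDEF. */
            kvm_inject_undefined(vcpu);
            return false;
    }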
Thanks
Suzuki