Message-ID: <5976b0c9-d4e7-7561-6ce0-790e2460d1ef@bytedance.com>
Date: Tue, 14 Feb 2023 16:06:26 +0000
From: Usama Arif <usama.arif@...edance.com>
To: linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
kvmarm@...ts.cs.columbia.edu, kvm@...r.kernel.org,
linux-doc@...r.kernel.org,
virtualization@...ts.linux-foundation.org, linux@...linux.org.uk,
yezengruan@...wei.com, catalin.marinas@....com, will@...nel.org,
maz@...nel.org, steven.price@....com, mark.rutland@....com,
bagasdotme@...il.com, pbonzini@...hat.com
Cc: fam.zheng@...edance.com, liangma@...ngbit.com,
punit.agrawal@...edance.com
Subject: Re: [v3 0/6] KVM: arm64: implement vcpu_is_preempted check
On 17/01/2023 10:29, Usama Arif wrote:
> This patchset adds support for vcpu_is_preempted on arm64, which allows the guest
> to check whether a vCPU has been scheduled out, which is useful to know in case it
> was holding a lock. vcpu_is_preempted is well integrated into core kernel code and
> is used to improve performance in locking (owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (available_idle_cpu, which is used in several places in kernel/sched/fair.c,
> e.g. in wake_affine to determine which CPU can run soonest).
>
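> For context, the existing core-kernel consumers look roughly like the following
> (paraphrased from include/linux/sched.h and kernel/sched/core.c, simplified here
> for illustration):
>
>   /* Skip optimistic spinning if the lock owner's vCPU is preempted. */
>   static inline bool owner_on_cpu(struct task_struct *owner)
>   {
>           return READ_ONCE(owner->on_cpu) &&
>                  !vcpu_is_preempted(task_cpu(owner));
>   }
>
>   /* An idle CPU is only a useful wakeup target if its vCPU is running. */
>   int available_idle_cpu(int cpu)
>   {
>           if (!idle_cpu(cpu))
>                   return 0;
>           if (vcpu_is_preempted(cpu))
>                   return 0;
>           return 1;
>   }
>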
> This patchset shows significant improvement on overcommitted hosts (vCPUs > pCPUS),
> as waiting for preempted vCPUs reduces performance.
>
Hi,
Just wanted to check if there are any comments on this?
Thanks,
Usama
> If merged, vcpu_is_preempted could also be used to optimize IPI performance (along
> with directed yield to the target IPI vCPU), similar to how it's done on x86
> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/)
>
> All the results in the below experiments are done on an aws r6g.metal instance
> which has 64 pCPUs.
>
> The following table shows the index results of UnixBench running on a 128 vCPU VM
> with (6.0+vcpu_is_preempted) and without (6.0 base) the patchset.
>
> TestName                               6.0 base   6.0+vcpu_is_preempted   % improvement
> Dhrystone 2 using register variables   187761     191274.7                     1.87
> Double-Precision Whetstone             96743.6    98414.4                      1.73
> Execl Throughput                       689.3      10426                     1412.55
> File Copy 1024 bufsize 2000 maxblocks  549.5      3165                       475.98
> File Copy 256 bufsize 500 maxblocks    400.7      2084.7                     420.26
> File Copy 4096 bufsize 8000 maxblocks  894.3      5003.2                     459.45
> Pipe Throughput                        76819.5    78601.5                      2.32
> Pipe-based Context Switching           3444.8     13414.5                    289.41
> Process Creation                       301.1      293.4                       -2.56
> Shell Scripts (1 concurrent)           1248.1     28300.6                   2167.49
> Shell Scripts (8 concurrent)           781.2      26222.3                   3256.67
> System Call Overhead                   3426       3729.4                       8.86
>
> System Benchmarks Index Score          3053       11534                      277.79
>
> This shows a 278% overall improvement using these patches.
>
> The biggest improvement is in the shell scripts benchmark, which forks a lot of
> processes; each fork takes an rwsem (down_write in dup_mmap), which is where the
> base kernel spends a large chunk of its time. This can be seen in one of the call
> stacks from the perf output of the shell scripts benchmark on the base kernel
> (pseudo NMI enabled for the perf numbers below):
> - 33.79% el0_svc
> - 33.43% do_el0_svc
> - 33.43% el0_svc_common.constprop.3
> - 33.30% invoke_syscall
> - 17.27% __arm64_sys_clone
> - 17.27% __do_sys_clone
> - 17.26% kernel_clone
> - 16.73% copy_process
> - 11.91% dup_mm
> - 11.82% dup_mmap
> - 9.15% down_write
> - 8.87% rwsem_down_write_slowpath
> - 8.48% osq_lock
>
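> The osq_lock hotspot is also where vcpu_is_preempted directly helps: the
> optimistic spin queue polls the previous waiter and stops spinning once that
> waiter's vCPU is reported preempted. A simplified sketch of the relevant loop
> (paraphrased from kernel/locking/osq_lock.c, not part of this series):
>
>   while (!READ_ONCE(node->locked)) {
>           /*
>            * Give up spinning if we should reschedule or if the previous
>            * node's vCPU has been scheduled out -- without this, each
>            * waiter can burn its whole timeslice behind a preempted vCPU.
>            */
>           if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
>                   goto unqueue;
>           cpu_relax();
>   }
>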
> Just under 50% of the total time in the shell script benchmarks ends up being
> spent in osq_lock in the base kernel:
> Children Self Command Shared Object Symbol
> 17.19% 10.71% sh [kernel.kallsyms] [k] osq_lock
> 6.17% 4.04% sort [kernel.kallsyms] [k] osq_lock
> 4.20% 2.60% multi. [kernel.kallsyms] [k] osq_lock
> 3.77% 2.47% grep [kernel.kallsyms] [k] osq_lock
> 3.50% 2.24% expr [kernel.kallsyms] [k] osq_lock
> 3.41% 2.23% od [kernel.kallsyms] [k] osq_lock
> 3.36% 2.15% rm [kernel.kallsyms] [k] osq_lock
> 3.28% 2.12% tee [kernel.kallsyms] [k] osq_lock
> 3.16% 2.02% wc [kernel.kallsyms] [k] osq_lock
> 0.21% 0.13% looper [kernel.kallsyms] [k] osq_lock
> 0.01% 0.00% Run [kernel.kallsyms] [k] osq_lock
>
> and this comes down to less than 1% in total with the 6.0+vcpu_is_preempted kernel:
> Children Self Command Shared Object Symbol
> 0.26% 0.21% sh [kernel.kallsyms] [k] osq_lock
> 0.10% 0.08% multi. [kernel.kallsyms] [k] osq_lock
> 0.04% 0.04% sort [kernel.kallsyms] [k] osq_lock
> 0.02% 0.01% grep [kernel.kallsyms] [k] osq_lock
> 0.02% 0.02% od [kernel.kallsyms] [k] osq_lock
> 0.01% 0.01% tee [kernel.kallsyms] [k] osq_lock
> 0.01% 0.00% expr [kernel.kallsyms] [k] osq_lock
> 0.01% 0.01% looper [kernel.kallsyms] [k] osq_lock
> 0.00% 0.00% wc [kernel.kallsyms] [k] osq_lock
> 0.00% 0.00% rm [kernel.kallsyms] [k] osq_lock
>
> To make sure there is no change in performance when vCPUs < pCPUs, UnixBench
> was run on a 32 vCPU VM. The kernel with vcpu_is_preempted implemented
> performed 0.9% better overall than the base kernel, and the individual benchmarks
> were within +/-2% of 6.0 base.
> Hence the patches have no negative effect when vCPUs < pCPUs.
>
> The respective QEMU change to test this is at
> https://github.com/uarif1/qemu/commit/2da2c2927ae8de8f03f439804a0dad9cf68501b6.
>
> Looking forward to your response!
> Thanks,
> Usama
> ---
> v2->v3
> - Updated the patchset from 6.0 to 6.2-rc3
> - Made pv_lock_init an early_initcall
> - Improved documentation
> - Changed pvlock_vcpu_state to aligned struct
> - Minor improvements
>
> RFC->v2
> - Fixed table and code referencing in pvlock documentation
> - Switched to using a single hypercall, similar to ptp_kvm, and made the check
>   for has_kvm_pvlock simpler
>
> Usama Arif (6):
> KVM: arm64: Document PV-lock interface
> KVM: arm64: Add SMCCC paravirtualised lock calls
> KVM: arm64: Support pvlock preempted via shared structure
> KVM: arm64: Provide VCPU attributes for PV lock
> KVM: arm64: Support the VCPU preemption check
> KVM: selftests: add tests for PV time specific hypercall
>
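> For reviewers skimming the series, here is a minimal sketch of the guest-side
> check the patches wire up, assuming a per-vCPU shared structure whose
> 'preempted' field the host sets while the vCPU is scheduled out (names and
> layout here are illustrative only; the patches below are authoritative):
>
>   struct pvlock_vcpu_state {
>           __le64 preempted;   /* non-zero while the vCPU is scheduled out */
>   } __aligned(64);            /* v3: aligned struct rather than packed+padded */
>
>   static DEFINE_PER_CPU(struct pvlock_vcpu_state, pvlock_vcpu_region);
>
>   static bool pv_vcpu_is_preempted(int cpu)
>   {
>           return !!le64_to_cpu(READ_ONCE(per_cpu(pvlock_vcpu_region,
>                                                  cpu).preempted));
>   }
>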
> Documentation/virt/kvm/arm/hypercalls.rst | 3 +
> Documentation/virt/kvm/arm/index.rst | 1 +
> Documentation/virt/kvm/arm/pvlock.rst | 54 +++++++++
> Documentation/virt/kvm/devices/vcpu.rst | 25 ++++
> arch/arm64/include/asm/kvm_host.h | 25 ++++
> arch/arm64/include/asm/paravirt.h | 2 +
> arch/arm64/include/asm/pvlock-abi.h | 15 +++
> arch/arm64/include/asm/spinlock.h | 16 ++-
> arch/arm64/include/uapi/asm/kvm.h | 3 +
> arch/arm64/kernel/paravirt.c | 113 ++++++++++++++++++
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/arm.c | 8 ++
> arch/arm64/kvm/guest.c | 9 ++
> arch/arm64/kvm/hypercalls.c | 8 ++
> arch/arm64/kvm/pvlock.c | 100 ++++++++++++++++
> include/linux/arm-smccc.h | 8 ++
> include/uapi/linux/kvm.h | 2 +
> tools/arch/arm64/include/uapi/asm/kvm.h | 1 +
> tools/include/linux/arm-smccc.h | 8 ++
> .../selftests/kvm/aarch64/hypercalls.c | 2 +
> 20 files changed, 403 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
> create mode 100644 arch/arm64/include/asm/pvlock-abi.h
> create mode 100644 arch/arm64/kvm/pvlock.c
>