Message-ID: <CAFg_LQWV56zok563F8WbPEuUiJeeEhfUK3ua+tcm8ChZETWKWg@mail.gmail.com>
Date: Fri, 17 Dec 2021 17:39:24 +0800
From: Jinrong Liang <ljr.kernel@...il.com>
To: Tianqiang Xu <skyele@...u.edu.cn>
Cc: x86@...nel.org, pbonzini@...hat.com, seanjc@...gle.com,
vkuznets@...hat.com, wanpengli@...cent.com, jmattson@...gle.com,
joro@...tes.org, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, kvm@...r.kernel.org, hpa@...or.com,
jarkko@...nel.org, dave.hansen@...ux.intel.com,
linux-kernel@...r.kernel.org, linux-sgx@...r.kernel.org
Subject: Re: [PATCH 1/4] KVM: x86: Introduce .pcpu_is_idle() stub infrastructure
Hi Tianqiang,
On Fri, Dec 17, 2021 at 15:55, Tianqiang Xu <skyele@...u.edu.cn> wrote:
>
> This patch series aims to fix a performance issue caused by the current
> para-virtualized scheduling design.
>
> The current para-virtualized scheduling design uses the 'preempted' field of
> kvm_steal_time to avoid scheduling tasks on a preempted vCPU. However, when
> the pCPU where the preempted vCPU most recently ran is idle, this results in
> low CPU utilization and, consequently, poor performance.
>
> The new 'is_idle' field of kvm_steal_time precisely reveals the status of the
> pCPU where the preempted vCPU most recently ran, and thereby improves CPU
> utilization.
>
> pcpu_is_idle() is used to read the value of the 'is_idle' field of kvm_steal_time.
>
> Experiments on a VM with 16 vCPUs show that the patch can reduce execution
> time by around 50% to 80% for most PARSEC benchmarks. This also holds true
> for a VM with 112 vCPUs.
>
> Experiments on 2 VMs with 112 vCPUs each show that the patch can reduce
> execution time by around 20% to 80% for most PARSEC benchmarks.
>
> Test environment:
> -- PowerEdge R740
> -- Intel(R) Xeon(R) Gold 6238R CPU, 56 cores / 112 threads
> -- Host: 190 GB DRAM
> -- QEMU 5.0.0
> -- PARSEC 3.0 Native Inputs
> -- Host is idle during the test
> -- Host and guest kernels are both 5.14.0
>
> Results:
> 1. 1 VM, 16 vCPUs, 16 threads.
> Host Topology: sockets=2 cores=28 threads=2
> VM Topology: sockets=1 cores=16 threads=1
> Command: <path to parsec>/bin/parsecmgmt -a run -p <benchmark> -i native -n 16
> Statistics below are the real time of running each benchmark (lower is better).
>
>                        before patch    after patch    improvements
> bodytrack              52.866s         22.619s        57.21%
> fluidanimate           84.009s         38.148s        54.59%
> streamcluster          270.17s         42.726s        84.19%
> splash2x.ocean_cp      31.932s         9.539s         70.13%
> splash2x.ocean_ncp     36.063s         14.189s        60.65%
> splash2x.volrend       134.587s        21.79s         83.81%
>
> 2. 1 VM, 112 vCPUs. Some benchmarks require the number of threads to be a
> power of 2, so we run them with 64 threads and 128 threads.
> Host Topology: sockets=2 cores=28 threads=2
> VM Topology: sockets=1 cores=112 threads=1
> Command: <path to parsec>/bin/parsecmgmt -a run -p <benchmark> -i native -n <64,112,128>
> Statistics below are the real time of running each benchmark (lower is better).
>
>                                   before patch    after patch    improvements
> fluidanimate (64 threads)         124.235s        27.924s        77.52%
> fluidanimate (128 threads)        169.127s        64.541s        61.84%
> streamcluster (112 threads)       861.879s        496.66s        42.37%
> splash2x.ocean_cp (64 threads)    46.415s         18.527s        60.08%
> splash2x.ocean_cp (128 threads)   53.647s         28.929s        46.08%
> splash2x.ocean_ncp (64 threads)   47.613s         19.576s        58.89%
> splash2x.ocean_ncp (128 threads)  54.94s          29.199s        46.85%
> splash2x.volrend (112 threads)    801.384s        144.824s       81.93%
>
> 3. 2 VMs, 112 vCPUs each. Some benchmarks require the number of threads to
> be a power of 2, so we run them with 64 threads and 128 threads.
> Host Topology: sockets=2 cores=28 threads=2
> VM Topology: sockets=1 cores=112 threads=1
> Command: <path to parsec>/bin/parsecmgmt -a run -p <benchmark> -i native -n <64,112,128>
> Statistics below are the average real time of running each benchmark in the 2 VMs (lower is better).
>
>                                   before patch    after patch    improvements
> fluidanimate (64 threads)         135.2125s       49.827s        63.15%
> fluidanimate (128 threads)        178.309s        86.964s        51.23%
> splash2x.ocean_cp (64 threads)    47.4505s        20.314s        57.19%
> splash2x.ocean_cp (128 threads)   55.5645s        30.6515s       44.84%
> splash2x.ocean_ncp (64 threads)   49.9775s        23.489s        53.00%
> splash2x.ocean_ncp (128 threads)  56.847s         28.545s        49.79%
> splash2x.volrend (112 threads)    838.939s        239.632s       71.44%
>
> Due to space limits, we list only representative statistics here.
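
For other readers of the thread, this is roughly how I read the proposed
interface; a minimal sketch based only on the cover letter above. The
placement of 'is_idle' inside the existing padding, and the helper bodies
below, are my assumptions rather than the actual patch contents.

/* uapi struct, with the new byte assumed to be carved out of the
 * existing padding (the real patch may place it differently): */
struct kvm_steal_time {
        __u64 steal;
        __u32 version;
        __u32 flags;
        __u8  preempted;
        __u8  is_idle;          /* new: pCPU this vCPU last ran on is idle */
        __u8  u8_pad[2];
        __u32 pad[11];
};

/* Guest-side accessor, mirroring how __kvm_vcpu_is_preempted() reads
 * the 'preempted' byte from the per-cpu steal-time area: */
static bool pcpu_is_idle(int cpu)
{
        struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

        return !!READ_ONCE(src->is_idle);
}

/* Hypothetical scheduler-side helper: only steer tasks away from a
 * preempted vCPU when the pCPU it last ran on is busy; if that pCPU
 * is idle, the host can run the vCPU immediately, so it remains a
 * good target. */
static inline bool vcpu_should_be_avoided(int cpu)
{
        return vcpu_is_preempted(cpu) && !pcpu_is_idle(cpu);
}
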
With that understanding, I ran a performance test following the description
in this patch series, but did not see the performance improvements reported
above. I suspect this is because my kernel configuration differs
significantly from yours. Could you please provide more detailed test
information, such as the kernel configuration options that must be enabled
or disabled?
Regards,
Jinrong Liang