Message-ID: <5f5bcbc0-e2ef-4232-a56a-fda93c6a569e@linux.intel.com>
Date: Fri, 26 Apr 2024 15:06:48 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Mingwei Zhang <mizhang@...gle.com>
Cc: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>,
Sean Christopherson <seanjc@...gle.com>, maobibo <maobibo@...ngson.cn>,
Xiong Zhang <xiong.y.zhang@...ux.intel.com>, pbonzini@...hat.com,
peterz@...radead.org, kan.liang@...el.com, zhenyuw@...ux.intel.com,
jmattson@...gle.com, kvm@...r.kernel.org, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, zhiyuan.lv@...el.com, eranian@...gle.com,
irogers@...gle.com, samantha.alt@...el.com, like.xu.linux@...il.com,
chao.gao@...el.com
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU
state for Intel CPU
On 2024-04-26 2:41 p.m., Mingwei Zhang wrote:
>>> So this requires vcpu->pmu to hold two pieces of state information:
>>> 1) a flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info
>>> (phase #1: just a boolean; phase #2: a bitmap of occupied counters).
>>>
>>> This is a non-trivial optimization on the PMU context switch. I am
>>> thinking about splitting it into the following phases:
>>>
>>> 1) lazy PMU context switch, i.e., wait until the guest touches a PMU
>>> MSR for the 1st time.
>>> 2) fast PMU context switch on the KVM side, i.e., KVM checks the
>>> event selector values (enable/disable) and selectively switches the
>>> PMU state (reducing rd/wr MSRs).
>>> 3) dynamic PMU context switch boundary, i.e., KVM can dynamically
>>> choose the PMU context switch boundary depending on existing active
>>> host-level events.
>>> 3.1) more accurate dynamic PMU context switch, i.e., KVM checks the
>>> host-level counter positions and further reduces the number of MSR
>>> accesses.
>>> 4) guest PMU context preemption, i.e., any new host-level perf
>>> profiling can immediately preempt the guest PMU in the vcpu loop
>>> (instead of waiting for the next PMU context switch in KVM).
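
For reference, here is how I read the phase-1 lazy switch, as a
minimal standalone model (all names below are hypothetical, not actual
KVM symbols; the real switch would save/restore MSRs, not printf):

/*
 * Guest PMU state is only loaded on the first guest PMU MSR access,
 * analogous to how TIF_NEED_FPU_LOAD defers the FPU restore.
 */
#include <stdbool.h>
#include <stdio.h>

struct vcpu_pmu {
	bool guest_state_loaded;         /* flag a la TIF_NEED_FPU_LOAD */
	bool host_has_active_events;     /* phase #1: just a boolean */
	unsigned long guest_eventsel[8]; /* cached guest event selectors */
};

static void load_guest_pmu(struct vcpu_pmu *pmu)
{
	/* would save host counters and wrmsr the guest values here */
	pmu->guest_state_loaded = true;
	printf("guest PMU state loaded\n");
}

/* called on a guest write to an event-select MSR */
static void pmu_msr_write(struct vcpu_pmu *pmu, int idx, unsigned long val)
{
	if (!pmu->guest_state_loaded)
		load_guest_pmu(pmu);     /* first touch: do the real switch */
	pmu->guest_eventsel[idx] = val;
}

int main(void)
{
	struct vcpu_pmu pmu = { 0 };

	pmu_msr_write(&pmu, 0, 0x4100c0);  /* triggers the lazy load */
	pmu_msr_write(&pmu, 1, 0x41003c);  /* already loaded, cheap path */
	return 0;
}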
>> I'm not quite sure about 4.
>> The new host-level perf must be an exclude_guest event. It should not be
>> scheduled when a guest is using the PMU. Why do we want to preempt the
>> guest PMU? The current implementation in perf doesn't schedule any
>> exclude_guest events when a guest is running.
> Right. The grey area is the code within the KVM_RUN loop, but
> _outside_ of the guest. This part of the code is on the "host" side.
> However, for efficiency reasons, KVM defers the PMU context switch by
> retaining the guest PMU MSR values within the loop.
I assume you mean the optimization of moving the context switch from
the VM-exit/entry boundary to the vCPU boundary.
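
Something like the toy model below, i.e., the deferred variant keeps
the guest PMU MSR values live across the VM-exits handled inside the
KVM_RUN loop (helper names made up for illustration):

#include <stdio.h>

static void load_guest_pmu(void) { printf("load guest PMU state\n"); }
static void put_guest_pmu(void)  { printf("restore host PMU state\n"); }

/* switch at the VM-exit/entry boundary: correct but expensive */
static void run_exit_boundary(void)
{
	for (int exit = 0; exit < 3; exit++) {
		load_guest_pmu();   /* on every VM-entry */
		/* guest runs, then exits */
		put_guest_pmu();    /* on every VM-exit */
		/* the exit handler runs with host PMU state */
	}
}

/* switch at the vCPU boundary: fast, but creates the grey area */
static void run_vcpu_boundary(void)
{
	load_guest_pmu();           /* once, at KVM_RUN entry */
	for (int exit = 0; exit < 3; exit++) {
		/* guest runs, then exits; the exit handler (host code)
		 * runs with guest PMU state still loaded, so host
		 * exclude_guest events cannot observe it */
	}
	put_guest_pmu();            /* once, back to userspace */
}

int main(void)
{
	run_exit_boundary();
	run_vcpu_boundary();
	return 0;
}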
> Optimization 4
> allows the host side to immediately profile this part instead of
> waiting for the vCPU to reach the PMU context switch locations. Doing
> so will generate more accurate results.
If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
definition of exclude_guest. Without 4, it introduces some random blind
spots, right?
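
To illustrate what I have in mind for 4, a rough standalone model of
the preemption path (hypothetical names; in KVM this would presumably
be a request bit plus a vcpu kick rather than a bare atomic):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool pmu_reclaim_request;

/* perf side: called when a new host exclude_guest event is created */
static void perf_request_host_pmu(void)
{
	atomic_store(&pmu_reclaim_request, true);
	/* would also IPI/kick the vCPU out of the guest here */
}

/* KVM side: checked in the vcpu run loop before each VM-entry */
static void vcpu_check_requests(bool *guest_pmu_loaded)
{
	if (atomic_exchange(&pmu_reclaim_request, false) &&
	    *guest_pmu_loaded) {
		/* save guest PMU MSRs, restore host counters */
		*guest_pmu_loaded = false;
		printf("guest PMU preempted for host profiling\n");
	}
}

int main(void)
{
	bool guest_pmu_loaded = true;

	perf_request_host_pmu();
	vcpu_check_requests(&guest_pmu_loaded);
	return 0;
}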
>
> Do we want to preempt that? I think it depends. For regular cloud
> usage, we don't. But for other use cases where we want to prioritize
> KVM/VMM profiling over the guest vPMU, it is useful.
>
> My current opinion is that optimization 4 is something nice to have.
> But we should allow people to turn it off, just as we can choose to
> disable kernel preemption.
exclude_guest means everything but the guest. I don't see a reason why
people would want to turn it off and accept some random blind spots.
Thanks,
Kan