linux-kernel - Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAL715WKh8VBJ-O50oqSnCqKPQo4Bor_aMnRZeS_TzJP3ja8-YQ@mail.gmail.com>
Date: Wed, 24 Apr 2024 21:24:12 -0700
From: Mingwei Zhang <mizhang@...gle.com>
To: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
Cc: Sean Christopherson <seanjc@...gle.com>, maobibo <maobibo@...ngson.cn>, 
	Xiong Zhang <xiong.y.zhang@...ux.intel.com>, pbonzini@...hat.com, peterz@...radead.org, 
	kan.liang@...el.com, zhenyuw@...ux.intel.com, jmattson@...gle.com, 
	kvm@...r.kernel.org, linux-perf-users@...r.kernel.org, 
	linux-kernel@...r.kernel.org, zhiyuan.lv@...el.com, eranian@...gle.com, 
	irogers@...gle.com, samantha.alt@...el.com, like.xu.linux@...il.com, 
	chao.gao@...el.com
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU
 state for Intel CPU

On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>
>
> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> > On Wed, Apr 24, 2024, Dapeng Mi wrote:
> >> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> >>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> >>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
> >>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
> >>>>> switch location could be an option.
> >>>> If there are two VMs with pmu enabled both, however host PMU is not
> >>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
> >>>>
> >>>> If host pmu is used also, we can choose whether PMU switch should be
> >>>> done in vm exit path or vcpu thread sched-out path.
> >>>>
> >>> host PMU is always enabled, ie., Linux currently does not support KVM
> >>> PMU running standalone. I guess what you mean is there are no active
> >>> perf_events on the host side. Allowing a PMU context switch drifting
> >>> from vm-enter/exit boundary to vcpu loop boundary by checking host
> >>> side events might be a good option. We can keep the discussion, but I
> >>> won't propose that in v2.
> >> I suspect if it's really doable to do this deferring. This still makes host
> >> lose the most of capability to profile KVM. Per my understanding, most of
> >> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
> >> We have no idea when host want to create perf event to profile KVM, it could
> >> be at any time.
> > No, the idea is that KVM will load host PMU state asap, but only when host PMU
> > state actually needs to be loaded, i.e. only when there are relevant host events.
> >
> > If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> > KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
> > events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
> > i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> > for the guest while host perf events are active.
>
> I see. So KVM needs to provide a callback which needs to be called in
> the IPI handler. The KVM callback needs to be called to switch PMU state
> before perf really enabling host event and touching PMU MSRs. And only
> the perf event with exclude_guest attribute is allowed to create on
> host. Thanks.

Do we really need a KVM callback? I think that is one option.

Immediately after VMEXIT, KVM will check whether there are "host perf
events". If so, do the PMU context switch immediately. Otherwise, keep
deferring the context switch to the end of vPMU loop.

Detecting if there are "host perf events" would be interesting. The
"host perf events" refer to the perf_events on the host that are
active and assigned with HW counters and that are saved when context
switching to the guest PMU. I think getting those events could be done
by fetching the bitmaps in cpuc. I have to look into the details. But
at the time of VMEXIT, kvm should already have that information, so it
can immediately decide whether to do the PMU context switch or not.

oh, but when the control is executing within the run loop, a
host-level profiling starts, say 'perf record -a ...', it will
generate an IPI to all CPUs. Maybe that's when we need a callback so
the KVM guest PMU context gets preempted for the host-level profiling.
Gah..

hmm, not a fan of that. That means the host can poke the guest PMU
context at any time and cause higher overhead. But I admit it is much
better than the current approach.

The only thing is that: any command like 'perf record/stat -a' shot in
dark corners of the host can preempt guest PMUs of _all_ running VMs.
So, to alleviate that, maybe a module parameter that disables this
"preemption" is possible? This should fit scenarios where we don't
want guest PMU to be preempted outside of the vCPU loop?

Thanks. Regards
-Mingwei

-Mingwei

>
>
> >
> > My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom