Date: Fri, 26 Apr 2024 09:50:50 +0800
From: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>,
 Mingwei Zhang <mizhang@...gle.com>
Cc: Sean Christopherson <seanjc@...gle.com>, maobibo <maobibo@...ngson.cn>,
 Xiong Zhang <xiong.y.zhang@...ux.intel.com>, pbonzini@...hat.com,
 peterz@...radead.org, kan.liang@...el.com, zhenyuw@...ux.intel.com,
 jmattson@...gle.com, kvm@...r.kernel.org, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, zhiyuan.lv@...el.com, eranian@...gle.com,
 irogers@...gle.com, samantha.alt@...el.com, like.xu.linux@...il.com,
 chao.gao@...el.com
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU
 state for Intel CPU


On 4/26/2024 4:43 AM, Liang, Kan wrote:
>
> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>>
>>>
>>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>>>>>
>>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>>> Maybe (just maybe) it is possible to do the PMU context switch at the
>>>>>>>>>> vcpu boundary normally, but do it at the VM-Enter/Exit boundary when
>>>>>>>>>> the host is profiling the KVM kernel module. So, dynamically adjusting
>>>>>>>>>> the PMU context switch location could be an option.
>>>>>>>>> If there are two VMs that both have the PMU enabled but the host PMU
>>>>>>>>> is not in use, the PMU context switch should be done in the vcpu
>>>>>>>>> thread sched-out path.
>>>>>>>>>
>>>>>>>>> If the host PMU is used as well, we can choose whether the PMU switch
>>>>>>>>> should be done in the VM-exit path or the vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>> The host PMU is always enabled, i.e., Linux currently does not support
>>>>>>>> the KVM PMU running standalone. I guess what you mean is that there are
>>>>>>>> no active perf_events on the host side. Allowing the PMU context switch
>>>>>>>> to drift from the VM-enter/exit boundary to the vcpu loop boundary by
>>>>>>>> checking host-side events might be a good option. We can keep
>>>>>>>> discussing it, but I won't propose that in v2.
>>>>>>> I doubt whether this deferring is really doable. It still makes the host
>>>>>>> lose most of its capability to profile KVM. Per my understanding, most of
>>>>>>> the KVM overhead happens in the vcpu loop, more precisely in VM-exit
>>>>>>> handling. We have no idea when the host wants to create a perf event to
>>>>>>> profile KVM; it could happen at any time.
>>>>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>>>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>>>>
>>>>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>>>>> KVM_RUN loop, i.e. provides optimal behavior for the guest.  But if a host perf
>>>>>> event exists (or comes along), KVM context switches PMU state at VM-Enter/VM-Exit,
>>>>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>>>>> for the guest while host perf events are active.
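For illustration, a rough sketch of the policy described above; every
identifier here is made up, this is not the actual KVM/perf API:

        /*
         * Sketch only.  Called right after VM-exit: if the host has
         * active exclude_guest events, switch PMU state now; otherwise
         * keep guest PMU state loaded until the vcpu loop is left.
         */
        static void kvm_pmu_maybe_switch_on_exit(struct kvm_vcpu *vcpu)
        {
                if (!perf_host_events_pending())   /* hypothetical query */
                        return;                    /* guest keeps the PMU */

                kvm_pmu_save_guest_state(vcpu);    /* hypothetical */
                kvm_pmu_load_host_state(vcpu);     /* hypothetical */
        }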
>>>>> I see. So KVM needs to provide a callback which is called from the IPI
>>>>> handler. The callback has to switch the PMU state before perf actually
>>>>> enables the host event and touches the PMU MSRs. And only perf events
>>>>> with the exclude_guest attribute are allowed to be created on the
>>>>> host. Thanks.
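For illustration, the kind of callback being discussed might have
roughly this shape (purely a sketch; the names are made up and this is
not the existing perf_guest_info_callbacks interface):

        /* Invoked from perf's IPI handler before it touches any PMU
         * MSRs, so KVM can save guest PMU state and restore host
         * state first. */
        struct pmu_mediated_ops {
                void (*evict_guest_pmu)(void);
        };

        /* made-up registration hook */
        int perf_register_mediated_ops(struct pmu_mediated_ops *ops);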
>>>> Do we really need a KVM callback? I think that is one option.
>>>>
>>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>>> deferring the context switch to the end of the vPMU loop.
>>>>
>>>> Detecting whether there are "host perf events" would be interesting. The
>>>> "host perf events" refer to the perf_events on the host that are
>>>> active, assigned HW counters, and saved when context switching to the
>>>> guest PMU. I think getting those events could be done by fetching the
>>>> bitmaps in cpuc.
>>> The cpuc is an arch-specific structure. I don't think it can be accessed
>>> from the generic code. You would probably have to implement arch-specific
>>> functions to fetch the bitmaps. It's probably not worth it.
>>>
>>> You may check the pinned_groups and flexible_groups to understand if
>>> there are host perf events which may be scheduled at VM-exit. But they
>>> will not tell you the idx of the counters, which can only be known when
>>> the host event is actually scheduled.
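For illustration, that check could be as small as this (a sketch
assuming perf_event_groups still keeps its events in an rb_root named
'tree'; where such a check would live is an open question):

        /* Sketch: are there host events that are candidates to be
         * scheduled at VM-exit?  Says nothing about which counters
         * they will eventually occupy. */
        static bool host_has_candidate_events(struct perf_event_context *ctx)
        {
                return !RB_EMPTY_ROOT(&ctx->pinned_groups.tree) ||
                       !RB_EMPTY_ROOT(&ctx->flexible_groups.tree);
        }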
>>>
>>>> I have to look into the details. But
>>>> at the time of VMEXIT, KVM should already have that information, so it
>>>> can immediately decide whether to do the PMU context switch or not.
>>>>
>>>> Oh, but if host-level profiling starts while control is executing within
>>>> the run loop, say 'perf record -a ...', it will generate an IPI to all
>>>> CPUs. Maybe that's when we need a callback so the KVM guest PMU context
>>>> gets preempted for the host-level profiling. Gah..
>>>>
>>>> hmm, not a fan of that. That means the host can poke the guest PMU
>>>> context at any time and cause higher overhead. But I admit it is much
>>>> better than the current approach.
>>>>
>>>> The only concern is that any command like 'perf record/stat -a' fired
>>>> from some dark corner of the host can preempt the guest PMUs of _all_
>>>> running VMs. So, to alleviate that, maybe a module parameter that
>>>> disables this "preemption" is possible? That should fit scenarios where
>>>> we don't want the guest PMU to be preempted outside of the vCPU loop.
>>>>
>>> It should not happen. In the current implementation, perf rejects the
>>> creation of any !exclude_guest system-wide event while a guest with the
>>> vPMU is running.
>>> However, it's possible to create an exclude_guest system-wide event at
>>> any time. KVM cannot use information from VM-entry to decide whether
>>> there will be active perf events at VM-exit.
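For illustration, the rejection described above amounts to something
like the following sketch ('is_system_wide_event' and
'nr_mediated_pmu_vms' are made-up names; only attr.exclude_guest is a
real perf_event_attr bit):

        /* In the event-creation path: refuse !exclude_guest
         * system-wide events while any vPMU-enabled guest runs. */
        if (is_system_wide_event(event) && !event->attr.exclude_guest &&
            atomic_read(&nr_mediated_pmu_vms))
                return -EOPNOTSUPP;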
>> Hmm, why not? If there is any exclude_guest system-wide event,
>> perf_guest_enter() can return something to tell KVM "hey, some active
>> host events were swapped out; they were originally in counters #2 and
>> #3". If so, at the time perf_guest_enter() returns, KVM can ack that
>> and keep it in its pmu data structure.
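For illustration, that interface could have roughly this shape
(entirely hypothetical; perf_guest_enter() returns nothing like this
today, and 'host_counter_mask' is a made-up field):

        /* Hypothetical: perf swaps out the active exclude_guest host
         * events and reports which HW counters they occupied. */
        u64 host_mask = perf_guest_enter(vcpu);

        vcpu->pmu.host_counter_mask = host_mask;

        /* Later, right after VM-exit: */
        if (vcpu->pmu.host_counter_mask)
                kvm_pmu_switch_to_host(vcpu);  /* host events pending */
        /* else: defer the switch to the vcpu-loop boundary */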
> I think it's possible that someone creates a !exclude_guest event after
> perf_guest_enter(). The stale information is then saved in KVM. Perf
> will schedule the event at the next perf_guest_exit(), and KVM will not
> know about it.
>
>> Now, when context switching back to the host right at VM-exit, KVM
>> will check this data and see whether the host perf context has anything
>> active (of course, they are all exclude_guest events). If not, defer the
>> context switch to the vcpu boundary. Otherwise, do the proper PMU context
>> switch while respecting the occupied counter positions on the host
>> side, i.e., avoid doubling the work on the KVM side.
>>
>> Kan, any suggestion on the above approach?
> I think we can only know the accurate event list at perf_guest_exit().
> You may check the pinned_groups and flexible_groups, which tell you
> whether there are candidate events.
>
>> Totally understand that
>> there might be some difficulty, since the perf subsystem works in several
>> layers and obviously fetching the low-level mapping is arch-specific work.
>> If that is difficult, we can split the work into two phases: 1) phase
>> #1, just ask perf to tell KVM whether there are active exclude_guest
>> events swapped out; 2) phase #2, ask perf for their (low-level) counter
>> indices.
>>
> If you want an accurate counter mask, changes in the arch-specific
> code are required. Two phases sound good to me.
>
> Besides the perf changes, I think KVM should also track which counters
> need to be saved/restored. That information can be obtained from the
> EventSel interception.

Yes, that's another optimization from the guest's point of view. It's on
our to-do list.
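For the record, a rough sketch of what that tracking could look like
(assuming EventSel writes are intercepted; 'live_counters' is a made-up
field, while ARCH_PERFMON_EVENTSEL_ENABLE is the real enable bit):

        /* On an intercepted write to IA32_PERFEVTSELx: remember whether
         * counter 'idx' is enabled, so save/restore can skip idle
         * counters. */
        static void kvm_pmu_track_eventsel(struct kvm_pmu *pmu, int idx,
                                           u64 data)
        {
                if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
                        __set_bit(idx, pmu->live_counters);
                else
                        __clear_bit(idx, pmu->live_counters);
        }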


>
> Thanks,
> Kan
>>> The perf_guest_exit() will reload the host state. It's impossible to
>>> save the guest state after that. We may need a KVM callback, so that
>>> perf can tell KVM whether to save the guest state before perf reloads
>>> the host state.
>>>
>>> Thanks,
>>> Kan
>>>>>
>>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
