Message-ID: <86d1f6d1-197a-ecd9-3349-a64da9ea9789@loongson.cn>
Date: Tue, 23 Apr 2024 14:08:49 +0800
From: maobibo <maobibo@...ngson.cn>
To: Mingwei Zhang <mizhang@...gle.com>
Cc: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>,
 Sean Christopherson <seanjc@...gle.com>,
 Xiong Zhang <xiong.y.zhang@...ux.intel.com>, pbonzini@...hat.com,
 peterz@...radead.org, kan.liang@...el.com, zhenyuw@...ux.intel.com,
 jmattson@...gle.com, kvm@...r.kernel.org, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, zhiyuan.lv@...el.com, eranian@...gle.com,
 irogers@...gle.com, samantha.alt@...el.com, like.xu.linux@...il.com,
 chao.gao@...el.com
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU
 state for Intel CPU



On 2024/4/23 12:23 PM, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@...ngson.cn> wrote:
>>
>>
>>
>> On 2024/4/23 11:13 AM, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 10:44 AM, Mi, Dapeng wrote:
>>>>>
>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 1:01 AM, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>> On 2024/4/16 6:45 AM, Sean Christopherson wrote:
>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>> <seanjc@...gle.com> wrote:
>>>>>>>>>>> One of my biggest complaints with the current vPMU code is
>>>>>>>>>>> that the roles and responsibilities between KVM and perf are
>>>>>>>>>>> poorly defined, which leads to suboptimal and hard to
>>>>>>>>>>> maintain code.
>>>>>>>>>>>
>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>> _would_ leak guest state to userspace processes that have
>>>>>>>>>>> RDPMC permissions, as the PMCs might not be dirty from perf's
>>>>>>>>>>> perspective (see perf_clear_dirty_counters()).
>>>>>>>>>>>
>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>> doing so makes the overall code brittle because it's not
>>>>>>>>>>> clear whether KVM _needs_ to clear PMCs, or if KVM is just
>>>>>>>>>>> being paranoid.
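
For reference, this is roughly what that lazy clearing looks like; a
condensed paraphrase of perf_clear_dirty_counters() from
arch/x86/events/core.c, with the fixed-counter handling trimmed:

void perf_clear_dirty_counters(void)
{
        struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
        int i;

        /* Counters still assigned to live events are not "dirty". */
        for (i = 0; i < cpuc->n_events; i++)
                __clear_bit(cpuc->assign[i], cpuc->dirty);

        /*
         * Only counters that perf itself marked dirty get zeroed, so
         * values written behind perf's back (e.g. by a guest) are never
         * cleared and stay readable via RDPMC.
         */
        for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX)
                wrmsrl(x86_pmu_event_addr(i), 0);

        bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
}
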
>>>>>>>>>>
>>>>>>>>>> So once this rolls out, perf and the vPMU are both direct
>>>>>>>>>> clients of the PMU hardware.
>>>>>>>>>
>>>>>>>>> I don't think this is a statement we want to make, as it opens
>>>>>>>>> a discussion that we won't win.  Nor do I think it's one we
>>>>>>>>> *need* to make.  KVM doesn't need to be on equal footing with
>>>>>>>>> perf in terms of owning/managing PMU hardware, KVM just needs a
>>>>>>>>> few APIs to allow faithfully and accurately virtualizing a
>>>>>>>>> guest PMU.
>>>>>>>>>
>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>> them. Currently, there is no such deal, but I believe we could
>>>>>>>>>> have one via future discussion.
>>>>>>>>>
>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>> before this code is merged.  It doesn't need to be anything
>>>>>>>>> fancy, e.g. perf can still pave over PMCs it doesn't
>>>>>>>>> immediately load, as opposed to using cpu_hw_events.dirty to
>>>>>>>>> lazily do the clearing.  But perf and KVM need to work together
>>>>>>>>> from the get go, i.e. I don't want KVM doing something without
>>>>>>>>> regard to what perf does, and vice versa.
>>>>>>>>>
>>>>>>>> There is a similar issue with the LoongArch vPMU, where the VM
>>>>>>>> can directly access PMU hardware and the PMU hardware is shared
>>>>>>>> between guest and host. Besides the context switch, there are
>>>>>>>> other places where the perf core will access the PMU hardware,
>>>>>>>> such as the tick timer/hrtimer/IPI function calls, and KVM can
>>>>>>>> only intercept the context switch.
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>>    1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>
>>>>>>>    2) If so, can KVM grant partial access to the PMU, or is it
>>>>>>> all or nothing?
>>>>>>>
>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>> LoongArch *requires* mediated/passthrough support in order to
>>>>>>> virtualize its PMU.
>>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thanks for your quick response.
>>>>>>
>>>>>> Yes, KVM can prevent the guest from accessing the PMU and can
>>>>>> grant partial or full access to the PMU. The only catch is that
>>>>>> once a PMU event is granted to the VM, the host cannot access
>>>>>> that PMU event anymore; a PMU event switch is needed if the host
>>>>>> wants it back.
>>>>>
>>>>> A PMU event is a software entity which won't be shared. Did you
>>>>> mean that if a PMU HW counter is granted to the VM, then the host
>>>>> can't access that PMU HW counter, right?
>>>> Yes, if the PMU HW counter/control is granted to the VM. The value
>>>> then comes from the guest and is not meaningful to the host. The
>>>> host perf core does not know that the counter has been granted to
>>>> the VM; the host still thinks that it owns the PMU.
>>>
>>> That's one issue this patchset tries to solve. The new mediated x86
>>> vPMU framework doesn't allow the host and guest to own the PMU HW
>>> resource simultaneously. Only when there is no !exclude_guest event
>>> on the host is the guest allowed to exclusively own the PMU HW
>>> resource.
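
As I understand that rule, the gate looks something like this sketch
(all names here are illustrative, not the actual patchset code):

#include <linux/percpu.h>

/*
 * Hypothetical per-CPU count of host events created without
 * attr.exclude_guest, i.e. events that want to count while the guest
 * runs.
 */
static DEFINE_PER_CPU(int, nr_include_guest_events);

static bool can_grant_pmu_to_guest(void)
{
        /*
         * Any !exclude_guest host event conflicts with handing the
         * whole PMU to the guest, so grant exclusive ownership only
         * when none exist on this CPU.
         */
        return this_cpu_read(nr_include_guest_events) == 0;
}
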
>>>
>>>
>>>>
>>>> Just like the FPU registers, the PMU is shared by the VM and the
>>>> host at different times and is lazily switched. But if an IPI or
>>>> timer interrupt uses the FPU registers on the host, there is the
>>>> same issue.
>>>
>>> I didn't fully get your point. When an IPI or timer interrupt
>>> arrives, a VM-exit is triggered to make the CPU trap into the host
>>> first, and then the host
>> yes, it is.
> 
> This is correct. And this is one of the points that we debated
> internally: whether we should do the PMU context switch at the vcpu
> loop boundary or at the VM Enter/Exit boundary. A (host-level) timer
> interrupt can force a VM Exit, which I think happens every 4 ms or
> 1 ms, depending on configuration.
> 
> One of the key reasons we currently propose this is that it is the
> same boundary as the legacy PMU, i.e., it would be simple to propose
> from the perf subsystem perspective.
> 
> Performance-wise, doing the PMU context switch at the vcpu boundary
> would be way better in general. But the downside is that the perf
> subsystem loses the ability to profile the majority of the KVM code
> (functions) when the guest PMU is enabled.
> 
>>
>>> interrupt handler is called. Or are you complaining about the
>>> execution order of switching the guest PMU MSRs relative to these
>>> interrupt handlers?
>> In our vPMU implementation, it is OK if the vPMU is switched in the
>> VM-exit path; however, there is a problem if the vPMU is switched in
>> the vcpu thread sched-out/sched-in path, since the IPI/timer IRQ
>> handlers access PMU registers in host mode.
> 
> Oh, the IPI/timer IRQ handlers will access PMU registers? I thought
> only the host-level NMI handler would access the PMU MSRs, since the
> PMI is registered under NMI.
> 
> In that case, you should disable IRQs during the vcpu context switch.
> For NMI, we prevent its handler from accessing the PMU registers. In
> particular, we use a per-CPU variable to guard that. So, the
> host-level PMI handler for the perf subsystem will check the variable
> before proceeding.
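
A minimal sketch of the per-CPU guard you describe, as I understand it
(the variable and function names are mine, not the patchset's):

#include <linux/percpu.h>

static DEFINE_PER_CPU(bool, pmu_in_guest_use);

/* KVM flips this around the window in which the guest owns the PMU. */
void kvm_pmu_set_guest_owned(bool owned)
{
        this_cpu_write(pmu_in_guest_use, owned);
}

/* The host PMI/NMI path checks the guard before touching any PMU MSR. */
bool host_pmi_may_touch_pmu(void)
{
        return !this_cpu_read(pmu_in_guest_use);
}

That guard covers the NMI path; my concern is with the paths it does
not cover.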

The perf core will access the PMU hardware in the tick
timer/hrtimer/IPI function-call paths. For example,
perf_event_task_tick() is called from the tick timer, and there are
event_function_call(event, __perf_event_xxx, &value) call sites in
kernel/events/core.c; see the sketch below.
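
A condensed paraphrase of one such call site (the real
perf_event_period() in kernel/events/core.c has more checks; treat
this as a sketch):

/*
 * event_function_call() sends a cross-CPU function call (an IPI) to
 * the CPU where the event is active; __perf_event_period() then stops
 * and restarts the event, reprogramming the hardware counter. None of
 * this passes through KVM, so it can run while the counters are
 * granted to a guest.
 */
static int perf_event_period(struct perf_event *event, u64 value)
{
        if (!is_sampling_event(event))
                return -EINVAL;

        event_function_call(event, __perf_event_period, &value);

        return 0;
}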

https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4


> 
>>
>> In general it will be better if the switch is done at vcpu thread
>> sched-out/sched-in, unless there is a requirement to profile the KVM
>> hypervisor. Even if there is such a requirement, it is only one
>> option. In most conditions, it is better if the time spent in the VM
>> context exit path is small.
>>
> Performance-wise, agreed, but there will be debate about the loss of
> perf functionality at the host level.
> 
> Maybe (just maybe) it is possible to do the PMU context switch at the
> vcpu boundary normally, but to do it at the VM Enter/Exit boundary
> when the host is profiling the KVM kernel module. So, dynamically
> adjusting the PMU context switch location could be an option.
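
A rough sketch of that dynamic-boundary idea (every name here is
hypothetical, not from the patchset):

/* Set while host perf is profiling the KVM module (hypothetical). */
static bool host_profiling_kvm;

/* Hypothetical helper that reloads the host PMU state. */
static void kvm_pmu_restore_host_context(struct kvm_vcpu *vcpu);

static void kvm_pmu_after_vmexit(struct kvm_vcpu *vcpu)
{
        /*
         * Normally defer the expensive PMU switch to vcpu_put()
         * (sched-out), but switch eagerly at every VM-exit while the
         * host wants to profile KVM itself, so that host events can
         * observe KVM code.
         */
        if (READ_ONCE(host_profiling_kvm))
                kvm_pmu_restore_host_context(vcpu);
}
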
> 
>>>
>>>
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Can we add a callback handler to struct kvm_guest_cbs, just
>>>>>>>> like this:
>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
>>>>>>>>           .state                  = kvm_guest_state,
>>>>>>>>           .get_ip                 = kvm_guest_get_ip,
>>>>>>>>           .handle_intel_pt_intr   = NULL,
>>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>>    };
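
Elaborating on my own suggestion above, a rough sketch of what such a
handler might do (only the callback name comes from the diff; every
other name is hypothetical):

/* Hypothetical helper that spills the guest counters. */
static void kvm_save_guest_pmu_state(struct kvm_vcpu *vcpu);

static void kvm_guest_lose_pmu(void)
{
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

        if (!vcpu || !vcpu->arch.pmu_granted)   /* hypothetical flag */
                return;

        /*
         * Save the guest counters and revoke the grant, so that the
         * next guest PMU access traps and can be handled lazily.
         */
        kvm_save_guest_pmu_state(vcpu);
        vcpu->arch.pmu_granted = false;
}
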
>>>>>>>>
>>>>>>>> By the way, I do not know whether the callback handler should
>>>>>>>> be triggered in the perf core or in the specific PMU hardware
>>>>>>>> driver. In the ARM case it is triggered in the PMU hardware
>>>>>>>> driver, e.g. in the function kvm_vcpu_pmu_resync_el0, but I
>>>>>>>> think it would be better if it were done in the perf core.
>>>>>>>
>>>>>>> I don't think we want to take the approach of perf and KVM
>>>>>>> guests "fighting" over the PMU.  That's effectively what we have
>>>>>>> today, and it's a mess for KVM because it's impossible to
>>>>>>> provide consistent, deterministic behavior for the guest.  And
>>>>>>> it's just as messy for perf, which ends up having weird,
>>>>>>> cumbersome flows that exist purely to try to play nice with KVM.
>>>>>> With the existing perf core code, the PMU hardware may be
>>>>>> accessed by the host from the tick timer interrupt or the IPI
>>>>>> function-call interrupt while the VM is running and the PMU has
>>>>>> already been granted to the guest. KVM cannot intercept host
>>>>>> IPI/timer interrupts, so there is no PMU context switch and there
>>>>>> will be a problem.
>>>>>>
>>>>>> Regards
>>>>>> Bibo Mao
>>>>>>
>>>>
>>

