[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <46a889c4-b104-487e-be3e-7f4b57c0b339@linux.intel.com>
Date: Tue, 23 Apr 2024 16:24:45 +0800
From: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
To: Mingwei Zhang <mizhang@...gle.com>
Cc: maobibo <maobibo@...ngson.cn>, Sean Christopherson <seanjc@...gle.com>,
Xiong Zhang <xiong.y.zhang@...ux.intel.com>, pbonzini@...hat.com,
peterz@...radead.org, kan.liang@...el.com, zhenyuw@...ux.intel.com,
jmattson@...gle.com, kvm@...r.kernel.org, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, zhiyuan.lv@...el.com, eranian@...gle.com,
irogers@...gle.com, samantha.alt@...el.com, like.xu.linux@...il.com,
chao.gao@...el.com
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU
state for Intel CPU
On 4/23/2024 3:10 PM, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 11:45 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>>
>> On 4/23/2024 2:08 PM, maobibo wrote:
>>>
>>> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>>>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <maobibo@...ngson.cn> wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>>>
>>>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>>>
>>>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>>>> <seanjc@...gle.com> wrote:
>>>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>>>> the roles and
>>>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>>>>>> _needs_
>>>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>>>> PMU HW.
>>>>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>>>>> discussion
>>>>>>>>>>>> that we won't win. Nor do I think it's one we *need* to make.
>>>>>>>>>>>> KVM doesn't need
>>>>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>>>>> hardware, KVM
>>>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>>>
>>>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>>>>> them.
>>>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>>>> one via
>>>>>>>>>>>>> future discussion.
>>>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>>>> before this code
>>>>>>>>>>>> is merged. It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>>>>> still pave over
>>>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>>>> do the clearing. But perf and KVM need to work together from
>>>>>>>>>>>> the
>>>>>>>>>>>> get go, ie. I
>>>>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>>>>> and vice versa.
>>>>>>>>>>>>
>>>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>>>>>> pmu
>>>>>>>>>>> hardware
>>>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>>>> there are
>>>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>>>> context switch.
>>>>>>>>>> Two questions:
>>>>>>>>>>
>>>>>>>>>> 1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>>>
>>>>>>>>>> 2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>>>>> or nothing?
>>>>>>>>>>
>>>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>>>> LoongArch *requires*
>>>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> Thank for your quick response.
>>>>>>>>>
>>>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>>>>> event switch if host want to.
>>>>>>>> PMU event is a software entity which won't be shared. did you
>>>>>>>> mean if
>>>>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>>>>> counter, right?
>>>>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>>>>> guest, and is not meaningful for host. Host pmu core does not know
>>>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>>> That's one issue this patchset tries to solve. Current new mediated
>>>>>> x86
>>>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>>>
>>>>>>
>>>>>>> Just like FPU register, it is shared by VM and host during different
>>>>>>> time and it is lately switched. But if IPI or timer interrupt uses
>>>>>>> FPU
>>>>>>> register on host, there will be the same issue.
>>>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>>>> VM-exit is triggered to make CPU traps into host first and then the
>>>>>> host
>>>>> yes, it is.
>>>> This is correct. And this is one of the points that we had debated
>>>> internally whether we should do PMU context switch at vcpu loop
>>>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>>>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>>>> configuration.
>>>>
>>>> One of the key reasons we currently propose this is because it is the
>>>> same boundary as the legacy PMU, i.e., it would be simple to propose
>>>> from the perf subsystem perspective.
>>>>
>>>> Performance wise, doing PMU context switch at vcpu boundary would be
>>>> way better in general. But the downside is that perf sub-system lose
>>>> the capability to profile majority of the KVM code (functions) when
>>>> guest PMU is enabled.
>>>>
>>>>>> interrupt handler is called. Or are you complaining the executing
>>>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>>>> path, however there is problem if vPMU is switched during vcpu thread
>>>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>>>> register in host mode.
>>>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>>>> only the host-level NMI handler will access the PMU MSRs since PMI is
>>>> registered under NMI.
>>>>
>>>> In that case, you should disable IRQ during vcpu context switch. For
>>>> NMI, we prevent its handler from accessing the PMU registers. In
>>>> particular, we use a per-cpu variable to guard that. So, the
>>>> host-level PMI handler for perf sub-system will check the variable
>>>> before proceeding.
>>> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
>>> such as function perf_event_task_tick() is called in tick timer, there
>>> are event_function_call(event, __perf_event_xxx, &value) in file
>>> kernel/events/core.c.
>>>
>>> https://lore.kernel.org/lkml/20240417065236.500011-1-gaosong@loongson.cn/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
>>>
>> Just go through functions (not sure if all), whether
>> perf_event_task_tick() or the callbacks of event_function_call() would
>> check the event->state first, if the event is in
>> PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs would not be touched really.
>> In this new proposal, all host events with exclude_guest attribute would
>> be put on PERF_EVENT_STATE_INACTIVE sate if guest own the PMU HW
>> resource. So I think it's fine.
>>
> Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
> If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
> access the global ctrl MSR.
I don't think there is any event with PERF_EVENT_STATE_ACTIVE state on
host when guest owns the PMU HW resource.
In current solution, VM would fail to create if there is any system-wide
event without exclude_guest attribute. If VM is created successfully and
when vm-entry happens, the helper perf_guest_enter() would put all host
events with exclude_guest attribute into PERF_EVENT_STATE_INACTIVE state
and block host to create system-wide events without exclude_guest attribute.
Powered by blists - more mailing lists