linux-kernel - Re: [Patch v3] KVM: x86/pmu: Manipulate FIXED_CTR

Message-ID: <8475706a-c6ba-45a2-b2ee-a4dc7f4621c5@gmail.com>
Date: Thu, 7 Mar 2024 11:27:55 +0800
From: Like Xu <like.xu.linux@...il.com>
To: Jim Mattson <jmattson@...gle.com>
Cc: Sean Christopherson <seanjc@...gle.com>,
 Mingwei Zhang <mizhang@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>,
 Kan Liang <kan.liang@...ux.intel.com>, kvm@...r.kernel.org,
 linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
 Zhenyu Wang <zhenyuw@...ux.intel.com>, Zhang Xiong
 <xiong.y.zhang@...el.com>, Lv Zhiyuan <zhiyuan.lv@...el.com>,
 Dapeng Mi <dapeng1.mi@...el.com>, Dapeng Mi <dapeng1.mi@...ux.intel.com>
Subject: Re: [Patch v3] KVM: x86/pmu: Manipulate FIXED_CTR_CTRL MSR with
 macros

On 6/3/2024 11:09 pm, Jim Mattson wrote:
> On Wed, Mar 6, 2024 at 1:11 AM Like Xu <like.xu.linux@...il.com> wrote:
>>
>> On 6/3/2024 7:22 am, Sean Christopherson wrote:
>>> +Mingwei
>>>
>>> On Thu, Aug 24, 2023, Dapeng Mi wrote:
>>>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>>>> index 7d9ba301c090..ffda2ecc3a22 100644
>>>> --- a/arch/x86/kvm/pmu.h
>>>> +++ b/arch/x86/kvm/pmu.h
>>>> @@ -12,7 +12,8 @@
>>>>                                         MSR_IA32_MISC_ENABLE_BTS_UNAVAIL)
>>>>
>>>>    /* retrieve the 4 bits for EN and PMI out of IA32_FIXED_CTR_CTRL */
>>>> -#define fixed_ctrl_field(ctrl_reg, idx) (((ctrl_reg) >> ((idx)*4)) & 0xf)
>>>> +#define fixed_ctrl_field(ctrl_reg, idx) \
>>>> +    (((ctrl_reg) >> ((idx) * INTEL_FIXED_BITS_STRIDE)) & INTEL_FIXED_BITS_MASK)
>>>>
>>>>    #define VMWARE_BACKDOOR_PMC_HOST_TSC               0x10000
>>>>    #define VMWARE_BACKDOOR_PMC_REAL_TIME              0x10001
>>>> @@ -165,7 +166,8 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
>>>>
>>>>       if (pmc_is_fixed(pmc))
>>>>               return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
>>>> -                                    pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
>>>> +                                    pmc->idx - INTEL_PMC_IDX_FIXED) &
>>>> +                                    (INTEL_FIXED_0_KERNEL | INTEL_FIXED_0_USER);
>>>>
>>>>       return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
>>>>    }
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index f2efa0bf7ae8..b0ac55891cb7 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -548,8 +548,13 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>               setup_fixed_pmc_eventsel(pmu);
>>>>       }
>>>>
>>>> -    for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>>> -            pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));
>>>> +    for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> +            pmu->fixed_ctr_ctrl_mask &=
>>>> +                     ~intel_fixed_bits_by_idx(i,
>>>> +                                              INTEL_FIXED_0_KERNEL |
>>>> +                                              INTEL_FIXED_0_USER |
>>>> +                                              INTEL_FIXED_0_ENABLE_PMI);
>>>> +    }
>>>>       counter_mask = ~(((1ull << pmu->nr_arch_gp_counters) - 1) |
>>>>               (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED));
>>>>       pmu->global_ctrl_mask = counter_mask;
>>>> @@ -595,7 +600,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>                       pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
>>>>                       for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>>                               pmu->fixed_ctr_ctrl_mask &=
>>>> -                                    ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
>>>
>>> OMG, this might just win the award for most obfuscated PMU code in KVM, which is
>>> saying something.  The fact that INTEL_PMC_IDX_FIXED happens to be 32, the same
>>> bit number as ICL_FIXED_0_ADAPTIVE, is 100% coincidence.  Good riddance.
>>>
>>> Argh, and this goofy code helped introduce a real bug.  reprogram_fixed_counters()
>>> doesn't account for the upper 32 bits of IA32_FIXED_CTR_CTRL.
>>>
>>> Wait, WTF?  Nothing in KVM accounts for the upper bits.  This can't possibly work.
>>>
>>> IIUC, because KVM _always_ sets precise_ip to a non-zero bit for PEBS events,
>>> perf will _always_ generate an adaptive record, even if the guest requested a
>>> basic record.  Ugh, and KVM will always generate adaptive records even if the
>>> guest doesn't support them.  This is all completely broken.  It probably kinda
>>> sorta works because the Basic info is always stored in the record, and generating
>>> more info requires a non-zero MSR_PEBS_DATA_CFG, but ugh.
>>
>> Yep, it works at least on machines with both adaptive and pebs_full features.
>>
>> I remember one generation of Atom core (? GOLDMONT) that didn't have both
>> above PEBS sub-features, so we didn't set x86_pmu.pebs_ept on that platform.
>>
>> Mingwei or others are encouraged to construct use cases in KUT::pmu_pebs.flat
>> that violate guest-pebs rules (e.g., leak host state), as we all recognize that
>> testing is the right way to condemn legacy code, not just lengthy emails.
>>
>>>
>>> Oh great, and it gets worse.  intel_pmu_disable_fixed() doesn't clear the upper
>>> bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set.  Unless I'm misreading the code,
>>> intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE either,
>>> as it only modifies the bit when it wants to set ICL_FIXED_0_ADAPTIVE.
>>>
>>> *sigh*
>>>
>>> I'm _very_ tempted to disable KVM PEBS support for the current PMU, and make it
>>> available only when the so-called passthrough PMU is available[*].  Because I
>>> don't see how this can possibly be functionally correct, nor do I see a way
>>> to make it functionally correct without a rather large and invasive series.
>>
>> Considering that I've tried the idea myself, I have no inclination towards
>> "passthrough PMU", and I'd like to be able to take the time to review that
>> patchset while we all wait for a clear statement from that perf-core man,
>> who doesn't really care about virtualization and doesn't want to lose control
>> of global hardware resources.
>>
>> Before we actually get to that ideal state you want, we have to deal with
>> some intermediate state and face any users that rely on the current code.
>> You had urged merging a KVM document for vPMU; I'm not sure how far
>> along that part of the work is.
>>
>>>
>>> Ouch.  And after chatting with Mingwei, who asked the very good question of
>>> "can this leak host state?", I am pretty sure that yes, this can leak host state.
>>
>> The Basic Info has a tsc field, I suspect it's the host-state-tsc.
>>
>>>
>>> When PERF_CAP_PEBS_BASELINE is enabled for the guest, i.e. when the guest has
>>> access to adaptive records, KVM gives the guest full access to MSR_PEBS_DATA_CFG
>>>
>>>        pmu->pebs_data_cfg_mask = ~0xff00000full;
>>>
>>> which makes sense in a vacuum, because AFAICT the architecture doesn't allow
>>> exposing a subset of the four adaptive controls.
>>>
>>> GPRs and XMMs are always context switched and thus benign, but IIUC, Memory Info
>>> provides data that might not otherwise be available to the guest, e.g. if host
>>> userspace has disallowed equivalent events via KVM_SET_PMU_EVENT_FILTER.
>>
>> Indeed, KVM_SET_PMU_EVENT_FILTER doesn't work in harmony with
>> guest-pebs, and I believe there is a big problem here, especially with the
>> lack of targeted testing.
>>
>> One reason for this is that we don't use this cockamamie API in our
>> large-scale production environments, and users of vPMU want to get real
>> runtime information about physical cpus, not just virtualised hardware
>> architecture interfaces.
>>
>>>
>>> And unless I'm missing something, LBRs are a full leak of host state.  Nothing
>>> in the SDM suggests that PEBS records honor MSR intercepts, so unless KVM is
>>> also passing through LBRs, i.e. is context switching all LBR MSRs, the guest can
>>> use PEBS to read host LBRs at will.
>>
>> KVM is also passing through LBRs when guest uses LBR but not at the
>> granularity of vm-exit/entry. I'm not sure if the LBR_EN bit is required
>> to get LBR information via PEBS, nor have I confirmed whether PEBS-lbr
>> can be enabled at the same time as independent LBR.
>>
>> I recall that PEBS-assist, per cpu-arch, would clean up this part of the
>> record when crossing root/non-root boundaries, or not generate a record.
>>
>> We're looking forward to the tests that will undermine this perception.
>>
>> There are some devilish details during the processing of vm-exit and
>> the generation of host/guest pebs, and those interested can delve into
>> the short description in this SDM section "20.9.5 EPT-Friendly PEBS".
>>
>>>
>>> Unless someone chimes in to point out how PEBS virtualization isn't a broken mess,
>>> I will post a patch to effectively disable PEBS virtualization.
>>
>> There are two factors that affect the availability of guest-pebs:
>>
>> 1. the technical need to use core-PMU in both host/guest worlds;
>> (I don't think Googlers are paying attention to this part of users' needs)
> 
> Let me clear up any misperceptions you might have that Google alone is
> foisting the pass-through PMU on the world. The work so far has been a
> collaboration between Google and Intel. Now, AMD has joined the
> collaboration as well. Mingwei is taking the lead on the project, but
> Googlers are outnumbered by the x86 CPU vendors ten to one.

This is such great news.

> 
> The pass-through PMU allows both the host and guest worlds to use the
> core PMU, more so than the existing vPMU implementation. I assume your

Can I further confirm that, in any case, host/guest can use PMU resources,
such as some special, more accurate counters? Is there an end of the story
for that static partitioning scheme?

> complaint is about the desire for host software to monitor guest
> behavior with core PMU events while the guest is running. Today,
> Google Cloud does this for fleet management, and losing this
> capability is not something we are looking forward to. However, the
> writing is on the wall: Coco is going to take this capability away
> from us anyway.

Coco pays a corresponding performance cost, and it's a paradox to hide
any performance trace of coco-guests from the host's point of view.

Thanks for the input, Jim. Let me try to help.

> 
>> 2. guest-pebs is temporarily disabled in the case of cross-mapped counters,
>> which reduces the number of performance samples collected by the guest;
>>
>>>
>>> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
>>> index 41a4533f9989..a2f827fa0ca1 100644
>>> --- a/arch/x86/kvm/vmx/capabilities.h
>>> +++ b/arch/x86/kvm/vmx/capabilities.h
>>> @@ -392,7 +392,7 @@ static inline bool vmx_pt_mode_is_host_guest(void)
>>>
>>>    static inline bool vmx_pebs_supported(void)
>>>    {
>>> -       return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_ept;
>>> +       return false;
>>
>> As you know, user-space VMM may disable guest-pebs by filtering out the
>> MSR_IA32_PERF_CAPABILITIES.PERF_CAP_PEBS_FORMAT or CPUID.PDCM.
>>
>> In the end, if our great KVM maintainers insist on doing this,
>> there is obviously nothing I can do about it.
>>
>> Hope you have a good day.
>>
>>>    }
>>>
>>>    static inline bool cpu_has_notify_vmexit(void)
>>>
>>
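For readers following the quoted patch, here is a minimal standalone sketch of the IA32_FIXED_CTR_CTRL bit layout it relies on. The macro values are assumed to mirror arch/x86/include/asm/perf_event.h and should be checked against your tree. The old_*/new_* helpers are hypothetical names used only for this illustration. The macro form of the adaptive bit is also an assumption, since the quoted hunk is cut off before the replacement line.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, modelled on arch/x86/include/asm/perf_event.h. */
#define INTEL_PMC_IDX_FIXED       32
#define INTEL_FIXED_BITS_MASK     0xFULL
#define INTEL_FIXED_BITS_STRIDE   4
#define INTEL_FIXED_0_KERNEL      (1ULL << 0)
#define INTEL_FIXED_0_USER        (1ULL << 1)
#define INTEL_FIXED_0_ANYTHREAD   (1ULL << 2)
#define INTEL_FIXED_0_ENABLE_PMI  (1ULL << 3)
#define ICL_FIXED_0_ADAPTIVE      (1ULL << 32)

/* Shift a set of per-counter bits into the slot of fixed counter _idx. */
#define intel_fixed_bits_by_idx(_idx, _bits) \
	((_bits) << ((_idx) * INTEL_FIXED_BITS_STRIDE))

/* Extract the 4-bit control field of fixed counter idx (as in the patch). */
#define fixed_ctrl_field(ctrl_reg, idx) \
	(((ctrl_reg) >> ((idx) * INTEL_FIXED_BITS_STRIDE)) & INTEL_FIXED_BITS_MASK)

/* Hypothetical helper: the open-coded mask the patch removes (0xb = 1011b). */
static inline uint64_t old_basic_bits(int i)
{
	return 0xbull << (i * 4);
}

/* Hypothetical helper: the same mask spelled with named bits. */
static inline uint64_t new_basic_bits(int i)
{
	return intel_fixed_bits_by_idx(i, INTEL_FIXED_0_KERNEL |
					  INTEL_FIXED_0_USER |
					  INTEL_FIXED_0_ENABLE_PMI);
}

/*
 * Hypothetical helper: the expression Sean calls obfuscated. It only yields
 * the adaptive bit because INTEL_PMC_IDX_FIXED happens to equal 32.
 */
static inline uint64_t old_adaptive_bit(int i)
{
	return 1ULL << (INTEL_PMC_IDX_FIXED + i * 4);
}

/* Hypothetical helper: assumed macro form of the same bit. */
static inline uint64_t new_adaptive_bit(int i)
{
	return intel_fixed_bits_by_idx(i, ICL_FIXED_0_ADAPTIVE);
}
```

Both spellings produce the same masks for every counter index. The adaptive bits sit in the upper 32 bits of the MSR, so code that only handles the low 32 bits of fixed_ctr_ctrl never sees them, which is the gap Sean describes.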