Date:   Thu, 29 Jul 2021 21:46:05 +0800
From:   Like Xu <like.xu.linux@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Paolo Bonzini <pbonzini@...hat.com>
Cc:     Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>, kvm@...r.kernel.org,
        x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: x86/pmu: Introduce pmc->is_paused to reduce the call
 time of perf interfaces

On 29/7/2021 8:58 pm, Peter Zijlstra wrote:
> On Wed, Jul 28, 2021 at 08:07:05PM +0800, Like Xu wrote:
>> From: Like Xu <likexu@...cent.com>
>>
>> Based on our observations, after any vm-exit associated with the vPMU, at
>> least two perf interfaces have to be called for guest counter emulation,
>> such as perf_event_{pause, read_value, period}(), and each one will
>> {lock, unlock} the same perf_event_ctx. The call frequency becomes even
>> higher when the guest uses counters in a multiplexed manner.
>>
>> Holding a lock once and completing the KVM request operations in the perf
>> context would introduce a set of impractical new interfaces. So we can
>> further optimize the vPMU implementation by avoiding repeated calls to
>> these interfaces in the KVM context for at least one pattern:
>>
>> After we call perf_event_pause() once, the event is disabled and its
>> internal count is reset to 0, so there is no need to pause it again or
>> read its value. Once the event is paused, its period will not be updated
>> until the next time it is resumed or reprogrammed. There is also no need
>> to call perf_event_period() twice for a non-running counter, considering
>> that the perf_event for a running counter is never paused.
>>
>> Based on this implementation, for the following common usage of
>> sampling 4 events with perf on a 4u8g (4-vCPU, 8 GB) guest:
>>
>>    echo 0 > /proc/sys/kernel/watchdog
>>    echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
>>    echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
>>    echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
>>    for i in `seq 1 1 10`
>>    do
>>        taskset -c 0 perf record \
>>            -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
>>            /root/br_instr a
>>    done
>>
>> the average latency of the guest NMI handler is reduced from
>> 37646.7 ns to 32929.3 ns (~1.14x speedup) on an Intel ICX server.
>> In addition to collecting more samples, no loss of sampling accuracy
>> was observed compared to before the optimization.
>>
>> Signed-off-by: Like Xu <likexu@...cent.com>
> 
> Looks sane I suppose.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> 
> What kinds of VM-exits are the most common?
> 

A typical vm-exit trace is as follows:

  146820 EXTERNAL_INTERRUPT
  126301 MSR_WRITE
   17009 MSR_READ
    9710 RDPMC
    7295 EXCEPTION_NMI
    2493 EPT_VIOLATION
    1357 EPT_MISCONFIG
     567 CPUID
     107 NMI_WINDOW
      59 IO_INSTRUCTION
       2 VMCALL

including the following kvm_msr trace:

   15822 msr_write, MSR_CORE_PERF_GLOBAL_CTRL
   14558 msr_read, MSR_CORE_PERF_GLOBAL_STATUS
    7315 msr_write, IA32_X2APIC_LVT_PMI
    7250 msr_write, MSR_CORE_PERF_GLOBAL_OVF_CTRL
    2922 msr_write, MSR_IA32_PMC0
    2912 msr_write, MSR_CORE_PERF_FIXED_CTR0
    2904 msr_write, MSR_CORE_PERF_FIXED_CTR1
    2390 msr_write, MSR_CORE_PERF_FIXED_CTR_CTRL
    2390 msr_read, MSR_CORE_PERF_FIXED_CTR_CTRL
    1195 msr_write, MSR_P6_EVNTSEL1
    1195 msr_write, MSR_P6_EVNTSEL0
     976 msr_write, MSR_IA32_PMC1
     618 msr_write, IA32_X2APIC_ICR

Due to the large number of MSR accesses, the latency of the guest PMI
handler is still far from that of the host handler.

I have two rough ideas that could further reduce the vPMU overhead in a
third round of optimization:

- Add a new paravirtualized PMU guest driver that saves the latest values
of all PMU MSRs in static per-vCPU memory, to reduce the number of
vm-exits and also make KVM's emulation access easier;

- Bypass the host perf_event PMI callback injection path and inject the
guest PMI directly after the EXCEPTION_NMI/PMI vm-exit (for a TDX guest);


Comments, criticism, or help with additional details are all welcome.
