Message-ID: <ZR3Ohk50rSofAnSL@google.com>
Date: Wed, 4 Oct 2023 13:43:50 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Mingwei Zhang <mizhang@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Dapeng Mi <dapeng1.mi@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Kan Liang <kan.liang@...ux.intel.com>,
Like Xu <likexu@...cent.com>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>, kvm@...r.kernel.org,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
Zhenyu Wang <zhenyuw@...ux.intel.com>,
Zhang Xiong <xiong.y.zhang@...el.com>,
Lv Zhiyuan <zhiyuan.lv@...el.com>,
Yang Weijiang <weijiang.yang@...el.com>,
Dapeng Mi <dapeng1.mi@...el.com>,
Jim Mattson <jmattson@...gle.com>,
David Dunn <daviddunn@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 03, 2023, Mingwei Zhang wrote:
> On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > The "when" is what's important. If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that. Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
>
> Hi Sean,
>
> Sorry, I don't mean to interrupt the conversation, but this is
> slightly confusing to me.
>
> I remember that when doing AMX, we added a gigantic 8KB buffer to the
> FPU context switch. That works well in Linux today. Why can't we do
> the same for the PMU, assuming we context switch all the counters,
> selectors, and global state there?

That's what we (Google folks) are proposing. However, there are significant
side effects if KVM context switches the PMU state outside of vcpu_run(),
whereas the FPU doesn't suffer the same problems.

Keeping the guest FPU resident for the duration of vcpu_run() is, in terms of
functionality, completely transparent to the rest of the kernel. From the kernel's
perspective, the guest FPU is just a variation of a userspace FPU, and the kernel
is already designed to save/restore userspace/guest FPU state when the kernel wants
to use the FPU for whatever reason. And crucially, kernel FPU usage is explicit
and contained, e.g. see kernel_fpu_{begin,end}(), and comes with mechanisms for
KVM to detect when the guest FPU needs to be reloaded (see TIF_NEED_FPU_LOAD).
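
To make that concrete, here's a minimal sketch of that bracketing pattern
(the function do_simd_work() is made up for illustration;
kernel_fpu_{begin,end}() and TIF_NEED_FPU_LOAD are the real hooks):

#include <asm/fpu/api.h>

/* Hypothetical kernel user of the FPU, e.g. a SIMD-accelerated memcpy. */
static void do_simd_work(void)
{
	/*
	 * Save whatever user/guest FPU state is live in the registers
	 * and claim the FPU for kernel use (disables preemption).
	 */
	kernel_fpu_begin();

	/* ... SSE/AVX work goes here ... */

	/*
	 * Done; the previous owner's state is reloaded lazily, which is
	 * exactly the condition TIF_NEED_FPU_LOAD signals to KVM.
	 */
	kernel_fpu_end();
}
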
The PMU is a completely different story. PMU usage, a.k.a. perf, is by
design "always running". KVM can't transparently stop host usage of the
PMU, as disabling the host's PMU usage stops perf events from
counting/profiling whatever they're supposed to profile.
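
For instance (a sketch, not taken from any real caller; the helper name is
made up, but perf_event_create_kernel_counter() is the real in-kernel API),
a host user might pin a counter that is expected to count continuously:

#include <linux/err.h>
#include <linux/perf_event.h>

/* Hypothetical host user: count cycles on @cpu, indefinitely. */
static struct perf_event *cycles_event;

static int start_cycle_counting(int cpu)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_HARDWARE,
		.config	= PERF_COUNT_HW_CPU_CYCLES,
		.size	= sizeof(attr),
		.pinned	= 1,	/* must always be on the PMU */
	};

	cycles_event = perf_event_create_kernel_counter(&attr, cpu, NULL,
							NULL, NULL);
	return IS_ERR(cycles_event) ? PTR_ERR(cycles_event) : 0;
}

Any window where KVM owns the PMU and perf doesn't is a blind spot for
that counter.
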
Today, KVM minimizes the "downtime" of host PMU usage by context switching
PMU state at VM-Enter and VM-Exit, or at least as close to VM-Enter/VM-Exit
as possible, e.g. for LBRs and Intel PT.
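
Roughly, on VMX that looks like the below (condensed from vmx.c's
atomic_switch_perf_msrs(); exact signatures vary by kernel version): perf
hands KVM the MSRs whose host/guest values differ, and the VMCS MSR-load
lists swap them atomically as part of VM-Enter/VM-Exit itself:

static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
	int i, nr_msrs;
	struct perf_guest_switch_msr *msrs;

	/* Ask perf which MSRs need different host vs. guest values. */
	msrs = perf_guest_get_msrs(&nr_msrs, (void *)&vmx->vcpu.arch.pmu);
	if (!msrs)
		return;

	for (i = 0; i < nr_msrs; i++)
		if (msrs[i].host == msrs[i].guest)
			clear_atomic_switch_msr(vmx, msrs[i].msr);
		else
			add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
					      msrs[i].host, false);
}
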
What we are proposing would *significantly* increase that downtime, to the
point where it would be almost unbounded on some paths, e.g. if KVM faults
in a page, gup() could end up swapping memory in from disk, installing PTEs,
and so on and so forth. If the host is trying to profile something related
to swap or memory management, it's out of luck.