linux-kernel - Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

Message-ID: <CAL715WJiUYpxRRqs3FNiLiS8b6=4Pm5K0u==S6t5NYi0p=vutw@mail.gmail.com>
Date:   Wed, 23 Aug 2023 11:28:17 -0700
From:   Mingwei Zhang <mizhang@...gle.com>
To:     Like Xu <like.xu.linux@...il.com>
Cc:     Sean Christopherson <seanjc@...gle.com>,
        Roman Kagan <rkagan@...zon.de>,
        Jim Mattson <jmattson@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Eric Hankland <ehankland@...gle.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

On Tue, Aug 22, 2023 at 2:30 AM Like Xu <like.xu.linux@...il.com> wrote:
>
> On 1/7/2023 5:26 am, Sean Christopherson wrote:
> > Ugh, yeah, de0f619564f4 created a bit of a mess.  The underlying issue that it
> > was solving is that perf_event_read_value() and friends might sleep (yay mutex),
> > and so can't be called from KVM's fastpath (IRQs disabled).
> Updating PMU counters for emulated instructions causes trouble.
>
> >
> > However, detecting overflow requires reading perf_event_read_value() to gather
> > the accumulated count from the hardware event in order to add it to the emulated
> > count from software.  E.g. if pmc->counter is X and the perf event counter is Y,
> > KVM needs to factor in Y because X+Y+1 might overflow even if X+1 does not.
> >
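
To make the X+Y+1 point concrete, here is a minimal userspace sketch, not
KVM code. The helper names and the 48-bit width used in the example are
assumptions for illustration only; x stands for pmc->counter and y for the
count read back from the perf event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a KVM function): value mask of a width-bit counter. */
static uint64_t model_ctr_mask(unsigned int width)
{
	assert(width >= 1 && width <= 64);
	return width == 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/*
 * Returns true if adding n emulated events wraps a width-bit counter whose
 * current value is x (software count) plus y (perf event count).
 * Valid for 0 <= n <= mask.
 */
static bool model_inc_overflows(uint64_t x, uint64_t y, uint64_t n,
				unsigned int width)
{
	uint64_t mask = model_ctr_mask(width);
	uint64_t before = (x + y) & mask;
	uint64_t after = (before + n) & mask;

	/* A wrapped sum is always smaller than the value it started from. */
	return after < before;
}
```

With x = 0xFFFFFFFFFFF0 and width 48, a single emulated event does not
wrap when y = 0 but does wrap when y = 0xF, so the check cannot be made
without knowing y.
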
> > Trying to snapshot the previous counter value is a bit of a mess.  It could probably
> > be made to work, but it's hard to reason about what the snapshot actually contains
> > and when it should be cleared, especially when factoring in the wrapping logic.
> >
> > Rather than snapshot the previous counter, I think it makes sense to:
> >
> >    1) Track the number of emulated counter events
>
> If events are counted separately, the challenge is to time the emulation of
> counter overflows correctly, since an overflow can occur on either side while
> the two counter values are out of sync.
>
> >    2) Accumulate and reset the counts from perf_event and emulated_counter into
> >       pmc->counter when pausing the PMC
> >    3) Pause and reprogram the PMC on writes (instead of the current approach of
> >       blindly updating the sample period)
>
> Updating the sample period is the only interface for KVM to configure hw
> behaviour on hw-ctr. I note that perf_event_set_count() will be proposed,
> and I'm pessimistic about this change.
>
> >    4) Pause the counter when stopping the perf_event to ensure pmc->counter is
> >       fresh (instead of manually updating pmc->counter)
> >
> > IMO, that yields more intuitive logic, and makes it easier to reason about
> > correctness since the behavior is easily defined: pmc->counter holds the counts
> > that have been gathered and processed, perf_event and emulated_counter hold
> > outstanding counts on top.  E.g. on a WRMSR to the counter, both the emulated
> > counter and the hardware counter are reset, because whatever counts existed
> > previously are irrelevant.
>
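
For reference, the accounting described above can be modelled in
userspace roughly as follows. This is only a sketch of the idea, not the
proposed diff: struct pmc_model, its helpers, and the perf_count field
(a stand-in for the count held by the perf event) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the three places a count can live. */
struct pmc_model {
	uint64_t counter;          /* counts already gathered and processed */
	uint64_t emulated_counter; /* outstanding emulated events */
	uint64_t perf_count;       /* stand-in for the perf event's count */
	unsigned int width;        /* counter width in bits */
};

static uint64_t pmc_model_mask(const struct pmc_model *pmc)
{
	assert(pmc->width >= 1 && pmc->width <= 64);
	return pmc->width == 64 ? UINT64_MAX : (UINT64_C(1) << pmc->width) - 1;
}

/* Guest-visible value: processed counts plus everything outstanding. */
static uint64_t pmc_model_read(const struct pmc_model *pmc)
{
	return (pmc->counter + pmc->perf_count + pmc->emulated_counter) &
	       pmc_model_mask(pmc);
}

/* Pause: fold the outstanding counts into counter and reset them. */
static void pmc_model_pause(struct pmc_model *pmc)
{
	pmc->counter = pmc_model_read(pmc);
	pmc->perf_count = 0;
	pmc->emulated_counter = 0;
}

/* Counter write: earlier counts are irrelevant, so drop them all. */
static void pmc_model_write(struct pmc_model *pmc, uint64_t val)
{
	pmc->perf_count = 0;
	pmc->emulated_counter = 0;
	pmc->counter = val & pmc_model_mask(pmc);
}
```

In this model the guest-visible value does not change across a pause,
and a write leaves no outstanding counts behind.
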
> If we take the hardware view, a counter, emulated or not, just increments
> and overflows at the threshold. The missing logic here is when the counter
> is truncated on writes of values wider than the counter, and how to deal
> with the value that pmc->prev_counter held before pmc->counter was truncated.
>
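
A small userspace illustration of why the truncation matters for a
snapshot-based overflow check. This is a model under stated assumptions,
not the KVM code: the 48-bit width, the function names, and the
"compare against the previous value" check are all made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed counter width for the example only. */
#define MODEL_WIDTH 48
#define MODEL_MASK  ((UINT64_C(1) << MODEL_WIDTH) - 1)

static_assert(MODEL_WIDTH < 64, "mask computation assumes a width below 64");

/* Emulated increment that reports a wrap by comparing against a snapshot. */
static bool model_inc_reports_overflow(uint64_t *counter)
{
	uint64_t prev = *counter; /* snapshot taken before the increment */

	*counter = (*counter + 1) & MODEL_MASK;
	return *counter < prev;
}

/* Write path that stores the value as-is, bits above the width included. */
static uint64_t model_write_raw(uint64_t val)
{
	return val;
}

/* Write path that truncates the value to the counter width. */
static uint64_t model_write_truncated(uint64_t val)
{
	return val & MODEL_MASK;
}
```

Writing (1 << 48) | 5 through the raw path makes the next increment
report an overflow even though the counter only moved from 5 to 6; with
the truncating path the same increment reports nothing.
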
> >
> > Pausing the counter _might_ make WRMSR slower, but we need to get this all
> > functionally correct before worrying too much about performance.
>
> Performance, security and correctness should all be considered at the beginning.
>

+1 on the performance part.

I did several rounds of performance testing: pausing the counter is
fast, but restarting the counter is *super* slow. The extra overhead
might just make vPMU useless, especially when the guest workload takes
full CPU/memory resources in a VM (like SPEC2017 does).

> >
> > Diff below for what I'm thinking (needs to be split into multiple patches).  It's
> > *very* lightly tested.
>
> It saddens me that no one has come up with an actual low-level counter-test for
> this issue.
>
> >
> > I'm about to disappear for a week, I'll pick this back up when I return.  In
> > the meantime, any testing and/or input would be much appreciated!
>
> How about accepting Roman's original fix and then exercising the rewriting genius?

+1

I think the best option would be to just apply the fix in the short term
and put the refactoring of the emulated counter in the next series.