linux-kernel - Re: [PATCH] KVM: x86: Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ

Message-ID: <CAL715WKn1RPiY23x3WAi7BASyLDSZuEO7CJ6FObCxOmRpBwh7Q@mail.gmail.com>
Date:   Wed, 27 Sep 2023 11:23:39 -0700
From:   Mingwei Zhang <mizhang@...gle.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Xin Li <xin@...or.com>, Paolo Bonzini <pbonzini@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Jim Mattson <jmattson@...gle.com>,
        Like Xu <likexu@...cent.com>, Kan Liang <kan.liang@...el.com>,
        Dapeng1 Mi <dapeng1.mi@...el.com>
Subject: Re: [PATCH] KVM: x86: Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI)

On Wed, Sep 27, 2023 at 9:10 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Tue, Sep 26, 2023, Xin Li wrote:
> > On 9/26/2023 9:15 PM, Mingwei Zhang wrote:
> > > ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI.
> > > Sorry about that.
> > >
> > > On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@...gle.com> wrote:
> > > >
> > > > Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI).
> >
> > Please remove it, no need to repeat the subject.
>
> Heh, from Documentation/process/maintainer-kvm-x86.rst:
>
>   Changelog
>   ~~~~~~~~~
>   Most importantly, write changelogs using imperative mood and avoid pronouns.
>
>   See :ref:`describe_changes` for more information, with one amendment: lead with
>   a short blurb on the actual changes, and then follow up with the context and
>   background.  Note!  This order directly conflicts with the tip tree's preferred
>   approach!  Please follow the tip tree's preferred style when sending patches
>   that primarily target arch/x86 code that is _NOT_ KVM code.
>
> That said, I do prefer that the changelog intro isn't just a copy+paste of the
> shortlog, and the shortlog and changelog should use conversational language instead
> of describing the literal code movement.
>
> > > > When vPMU is in active use, processing each KVM_REQ_PMI will generate a
>
> This is not guaranteed.
>
> > > > KVM_REQ_NMI. The existing control flow after KVM_REQ_PMI finishes will fail
> > > > the guest entry, jump to kvm_x86_cancel_injection(), and re-enter
> > > > vcpu_enter_guest(), which wastes a lot of cycles and increases the overhead
> > > > of vPMU as well as of virtualization.
>
> As above, use conversational language, the changelog isn't meant to be a play-by-play.
>
> E.g.
>
>   KVM: x86: Service NMI requests *after* PMI requests in VM-Enter path
>
>   Move the handling of NMI requests after PMI requests in the VM-Enter path
>   so that KVM doesn't need to cancel and redo VM-Enter in the likely
>   scenario that the vCPU has configured its LVPTC entry to generate an NMI.
>
>   Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
>   requests after NMI requests means KVM won't detect the pending NMI request
>   until the final check for outstanding requests.  Detecting requests at the
>   final stage is costly as KVM has already loaded guest state, potentially
>   queued events for injection, disabled IRQs, dropped SRCU, etc., most of
>   which needs to be unwound.
>
> > Optimization is after correctness, so please explain if this is correct
> > first!
>
> Not first.  Leading with an in-depth description of KVM requests and NMI handling
> is not going to help understand *why* this change is being made.  But I do agree
> that this should provide an analysis of why it's ok to swap the order, specifically
> why it's architecturally ok if KVM drops an NMI due to the swapped ordering, e.g.
> if the PMI is coincident with two other NMIs (or one other NMI and NMIs are blocked).
>
> > > > So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM
> > > > runloop more efficient with vPMU.
> > > >
> > > > To evaluate the effectiveness of this change, we launch an 8-vcpu QEMU VM on
>
> Avoid pronouns.  There's no need for all the "fluff", just state the setup, the
> test, and the results.
>
> Really getting into the nits, but the whole "8-vcpu QEMU VM" versus
> "the setup of using single core, single thread" is confusing IMO.  If there were
> potential performance downsides and/or tradeoffs, then getting the gory details
> might be necessary, but that's not the case here, and if it were really necessary
> to drill down that deep, then I would want to better quantify the impact, e.g. in
> terms of latency.
>
>   E.g. on Intel SPR running SPEC2017 benchmark and Intel vtune in the guest,
>   handling PMI requests before NMI requests reduces the number of canceled
>   runs by ~1500 per second, per vCPU (counted by probing calls to
>   vmx_cancel_injection()).
>
> > > > an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune
> > > > uses. In addition, we use SPEC2017 benchmark programs as the workload with
> > > > the setup of using single core, single thread.
> > > >
> > > > At the host level, we probe the invocations to vmx_cancel_injection() with
> > > > the following command:
> > > >
> > > >      $ perf probe -a vmx_cancel_injection
> > > >      $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
> > > >
> > > > The following is the result that we collected at the beginning of the spec2017
> > > > benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive
> > > > the incompleteness.
> > > >
> > > > On kernel without the change:
> > > >      10.010018010              14254      probe:vmx_cancel_injection
> > > >      20.037646388              15207      probe:vmx_cancel_injection
> > > >      30.078739816              15261      probe:vmx_cancel_injection
> > > >      40.114033258              15085      probe:vmx_cancel_injection
> > > >      50.149297460              15112      probe:vmx_cancel_injection
> > > >      60.185103088              15104      probe:vmx_cancel_injection
> > > >
> > > > On kernel with the change:
> > > >      10.003595390                 40      probe:vmx_cancel_injection
> > > >      20.017855682                 31      probe:vmx_cancel_injection
> > > >      30.028355883                 34      probe:vmx_cancel_injection
> > > >      40.038686298                 31      probe:vmx_cancel_injection
> > > >      50.048795162                 20      probe:vmx_cancel_injection
> > > >      60.069057747                 19      probe:vmx_cancel_injection
> > > >
> > > > From the above, it is clear that we save 1500 invocations per vcpu per
> > > > second to vmx_cancel_injection() for workloads like perlbench.
>
> Nit, this really should have:
>
>   Suggested-by: Sean Christopherson <seanjc@...gle.com>
>
> I personally don't care about the attribution, but (a) others often do care and
> (b) the added context is helpful.  E.g. for bad/questionable suggestions/ideas,
> knowing that person X was also involved helps direct and/or curate questions/comments
> accordingly.

For sure! I will also pay more attention to that in the future.

Thanks.
-Mingwei
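
The ordering argument in the thread can be illustrated with a small toy model
in plain C. This is only a sketch, not KVM code: the struct, the REQ_* bits and
every function name below are hypothetical stand-ins, and the model assumes
that each PMI raises exactly one NMI request (i.e. the guest has programmed
its LVTPC entry for NMI delivery, which the thread notes is likely but not
guaranteed). It ignores NMI blocking and coalescing entirely.

```c
#include <stdbool.h>

/* Hypothetical request bits; these are NOT the kernel's KVM_REQ_* values. */
enum { REQ_PMI = 1u << 0, REQ_NMI = 1u << 1 };

struct toy_vcpu {
	unsigned int requests;          /* pending request bitmask */
	unsigned int nmis_serviced;     /* NMI requests consumed so far */
	unsigned int cancelled_entries; /* entries aborted by the final check */
};

/* Test-and-clear a request bit, loosely modeled on kvm_check_request(). */
static bool check_request(struct toy_vcpu *v, unsigned int req)
{
	if (!(v->requests & req))
		return false;
	v->requests &= ~req;
	return true;
}

/* Assumption: delivering the PMI through the local APIC raises an NMI request. */
static void handle_pmi(struct toy_vcpu *v)
{
	v->requests |= REQ_NMI;
}

static void handle_nmi(struct toy_vcpu *v)
{
	v->nmis_serviced++;
}

/* One guest-entry attempt. Returns false if the entry had to be cancelled. */
static bool try_enter(struct toy_vcpu *v, bool pmi_first)
{
	if (pmi_first) {
		if (check_request(v, REQ_PMI))
			handle_pmi(v);
		if (check_request(v, REQ_NMI))
			handle_nmi(v);
	} else {
		if (check_request(v, REQ_NMI))
			handle_nmi(v);
		if (check_request(v, REQ_PMI))
			handle_pmi(v);
	}

	/* Final check for outstanding requests: anything left forces a redo. */
	if (v->requests) {
		v->cancelled_entries++;
		return false;
	}
	return true;
}

/* Raise `pmis` PMI requests one at a time and count the cancelled entries. */
static unsigned int count_cancelled_entries(unsigned int pmis, bool pmi_first)
{
	struct toy_vcpu v = { 0 };
	unsigned int i;

	for (i = 0; i < pmis; i++) {
		v.requests |= REQ_PMI;
		while (!try_enter(&v, pmi_first))
			; /* retry until the entry succeeds */
	}
	return v.cancelled_entries;
}
```

Under these assumptions, count_cancelled_entries(n, false) returns n (one
cancelled entry per PMI with the NMI check first), while
count_cancelled_entries(n, true) returns 0. The model says nothing about
whether dropping a coincident NMI is architecturally acceptable; that analysis
still has to be made separately, as requested in the review.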