linux-kernel - Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Ys33RtxeDz0egEM0@xz-m1.local>
Date:   Tue, 12 Jul 2022 18:35:50 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is
 using TDP

On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> Dynamically size struct pte_list_desc's array of sptes based on whether
> or not KVM is using TDP.  Commit dc1cff969101 ("KVM: X86: MMU: Tune
> PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> improve performance when using shadow paging, but its analysis that the
> larger size would not affect TDP was wrong.  Consuming pte_list_desc
> objects for nested TDP is indeed rare, but _allocating_ objects is not,
> as KVM allocates 40 objects for each per-vCPU cache.  Reducing the size
> from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> 1280, and also provides similar savings when eager page splitting for
> nested MMUs kicks in.
> 
> The per-vCPU overhead could be further reduced by using a custom, smaller
> capacity for the per-vCPU caches, but that's more of an "and" than
> an "or" change, e.g. it wouldn't help the eager page split use case.
> 
> Set the list size to the bare minimum without completely defeating the
> purpose of an array (and because pte_list_add() assumes the array is at
> least two entries deep).  A larger size, e.g. 4, would reduce the number
> of "allocations", but those "allocations" only become allocations in
> truth if a single vCPU depletes its cache to where a topup is needed,
> i.e. if a single vCPU "allocates" 30+ lists.  Conversely, those 2 extra
> entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> TDP is used.
> 
> In the unlikely event that performance of aliased gfns for nested TDP
> really is (or becomes) a priority for oddball workloads, KVM could add a
> knob to let the admin tune the array size for their environment.
> 
> Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> using rmaps; this can also be addressed separately.

The only possible way of using pte_list_desc when tdp=1 is when the
hypervisor tries to map the same host pages with different GPAs?

And we don't really have a real use case of that, or.. do we?

Sorry to start with asking questions, it's just that if we know that
pte_list_desc is probably not gonna be used then could we simply skip the
cache layer as a whole?  IOW, we don't make the "array size of pte list
desc" dynamic, instead we make the whole "pte list desc cache layer"
dynamic.  Is it possible?

-- 
Peter Xu