Message-ID: <YWZ0D9r5BOm/8f7d@google.com>
Date: Wed, 13 Oct 2021 14:52:15 +0900
From: Sergey Senozhatsky <senozhatsky@...omium.org>
To: David Matlack <dmatlack@...gle.com>
Cc: Sergey Senozhatsky <senozhatsky@...omium.org>,
Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Suleiman Souhlal <suleiman@...gle.com>,
kvm list <kvm@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: MMU: make PTE_PREFETCH_NUM tunable
On (21/10/12 09:50), David Matlack wrote:
> On Tue, Oct 12, 2021 at 2:16 AM Sergey Senozhatsky
> <senozhatsky@...omium.org> wrote:
> >
> > Turn PTE_PREFETCH_NUM into a module parameter, so that it
> > can be tuned per-VM.
>
> Module parameters do not allow tuning per VM, they affect every VM on
> the machine.
>
> If you want per-VM tuning you could introduce a VM ioctl.
ACK.
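Something along these lines, as a rough first sketch (KVM_SET_PTE_PREFETCH
and PTE_PREFETCH_MAX are made-up names for illustration, not existing UAPI):

	/* arch/x86/kvm/x86.c, in kvm_arch_vm_ioctl(), sketch */
	case KVM_SET_PTE_PREFETCH: {
		u64 val;

		r = -EFAULT;
		if (copy_from_user(&val, argp, sizeof(val)))
			break;
		r = -EINVAL;
		/* must be a power of two, see __direct_pte_prefetch() */
		if (!val || !is_power_of_2(val) || val > PTE_PREFETCH_MAX)
			break;
		kvm->arch.pte_prefetch_num = val;
		r = 0;
		break;
	}

Validating the value at the ioctl boundary would also cover the
power-of-2 concern you raise below.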
> > ---
> > arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++++---------
>
> Please also update the shadow paging prefetching code in
> arch/x86/kvm/mmu/paging_tmpl.h, unless there is a good reason to
> diverge.
ACK.
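If I remember the shadow paging side correctly, the analogous fixed-size
buffer lives in the guest walker, so it would need the same treatment
(roughly):

	/* arch/x86/kvm/mmu/paging_tmpl.h */
	struct guest_walker {
		...
		pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
		...
	};

and FNAME(pte_prefetch) computes the same "& ~(PTE_PREFETCH_NUM - 1)"
prefetch window.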
> > @@ -732,7 +734,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >
> > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > - 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > + 1 + PT64_ROOT_MAX_LEVEL + pte_prefetch_num);
>
> There is a sampling problem. What happens if the user changes
> pte_prefetch_num while a fault is being handled?
Good catch.
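A local fix would be to snapshot the knob once per fault, e.g. (sketch):

	unsigned int prefetch_num = READ_ONCE(pte_prefetch_num);

	/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
				       1 + PT64_ROOT_MAX_LEVEL + prefetch_num);

but then the same snapshot has to be handed down to the prefetch path so
that topup and consumption agree. Your per-vCPU suggestion below avoids
that plumbing entirely.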
> > @@ -2753,20 +2755,29 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
> > struct kvm_mmu_page *sp,
> > u64 *start, u64 *end)
> > {
> > - struct page *pages[PTE_PREFETCH_NUM];
> > + struct page **pages;
> > struct kvm_memory_slot *slot;
> > unsigned int access = sp->role.access;
> > int i, ret;
> > gfn_t gfn;
> >
> > + pages = kmalloc_array(pte_prefetch_num, sizeof(struct page *),
> > + GFP_KERNEL);
>
> This code runs with the MMU lock held.
> In general we avoid doing any dynamic memory allocation while the MMU
> lock is held. That's why the memory caches exist. You can avoid
> allocating under a lock by allocating the prefetch array when the vCPU
> is first initialized. This would also solve the module parameter
> sampling problem because you can read it once and store it in struct
> kvm_vcpu.
I'll do per-vCPU pre-allocation, thanks. GFP_KERNEL allocation is less of a
problem while kvm->mmu_lock is held for read, but more so while it is held
for write.
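Roughly like this (the field names are made up, not in the current
struct kvm_vcpu_arch):

	/* hypothetical members in struct kvm_vcpu_arch */
	struct page	**prefetch_pages;
	unsigned int	pte_prefetch_num;

	/* on the vCPU creation path, sketch */
	vcpu->arch.pte_prefetch_num = READ_ONCE(pte_prefetch_num);
	vcpu->arch.prefetch_pages =
		kcalloc(vcpu->arch.pte_prefetch_num,
			sizeof(*vcpu->arch.prefetch_pages),
			GFP_KERNEL_ACCOUNT);
	if (!vcpu->arch.prefetch_pages)
		return -ENOMEM;

plus a matching kfree() on the vCPU destroy path.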
> > static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
> > @@ -2785,10 +2798,10 @@ static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
> >
> > WARN_ON(!sp->role.direct);
> >
> > - i = (sptep - sp->spt) & ~(PTE_PREFETCH_NUM - 1);
> > + i = (sptep - sp->spt) & ~(pte_prefetch_num - 1);
>
> This code assumes pte_prefetch_num is a power of 2, which is now no
> longer guaranteed to be true.
It does. I can check that the value is a power of 2 in the ioctl handler,
as sketched below.
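The mask trick rounds down to a multiple of n only when n is a power of
two; if arbitrary values were ever allowed, the kernel's rounddown()
helper would work instead (sketch):

	/* power-of-two only: ~(n - 1) clears the low bits */
	i = (sptep - sp->spt) & ~(pte_prefetch_num - 1);

	/* for arbitrary n */
	i = rounddown(sptep - sp->spt, pte_prefetch_num);

Rejecting non-power-of-2 values via is_power_of_2() at the ioctl boundary
keeps the fast path as-is.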