[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e7e9201f-76cc-08de-29a7-6d75abccf968@linux.alibaba.com>
Date: Fri, 3 Sep 2021 08:51:56 +0800
From: Lai Jiangshan <laijs@...ux.alibaba.com>
To: Sean Christopherson <seanjc@...gle.com>,
Lai Jiangshan <jiangshanlai@...il.com>
Cc: linux-kernel@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Avi Kivity <avi@...hat.com>, kvm@...r.kernel.org
Subject: Re: [PATCH 2/7] KVM: X86: Synchronize the shadow pagetable before
link it
On 2021/9/3 07:40, Sean Christopherson wrote:
> On Tue, Aug 24, 2021, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@...ux.alibaba.com>
>>
>> If gpte is changed from non-present to present, the guest doesn't need
>> to flush tlb per SDM. So the host must synchronze sp before
>> link it. Otherwise the guest might use a wrong mapping.
>>
>> For example: the guest first changes a level-1 pagetable, and then
>> links its parent to a new place where the original gpte is non-present.
>> Finally the guest can access the remapped area without flushing
>> the tlb. The guest's behavior should be allowed per SDM, but the host
>> kvm mmu makes it wrong.
>
> Ah, are you saying, given:
>
> VA_x = PML4_A -> PDP_B -> PD_C -> PT_D
>
> the guest can modify PT_D, then link it with
>
> VA_y = PML4_A -> PDP_B -> PD_E -> PT_D
>
> and access it via VA_y without flushing, and so KVM must sync PT_D. Is that
> correct?
Yes. and another vcpu accesses it via VA_y without flushing.
>
>> Fixes: 4731d4c7a077 ("KVM: MMU: out of sync shadow core")
>> Signed-off-by: Lai Jiangshan <laijs@...ux.alibaba.com>
>> ---
>
> ...
>
>> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
>> index 50ade6450ace..48c7fe1b2d50 100644
>> --- a/arch/x86/kvm/mmu/paging_tmpl.h
>> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
>> @@ -664,7 +664,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
>> * emulate this operation, return 1 to indicate this case.
>> */
>> static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>> - struct guest_walker *gw)
>> + struct guest_walker *gw, unsigned long mmu_seq)
>> {
>> struct kvm_mmu_page *sp = NULL;
>> struct kvm_shadow_walk_iterator it;
>> @@ -678,6 +678,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>> top_level = vcpu->arch.mmu->root_level;
>> if (top_level == PT32E_ROOT_LEVEL)
>> top_level = PT32_ROOT_LEVEL;
>> +
>> +again:
>> /*
>> * Verify that the top-level gpte is still there. Since the page
>> * is a root page, it is either write protected (and cannot be
>> @@ -713,8 +715,28 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>> if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
>> goto out_gpte_changed;
>>
>> - if (sp)
>> + if (sp) {
>> + /*
>> + * We must synchronize the pagetable before link it
>> + * because the guest doens't need to flush tlb when
>> + * gpte is changed from non-present to present.
>> + * Otherwise, the guest may use the wrong mapping.
>> + *
>> + * For PG_LEVEL_4K, kvm_mmu_get_page() has already
>> + * synchronized it transiently via kvm_sync_page().
>> + *
>> + * For higher level pagetable, we synchronize it
>> + * via slower mmu_sync_children(). If it once
>> + * released the mmu_lock, we need to restart from
>> + * the root since we don't have reference to @sp.
>> + */
>> + if (sp->unsync_children && !mmu_sync_children(vcpu, sp, false)) {
>
> I don't like dropping mmu_lock in the page fault path. I agree that it's not
> all that different than grabbing various things in kvm_mmu_do_page_fault() long
> before acquiring mmu_lock, but I'm not 100% convinced we don't have a latent
> bug hiding somehwere in there :-), and (b) there's a possibility, however small,
> that something in FNAME(fetch) that we're missing. Case in point, this technically
> needs to do make_mmu_pages_available().
>
> And I believe kvm_mmu_get_page() already tries to handle this case by requesting
> KVM_REQ_MMU_SYNC if it uses a sp with unsync_children, it just doesn't handle SMP
> interaction, e.g. can link a sp that's immediately available to other vCPUs before
> the sync.
>
> Rather than force the sync here, what about kicking all vCPUs and retrying the
> page fault? The only gross part is that kvm_mmu_get_page() can now fail :-(
>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 9 +++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 4 ++++
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 09b256db394a..332b9fb3454c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -57,7 +57,8 @@
> #define KVM_REQ_MIGRATE_TIMER KVM_ARCH_REQ(0)
> #define KVM_REQ_REPORT_TPR_ACCESS KVM_ARCH_REQ(1)
> #define KVM_REQ_TRIPLE_FAULT KVM_ARCH_REQ(2)
> -#define KVM_REQ_MMU_SYNC KVM_ARCH_REQ(3)
> +#define KVM_REQ_MMU_SYNC \
> + KVM_ARCH_REQ_FLAGS(3, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> #define KVM_REQ_CLOCK_UPDATE KVM_ARCH_REQ(4)
> #define KVM_REQ_LOAD_MMU_PGD KVM_ARCH_REQ(5)
> #define KVM_REQ_EVENT KVM_ARCH_REQ(6)
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4853c033e6ce..03293cd3c7ae 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2143,8 +2143,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> }
>
> - if (sp->unsync_children)
> - kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> + if (sp->unsync_children) {
> + kvm_make_all_cpus_request(KVM_REQ_MMU_SYNC, vcpu);
> + return NULL;
> + }
>
> __clear_sp_write_flooding_count(sp);
>
> @@ -2999,6 +3001,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
> it.level - 1, true, ACC_ALL);
> + BUG_ON(!sp);
>
> link_shadow_page(vcpu, it.sptep, sp);
> if (fault->is_tdp && fault->huge_page_disallowed &&
> @@ -3383,6 +3386,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
> + BUG_ON(!sp);
> +
> ++sp->root_count;
>
> return __pa(sp->spt);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 50ade6450ace..f573d45e2c6f 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -704,6 +704,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> access = gw->pt_access[it.level - 2];
> sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
> it.level-1, false, access);
> + if (!sp)
> + return RET_PF_RETRY;
> }
>
> /*
> @@ -742,6 +744,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> if (!is_shadow_present_pte(*it.sptep)) {
> sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
> it.level - 1, true, direct_access);
> + BUG_ON(!sp);
> +
> link_shadow_page(vcpu, it.sptep, sp);
> if (fault->huge_page_disallowed &&
> fault->req_level >= it.level)
> --
>
Powered by blists - more mailing lists