linux-kernel - Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map

Message-ID: <YGYMcdRuWOHBer24@google.com>
Date:   Thu, 1 Apr 2021 18:09:53 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Ben Gardon <bgardon@...gle.com>,
        LKML <linux-kernel@...r.kernel.org>, kvm <kvm@...r.kernel.org>,
        Peter Xu <peterx@...hat.com>, Peter Shier <pshier@...gle.com>,
        Peter Feiner <pfeiner@...gle.com>,
        Junaid Shahid <junaids@...gle.com>,
        Jim Mattson <jmattson@...gle.com>,
        Yulei Zhang <yulei.kernel@...il.com>,
        Wanpeng Li <kernellwp@...il.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Xiao Guangrong <xiaoguangrong.eric@...il.com>
Subject: Re: [PATCH v2 20/28] KVM: x86/mmu: Use atomic ops to set SPTEs in
 TDP MMU map

On Thu, Apr 01, 2021, Paolo Bonzini wrote:
> On 01/04/21 18:50, Ben Gardon wrote:
> > > retry:
> > >                   if (is_shadow_present_pte(iter.old_spte)) {
> > >                          if (is_large_pte(iter.old_spte)) {
> > >                                  if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> > >                                          break;
> > > 
> > >                                  /*
> > >                                   * The iter must explicitly re-read the SPTE because
> > >                                   * the atomic cmpxchg failed.
> > >                                   */
> > >                                  iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> > >                                  goto retry;
> > >                          }
> > >                   } else {
> > >                          ...
> > >                  }
> > > 
> > > ?
> > To be honest, that feels less readable to me. For me retry implies
> > that we failed to make progress and need to repeat an operation, but
> > the reality is that we did make progress and there are just multiple
> > steps to replace the large SPTE with a child PT.
> 
> You're right, it makes no sense---I misremembered the direction of
> tdp_mmu_zap_spte_atomic's return value.  I was actually thinking of this:
> 
> > Another option which could improve readability and performance would
> > be to use the retry to repeat failed cmpxchgs instead of breaking out
> > of the loop. Then we could avoid retrying the page fault each time a
> > cmpxchg failed, which may happen a lot as vCPUs allocate intermediate
> > page tables on boot. (Probably less common for leaf entries, but
> > possibly useful there too.)
> 
> which would be
> 
> retry:
>                  if (is_shadow_present_pte(iter.old_spte)) {
>                        if (is_large_pte(iter.old_spte) &&
>                            !tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter)) {
>                                 /*
>                                  * The iter must explicitly re-read the SPTE because
>                                  * the atomic cmpxchg failed.
>                                  */
>                                 iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                                 goto retry;
>                             }
>                             /* XXX move this to tdp_mmu_zap_spte_atomic? */
>                             iter.old_spte = 0;
>                        } else {
>                             continue;

This is wrong.  If a large PTE is successfully zapped, it will leave a !PRESENT
intermediate entry.  It's probably not fatal; I'm guessing it would lead to
RET_PF_RETRY and be cleaned up on the subsequent re-fault.

>                        }
>                  }
>                  sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
>                  child_pt = sp->spt;
> 
>                  new_spte = make_nonleaf_spte(child_pt,
>                                               !shadow_accessed_mask);
> 
>                  if (!tdp_mmu_set_spte_atomic(vcpu->kvm, &iter,
>                                              new_spte)) {
>                       tdp_mmu_free_sp(sp);
>                       /*
>                        * The iter must explicitly re-read the SPTE because
>                        * the atomic cmpxchg failed.
>                        */
>                       iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
>                       goto retry;

I'm not sure that _always_ retrying is correct.  The conflict means something
else is writing the same SPTE.  That could be a different vCPU handling an
identical fault, but it could also be something else blasting away the SPTE.  If
an upper level SPTE was zapped, e.g. the entire MMU instance is zapped,
installing a new SPTE would be wrong.

AFAICT, the only motivation for retrying in this loop is to handle the case
where a different vCPU is handling an identical fault.  It should be safe to
handle that, but if the conflicting SPTE is not-present, I believe this needs
to break to handle any pending updates.

			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
			if (!is_shadow_present_pte(iter.old_spte))
				break;
			goto retry;
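
For anyone who wants to poke at the rule outside the kernel, here is a minimal
userspace sketch of the same idea using C11 atomics.  It is not the KVM code:
SPTE_PRESENT, enum install_result and try_install_child() are hypothetical
names invented for the illustration, and the bit layout is an assumption.  It
only models "retry the cmpxchg when the conflicting entry is present, bail
out when it is not".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define SPTE_PRESENT (1ull << 0)

enum install_result { INSTALLED, ALREADY_PRESENT, BAIL_OUT };

static bool spte_present(uint64_t spte)
{
	return (spte & SPTE_PRESENT) != 0;
}

/*
 * old_spte is the walker's earlier snapshot of *sptep and may be stale.
 * Returns INSTALLED if new_spte was written, ALREADY_PRESENT if a present
 * entry is already there (e.g. another thread handled the same fault), and
 * BAIL_OUT if the cmpxchg lost to a writer that left a not-present entry.
 */
static enum install_result try_install_child(_Atomic uint64_t *sptep,
					     uint64_t old_spte,
					     uint64_t new_spte)
{
	for (;;) {
		if (spte_present(old_spte))
			return ALREADY_PRESENT;

		/* On failure, old_spte is updated to the current value. */
		if (atomic_compare_exchange_strong(sptep, &old_spte, new_spte))
			return INSTALLED;

		/* Conflict with a zap (or similar): do not install. */
		if (!spte_present(old_spte))
			return BAIL_OUT;
	}
}
```

In this model the caller would keep walking on INSTALLED or ALREADY_PRESENT
and would return to the fault path on BAIL_OUT.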

>                  }
>                  tdp_mmu_link_page(vcpu->kvm, sp, true,
>                                    huge_page_disallowed &&
>                                    req_level >= iter.level);
> 
>                  trace_kvm_mmu_get_page(sp, true);
> 
> which survives at least a quick smoke test of booting a 20-vCPU Windows
> guest.  If you agree I'll turn this into an actual patch.
>