Message-ID: <aC6fmIuKgDYHcaLp@yzhao56-desk.sh.intel.com>
Date: Thu, 22 May 2025 11:52:56 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"Huang, Kai" <kai.huang@...el.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>, "Li,
Zhiquan1" <zhiquan1.li@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Yamahata,
Isaku" <isaku.yamahata@...el.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com"
<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Wed, May 21, 2025 at 11:40:15PM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-05-20 at 17:34 +0800, Yan Zhao wrote:
> > So, you want to disallow huge pages for non-Linux TDs, then we have no need
> > to support splitting in the fault path, right?
> >
> > I'm OK if we don't care non-Linux TDs for now.
> > This can simplify the splitting code and we can add the support when there's a
> > need.
>
> We do need to care about non-Linux TDs functioning, but we don't need to
> optimize for them at this point. We need to optimize for things that happen
> often. Pending-#VE using TDs are rare, and don't need to have huge pages in
> order to work.
>
> Yesterday Kirill and I were chatting offline about the newly defined
> TDG.MEM.PAGE.RELEASE. It is kind of like an unaccept, so another possibility is:
> 1. Guest accepts at 2MB
> 2. Guest releases at 2MB (no notice to VMM)
> 3. Guest accepts at 4k, EPT violation with expectation to demote
>
> In that case, KVM won't know to expect it, and that it needs to preemptively map
> things at 4k.
>
> For full coverage of the issue, can we discuss a little bit about what demote in
> the fault path would look like?
Demote in the fault path will be done under the mmu read lock.
So, the flow in the fault path is:

1. Zap with mmu read lock.
        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
        if (ret <= 0)
                return ret;
2. Track with mmu read lock.
        ret = tdx_track(kvm, true);
        if (ret)
                return ret;
3. Demote with mmu read lock.
        ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
        if (ret)
                goto err;
4. Return success, or unzap as the error fallback.
        tdx_sept_unzap_private_spte(kvm, gfn, level);
Steps 1-3 will return -EBUSY on a busy error (which will not happen very often,
since we will introduce kvm_tdx->sept_lock; I can post the full lock analysis if
necessary).
Step 4 is guaranteed to succeed.
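
Putting the steps together, a minimal sketch of what the fault-path wrapper
could look like (the wrapper name tdx_sept_split_private_spte() is made up; it
just strings together the helpers shown below, mirroring the step list above,
and assumes the caller holds the mmu read lock):

/*
 * Hypothetical wrapper (name made up) combining steps 1-4 above for
 * demote in the fault path, under the mmu read lock.
 */
static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn,
                                       enum pg_level level, struct page *page)
{
        int ret;

        lockdep_assert_held_read(&kvm->mmu_lock);

        /* 1. Block the huge mapping. <= 0: busy, error, or nothing to split. */
        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page, true);
        if (ret <= 0)
                return ret;

        /* 2. TLB tracking (TDH.MEM.TRACK), as in the step list above. */
        ret = tdx_track(kvm, true);
        if (ret)
                return ret;

        /* 3. Demote the blocked mapping to the next smaller size. */
        ret = tdx_spte_demote_private_spte(kvm, gfn, level, page, true);
        if (!ret)
                return 0;

        /* 4. Error fallback: unblock to restore the original mapping. */
        tdx_sept_unzap_private_spte(kvm, gfn, level);
        return ret;
}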
Here's the detailed code for steps 1, 3 and 4.
static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, struct page *page,
                                     bool mmu_lock_shared)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
        u64 err, entry, level_state;

        /* Before TD runnable, large page is not supported */
        WARN_ON_ONCE(kvm_tdx->state != TD_STATE_RUNNABLE && level != PG_LEVEL_4K);

        if (mmu_lock_shared)
                lockdep_assert_held_read(&kvm->mmu_lock);
        else
                lockdep_assert_held_write(&kvm->mmu_lock);

        write_lock(&kvm_tdx->sept_lock);
        err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
        write_unlock(&kvm_tdx->sept_lock);

        if (unlikely(tdx_operand_busy(err))) {
                if (mmu_lock_shared)
                        return -EBUSY;

                /* After no vCPUs enter, the second retry is expected to succeed */
                write_lock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
                tdx_no_vcpus_enter_stop(kvm);
                write_unlock(&kvm_tdx->sept_lock);
        }

        if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
            !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
                atomic64_dec(&kvm_tdx->nr_premapped);
                return 0;
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
                return -EIO;
        }

        return 1;
}
static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
                                        enum pg_level level, struct page *page,
                                        bool mmu_lock_shared)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn);
        u64 err, entry, level_state;

        do {
                read_lock(&kvm_tdx->sept_lock);
                err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                          &entry, &level_state);
                read_unlock(&kvm_tdx->sept_lock);
        } while (err == TDX_INTERRUPTED_RESTARTABLE);

        if (unlikely(tdx_operand_busy(err))) {
                unsigned long flags;

                if (mmu_lock_shared)
                        return -EBUSY;

                tdx_no_vcpus_enter_start(kvm);
                read_lock(&kvm_tdx->sept_lock);
                local_irq_save(flags);
                err = tdh_mem_page_demote(&kvm_tdx->td, gpa, tdx_level, page,
                                          &entry, &level_state);
                local_irq_restore(flags);
                read_unlock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_stop(kvm);
        }

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_2(TDH_MEM_PAGE_DEMOTE, err, entry, level_state);
                return -EIO;
        }

        return 0;
}
static void tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
                                        enum pg_level level)
{
        int tdx_level = pg_level_to_tdx_sept_level(level);
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
        u64 err, entry, level_state;

        write_lock(&kvm_tdx->sept_lock);
        err = tdh_mem_range_unblock(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
        write_unlock(&kvm_tdx->sept_lock);

        if (unlikely(tdx_operand_busy(err))) {
                write_lock(&kvm_tdx->sept_lock);
                tdx_no_vcpus_enter_start(kvm);
                err = tdh_mem_range_unblock(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
                tdx_no_vcpus_enter_stop(kvm);
                write_unlock(&kvm_tdx->sept_lock);
        }

        if (KVM_BUG_ON(err, kvm))
                pr_tdx_error_2(TDH_MEM_RANGE_UNBLOCK, err, entry, level_state);
}
> The current zapping operation that is involved
> depends on mmu write lock. And I remember you had a POC that added essentially a
> hidden exclusive lock in TDX code as a substitute. But unlike the other callers,
Right. The kvm_tdx->sept_lock is introduced as an rwlock. The write lock is held
only for a very short period, around tdh_mem_sept_remove(), tdh_mem_range_block()
and tdh_mem_range_unblock().
The read/write usage of kvm_tdx->sept_lock mirrors the shared/exclusive usage of
the corresponding lock inside the TDX module:
Resources          SHARED users              EXCLUSIVE users
------------------------------------------------------------------------
secure_ept_lock    tdh_mem_sept_add          tdh_vp_enter
                   tdh_mem_page_aug          tdh_mem_sept_remove
                   tdh_mem_page_remove       tdh_mem_range_block
                   tdh_mem_page_promote      tdh_mem_range_unblock
                   tdh_mem_page_demote
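
For illustration, a minimal sketch of the corresponding host-side locking rule
(the _locked() wrapper names are made up; the tdh_mem_*() signatures are assumed
to match the ones used elsewhere in this series):

/*
 * Sketch only: SEAMCALLs that are SHARED users of the secure EPT lock in
 * the TDX module take kvm_tdx->sept_lock for read; tdh_mem_sept_remove(),
 * tdh_mem_range_block() and tdh_mem_range_unblock() take it for write.
 */
static u64 tdx_mem_page_aug_locked(struct kvm_tdx *kvm_tdx, gpa_t gpa,
                                   int tdx_level, struct page *page,
                                   u64 *entry, u64 *level_state)
{
        u64 err;

        read_lock(&kvm_tdx->sept_lock);         /* SHARED user */
        err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page,
                               entry, level_state);
        read_unlock(&kvm_tdx->sept_lock);

        return err;
}

static u64 tdx_mem_range_block_locked(struct kvm_tdx *kvm_tdx, gpa_t gpa,
                                      int tdx_level, u64 *entry,
                                      u64 *level_state)
{
        u64 err;

        write_lock(&kvm_tdx->sept_lock);        /* EXCLUSIVE user */
        err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level,
                                  entry, level_state);
        write_unlock(&kvm_tdx->sept_lock);

        return err;
}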
> the fault path demote case could actually handle failure. So if we just returned
> busy and didn't try to force the retry, we would just run the risk of
> interfering with TDX module sept lock? Is that the only issue with a design that
> would allows failure of demote in the fault path?
The main concern with supporting split in the fault path is avoiding unnecessary
splits, e.g., when two vCPUs try to accept the same range at different levels.
Besides that, we need to introduce 3 locks inside the TDX code (see the sketch
below):
rwlock_t sept_lock, spinlock_t no_vcpu_enter_lock, spinlock_t track_lock.
To ensure the success of unzap (to restore the state), kicking vCPUs out in the
fault path is required, which is not ideal. But with the introduced locks and the
proposed TDX module change to tdg_mem_page_accept() (as in the next comment),
the chance of invoking unzap is very low.
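
A rough sketch of the corresponding struct additions (field names are taken from
the list above; their placement and the exact scope of what no_vcpu_enter_lock
and track_lock serialize are my assumptions):

/* Sketch of the proposed lock additions to struct kvm_tdx. */
struct kvm_tdx {
        /* ... existing fields ... */

        /*
         * Read around tdh_mem_sept_add/page_aug/page_remove/page_promote/
         * page_demote; write around tdh_mem_sept_remove/range_block/
         * range_unblock (see the table above).
         */
        rwlock_t        sept_lock;

        /* Assumed: serializes tdx_no_vcpus_enter_start()/_stop() windows. */
        spinlock_t      no_vcpu_enter_lock;

        /* Assumed: serializes TDH.MEM.TRACK and the associated vCPU kicks. */
        spinlock_t      track_lock;
};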
> Let's keep in mind that we could ask for TDX module changes to enable this path.
We may need a TDX module change so that tdg_mem_page_accept() does not take the
lock on a non-ACCEPTable entry, to avoid contention with the guest and the
potential TDX_HOST_PRIORITY_BUSY_TIMEOUT error.
> I think we could probably get away with ignoring TDG.MEM.PAGE.RELEASE if we had
> a plan to fix it up with TDX module changes. And if the ultimate root cause of
> the complication is avoiding zero-step (sept lock), we should fix that instead
> of design around it further.
Ok.
> > > I think this connects the question of whether we can pass the necessary info
> > > into fault via synthetic error code. Consider this new design:
> > >
> > > - tdx_gmem_private_max_mapping_level() simply returns 4k for prefetch and pre-
> > > runnable, otherwise returns 2MB
> > Why prefetch and pre-runnable faults go the first path, while
>
> Because these are either passed into private_max_mapping_level(), or not
> associated with the fault (runnable state).
>
> >
> > > - if fault has accept info 2MB size, pass 2MB size into fault. Otherwise pass
> > > 4k (i.e. VMs that are relying on #VE to do the accept won't get huge pages
> > > *yet*).
> > other faults go the second path?
>
> This info is related to the specific fault.
>
> >
> > > What goes wrong? Seems simpler and no more stuffing fault info on the vcpu.
> > I tried to avoid the double paths.
> > IMHO, it's confusing to specify max_level from two paths.
> >
> > The fault info in vcpu_tdx isn't a real problem as it's per-vCPU.
> > An existing example in KVM is vcpu->arch.mmio_gfn.
>
> mmio_gfn isn't info about the fault though, it's info about the gfn being mmio.
> So not fault scoped.
>
> >
> > We don't need something like the vcpu->arch.mmio_gen because
> > tdx->violation_gfn_* and tdx->violation_request_level are reset in each
> > tdx_handle_ept_violation().
> >
> >
> > BTW, dug into some history:
> >
> > In v18 of TDX basic series,
> > enforcing 4KB for pre-runnable faults were done by passing PG_LEVEL_4K to
> > kvm_mmu_map_tdp_page().
> > https://lore.kernel.org/all/1a64f798b550dad9e096603e8dae3b6e8fb2fbd5.1705965635.git.isaku.yamahata@intel.com/
> > https://lore.kernel.org/all/97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com/
> >
> > For the other faults, it's done by altering max_level in kvm_mmu_do_page_fault(),
> > and Paolo asked to use the tdx_gmem_private_max_mapping_level() path.
> > https://lore.kernel.org/all/CABgObfbu1-Ok607uYdo4DzwZf8ZGVQnvHU+y9_M1Zae55K5xwQ@mail.gmail.com/
> >
> > For the patch "KVM: x86/mmu: Allow per-VM override of the TDP max page level",
> > it's initially acked by Paolo in v2, and Sean's reply is at
> > https://lore.kernel.org/all/YO3%2FgvK9A3tgYfT6@google.com .
>
> The SNP case is not checking fault info, it's closer to the other cases. I don't
> see that any of that conversation applies to this case. Can you clarify?
My concern with stuffing the error_code to pass in the fault max_level is that,
if it were a good path, the TDX basic enabling code should have been implemented
that way from the start, by always passing in 4KB.
Why did Sean say
"
Looks like SNP needs a dynamic check, i.e. a kvm_x86_ops hook, to handle an edge
case in the RMP. That's probably the better route given that this is a short-term
hack (hopefully :-D).
"
instead of suggesting that TDX enable the error-code path earlier and hardcode
the level to 4KB?
> On the subject of the whether to pass accept level into the fault, or stuff it
> on the vcpu, I'm still in the camp that it is better to pass it in the error
> code. If you disagree, let's see if we can flag down Sean and Paolo to weigh in.
Ok.
To document for further discussion with Sean and Paolo:

- Passing in max_level via tdx_gmem_private_max_mapping_level()
  Cons:
  a) needs to stuff info on the vcpu to convey the accept level.
  Pros:
  a) a uniform approach with SEV.
  b) dynamic: can take more fault info into account, e.g. is_prefetch, gfn, pfn.
  c) can increase/decrease the level for a given gfn in the same way as
     conveying the accept level.
  d) flexibility for TDX to implement advanced features, e.g.
     1. determine an accept level after certain negotiation with the guest
     2. pre-fetch memory

- Passing in max_level via the error_code (see the sketch after this list)
  Cons:
  a) still needs tdx_gmem_private_max_mapping_level() to get dynamic info.
  b) still needs info stuffed on the vcpu under certain conditions, e.g. when
     promotion fails with TDX_EPT_INVALID_PROMOTE_CONDITIONS, we can skip the
     local retry by reducing the max_level.
  c) only effective in the EPT violation path.
  Pros:
  a) currently easy to pass in the accept level info.
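
To make the error_code option concrete, here is a rough sketch of what it could
look like; the PFERR bit, the helper names and the way the ACCEPT level is
extracted are all hypothetical, not existing KVM/TDX code:

/* Hypothetical synthetic error-code bit carrying the guest's ACCEPT size. */
#define PFERR_PRIVATE_ACCEPT_2M_MASK    BIT_ULL(50)

/* Producer side (EPT violation handler), sketch only: */
static u64 tdx_fault_error_code(struct kvm_vcpu *vcpu, gpa_t gpa, u64 err)
{
        /*
         * If the EPT violation carries TDG.MEM.PAGE.ACCEPT info at 2MB,
         * allow a 2MB mapping; otherwise cap the fault at 4KB.
         * tdx_accept_level() is a hypothetical helper.
         */
        if (tdx_accept_level(vcpu, gpa) == PG_LEVEL_2M)
                err |= PFERR_PRIVATE_ACCEPT_2M_MASK;

        return err;
}

/* Consumer side (fault setup), sketch only: */
static u8 tdx_fault_max_level(u64 error_code)
{
        return (error_code & PFERR_PRIVATE_ACCEPT_2M_MASK) ? PG_LEVEL_2M :
                                                             PG_LEVEL_4K;
}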