Message-ID: <c0b086cc-3ae3-47ed-8d6f-7f7abf488f9f@linux.intel.com>
Date: Tue, 18 Nov 2025 16:59:19 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Yan Zhao <yan.y.zhao@...el.com>, "Huang, Kai" <kai.huang@...el.com>
Cc: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
 "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
 "Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com"
 <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
 "vbabka@...e.cz" <vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "seanjc@...gle.com" <seanjc@...gle.com>, "kas@...nel.org" <kas@...nel.org>,
 "pbonzini@...hat.com" <pbonzini@...hat.com>,
 "ackerleytng@...gle.com" <ackerleytng@...gle.com>,
 "michael.roth@....com" <michael.roth@....com>,
 "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
 "Yamahata, Isaku" <isaku.yamahata@...el.com>,
 "Annapurve, Vishal" <vannapurve@...gle.com>,
 "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 "Miao, Jun" <jun.miao@...el.com>, "x86@...nel.org" <x86@...nel.org>,
 "pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 12/23] KVM: x86/mmu: Introduce
 kvm_split_cross_boundary_leafs()



On 11/18/2025 2:30 PM, Yan Zhao wrote:
[...]
>>>> Something like below:
>>>>
>>>> @@ -1558,7 +1558,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>>>>   static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>>>>                                           struct kvm_mmu_page *root,
>>>>                                           gfn_t start, gfn_t end,
>>>> -                                        int target_level, bool shared)
>>>> +                                        int target_level, bool shared,
>>>> +                                        bool only_cross_boundary,
>>>> +                                        bool *split)
>>>>   {
>>>>          struct kvm_mmu_page *sp = NULL;
>>>>          struct tdp_iter iter;
>>>> @@ -1584,6 +1586,9 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>>>>                  if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
>>>>                          continue;
>>>>   
>>>> +               if (only_cross_boundary && !iter_cross_boundary(&iter, start, end))
>>>> +                       continue;
>>>> +
>>>>                  if (!sp) {
>>>>                          rcu_read_unlock();
>>>>   
>>>> @@ -1618,6 +1623,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>>>>                          goto retry;
>>>>   
>>>>                  sp = NULL;
>>>> +               *split = true;
>>>>          }
>>>>   
>>>>          rcu_read_unlock();
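
Side note: iter_cross_boundary() above is a new helper in this series and
isn't shown in the hunk. Just to make the intent explicit, a rough sketch of
what such a check might look like (purely illustrative, not necessarily the
actual implementation):

     static inline bool iter_cross_boundary(struct tdp_iter *iter,
                                            gfn_t start, gfn_t end)
     {
             gfn_t mapped_start = iter->gfn;
             gfn_t mapped_end = iter->gfn + KVM_PAGES_PER_HPAGE(iter->level);

             /*
              * A leaf "crosses the boundary" if start or end falls strictly
              * inside the GFN range mapped by the leaf, i.e. only part of
              * the leaf is covered by [start, end).
              */
             return (mapped_start < start && start < mapped_end) ||
                    (mapped_start < end && end < mapped_end);
     }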
>>> This looks more reasonable for tdp_mmu_split_huge_pages_root().
>>>
>>> Given that splitting only adds a new page to the paging structure (unlike page
>>> merging), I currently can't think of any use cases that would be broken by the
>>> lack of a TLB flush before tdp_mmu_iter_cond_resched() releases the mmu_lock.
>>>
>>> This is because:
>>> 1) if the split is triggered in a fault path, the hardware shouldn't have cached
>>>     the old huge translation.
>>> 2) if the split is triggered in a zap or convert path,
>>>     - there shouldn't be concurrent faults on the range due to the protection of
>>>       mmu_invalidate_range*.
>>>     - for concurrent splits on the same range, though the other vCPUs may
>>>       temporarily see stale huge TLB entries after they believe they have
>>>       performed a split, they will be kicked to flush the TLB soon after
>>>       tdp_mmu_split_huge_pages_root() returns in the first vCPU/host thread.
>>>       This should be acceptable since I don't see any special guest need that
>>>       relies on pure splits.
>> Perhaps we should just get straight to the point:
>>
>>    What does a "hugepage split" do, and what's the consequence of not
>>    flushing the TLB?
>>
>> Per make_small_spte(), the new child PTEs will carry all bits of the hugepage
>> PTE, except that they clear the hugepage bit (obviously) and set the 'X' bit
>> for the NX hugepage thing.
>>
>> That means if we leave the stale hugepage TLB entry, the CPU is still able to
>> find the correct PFN, and AFAICT there shouldn't be any other problem here.
The comment in tdp_mmu_split_huge_page() echoes this:

     /*
      * Replace the huge spte with a pointer to the populated lower level
      * page table. Since we are making this change without a TLB flush vCPUs
      * will see a mix of the split mappings and the original huge mapping,
      * depending on what's currently in their TLB. This is fine from a
      * correctness standpoint since the translation will be the same either
      * way.
      */
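
As a simplified illustration of why the translation is the same either way
(this is not the real make_small_spte(); names and the level handling are
simplified for the 2MB -> 4KB case, and the NX hugepage 'X' bit adjustment
mentioned above is omitted):

     /* Illustrative large-page (PS) bit, bit 7 in x86/EPT PTEs. */
     #define ILLUSTRATIVE_PS_BIT     BIT_ULL(7)

     /*
      * Each child SPTE inherits the huge SPTE's attribute bits, drops the
      * large-page bit, and points at the index-th 4KB chunk of the same
      * physical range, so a stale huge TLB entry still yields the correct
      * PFN for any GFN in the range.
      */
     static u64 illustrative_child_spte(u64 huge_spte, int index)
     {
             u64 child = huge_spte & ~ILLUSTRATIVE_PS_BIT;

             child += (u64)index << PAGE_SHIFT;
             return child;
     }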

>>    For any fault
>> due to the stale hugepage TLB entry missing the 'X' permission, AFAICT KVM will
>> just treat this as a spurious fault, which isn't nice but should be harmless.
> Right, that isn't nice, though it's harmless.
According to the SDM section "Operations that Invalidate Cached Mappings":
The following operations invalidate cached mappings as indicated:
   - ...
   - An EPT violation invalidates any guest-physical mappings (associated with
     the current EPTRTA) that would be used to translate the guest-physical
     address that caused the EPT violation. If that guest-physical address was
     the translation of a linear address, the EPT violation also invalidates any
     combined mappings for that linear address associated with the current PCID,
     the current VPID and the current EPTRTA.
   - ...

If other CPUs hold the stale hugepage TLB entry, each of them may take one
spurious fault.
Agreed that it's not nice, but there's no harm.


>
> Besides, I'm thinking of a scenario that doesn't currently exist, though.
>
>      CPU 0                                 CPU 1
> a1. split pages
> a2. write protect pages
>                                         b1. split pages
>                                         b2. write protect pages
>                                         b3. start dirty page tracking
> a3. flush TLB
> a4. start dirty page tracking
>
>
> If CPU 1 does not flush the TLB after b2 (e.g., because it finds the pages have
> already been split and write-protected by a1 & a2), it will miss some dirty
> pages.
>
> Currently CPU 1 always flushes the TLB before b3, so there's no problem.

Yes, for this write protection case, the TLB should be flushed.
And currently, all (indirect) callers of tdp_mmu_split_huge_pages_root() do the
TLB flush unconditionally.

For the TDX case, the TLB flush is done via the secure EPT hugepage split
hook.

It seems the callers can decide whether a TLB flush is needed based on the
actions they take after the hugepage split, perhaps with the help of info
about whether a split actually occurred, as sketched below.
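
A rough sketch of that idea, using the "split" out-parameter from the hunk
above (hypothetical caller, not code from this series;
kvm_flush_remote_tlbs_range() is used here just for illustration):

     /*
      * Split only the leafs crossing the [start, end) boundary, and flush
      * the TLB only if a split actually happened and the caller's next
      * action (e.g. write protection) requires it; a pure split may not.
      */
     static int split_cross_boundary_and_maybe_flush(struct kvm *kvm,
                                                     struct kvm_mmu_page *root,
                                                     gfn_t start, gfn_t end,
                                                     bool shared,
                                                     bool flush_needed)
     {
             bool split = false;
             int ret;

             ret = tdp_mmu_split_huge_pages_root(kvm, root, start, end,
                                                 PG_LEVEL_4K, shared,
                                                 true /* only_cross_boundary */,
                                                 &split);
             if (!ret && split && flush_needed)
                     kvm_flush_remote_tlbs_range(kvm, start, end - start);

             return ret;
     }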


>
>>> So I tend to agree with your suggestion though the implementation in this patch
>>> is safer.
>> I am perhaps still missing something, as I am still trying to precisely
>> understand in what cases you want to flush the TLB when splitting a hugepage.
>>
>> I kinda tend to think you eventually want to flush the TLB because eventually
>> you want to _ZAP_.  But needing to flush due to a zap and needing to flush due
>> to a split are kinda different, I think.
> Though I currently can't find any use case that depends on a split alone (e.g.,
> a feature requiring that the pages be 4KB without any additional permission
> changes), I just wanted to make the code safer in case I missed any edge cases.
>
> We surely don't want the window in which CPUs see a mix of huge pages and small
> pages to last long.
>
> Flushing the TLB before releasing the mmu_lock allows other threads operating on
> the same range to see the updated translations in a timely manner.

