Message-ID: <aRVD4fAB7NISgY+8@yzhao56-desk.sh.intel.com>
Date: Thu, 13 Nov 2025 10:35:13 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Li, Xiaoyao"
	<xiaoyao.li@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>,
	"tabba@...gle.com" <tabba@...gle.com>, "Du, Fan" <fan.du@...el.com>,
	"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "Weiny, Ira"
	<ira.weiny@...el.com>, "kas@...nel.org" <kas@...nel.org>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"Annapurve, Vishal" <vannapurve@...gle.com>, "Edgecombe, Rick P"
	<rick.p.edgecombe@...el.com>, "Miao, Jun" <jun.miao@...el.com>,
	"x86@...nel.org" <x86@...nel.org>, "pgonda@...gle.com" <pgonda@...gle.com>
Subject: Re: [RFC PATCH v2 03/23] x86/tdx: Enhance
 tdh_phymem_page_wbinvd_hkid() to invalidate huge pages

On Wed, Nov 12, 2025 at 06:29:11PM +0800, Huang, Kai wrote:
> On Wed, 2025-11-12 at 16:43 +0800, Yan Zhao wrote:
> > On Tue, Nov 11, 2025 at 05:23:30PM +0800, Huang, Kai wrote:
> > > On Thu, 2025-08-07 at 17:42 +0800, Yan Zhao wrote:
> > > > -u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > > > +u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct folio *folio,
> > > > +				unsigned long start_idx, unsigned long npages)
> > > >  {
> > > > +	struct page *start = folio_page(folio, start_idx);
> > > >  	struct tdx_module_args args = {};
> > > > +	u64 err;
> > > > +
> > > > +	if (start_idx + npages > folio_nr_pages(folio))
> > > > +		return TDX_OPERAND_INVALID;
> > > >  
> > > > -	args.rcx = mk_keyed_paddr(hkid, page);
> > > > +	for (unsigned long i = 0; i < npages; i++) {
> > > > +		args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> > > >  
> > > 
> > > Just FYI: seems there's a series to remove nth_page() completely:
> > > 
> > > https://lore.kernel.org/kvm/20250901150359.867252-1-david@redhat.com/
> > Ah, thanks!
> > Then we can get rid of the "unsigned long i".
> > 
> > -       for (unsigned long i = 0; i < npages; i++) {
> > -               args.rcx = mk_keyed_paddr(hkid, nth_page(start, i));
> > +       while (npages--) {
> > +               args.rcx = mk_keyed_paddr(hkid, start++);
> > 
> 
> You may want to be careful about doing '++' on a 'struct page *'.  I am not
Before the nth_page() removal series, the Linux kernel defined nth_page() like
this:

  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
  #define folio_page_idx(folio, p)        (page_to_pfn(p) - folio_pfn(folio))
  #else
  #define nth_page(page,n) ((page) + (n))
  #define folio_page_idx(folio, p)        ((p) - &(folio)->page)
  #endif

i.e., unless SPARSEMEM is used without SPARSEMEM_VMEMMAP, a folio's pages are
contiguous.

In David's nth_page() removal series, CONFIG_SPARSEMEM_VMEMMAP is auto-selected
along with CONFIG_SPARSEMEM on all architectures but sh.

David further ensures folio pages are contiguous even on sh with the
problematic kernel configs (i.e., SPARSEMEM without SPARSEMEM_VMEMMAP) [1]:

: Currently, only a single architectures supports ARCH_HAS_GIGANTIC_PAGE
: but not SPARSEMEM_VMEMMAP: sh.
:
: Fortunately, the biggest hugetlb size sh supports is 64 MiB
: (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
: (SECTION_SIZE_BITS == 26), so their use case is not degraded.
:
: As folios and memory sections are naturally aligned to their order-2 size
: in memory, consequently a single folio can no longer span multiple memory
: sections on these problematic kernel configs.

So it's safe to assume folio pages are contiguous.

[1] https://lore.kernel.org/kvm/20250901150359.867252-12-david@redhat.com/


> expert, but I saw below discussion on the thread [*] which led to the series
> to get rid of nth_page():
>   > I wish we didn't have nth_page() at all. I really don't think it's a
>   > valid operation. It's been around forever, but I think it was broken
>   > as introduced, exactly because I don't think you can validly even have
>   > allocations that cross section boundaries.
> 
>   Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
>   dax can with gigantic folios ... :(
> 
>   We had some weird bugs with that, because people keep forgetting that you
>   cannot just use page++ unconditionally with such folios.

I found Linus's reply to David [2]:
: On Tue, 5 Aug 2025 at 16:37, David Hildenbrand <david@...hat.com> wrote:
: >
: > Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
: > dax can with gigantic folios ... :(
: 
: Just turn that code off. Nobody sane cares.
: 
: It sounds like people have bent over backwards to fix the insane case
: instead of saying "that's insane, let's not support it".
: 
: And yes, "that's insane" is actually fairly recent. It's not that long
: ago that we made SPARSEMEM_VMEMMAP the mandatory option on x86-64. So
: it was all sane in a historical context, but it's not sane any more.
: 
: But now it *is* the mandatory option both on x86 and arm64, so I
: really think it's time to get rid of pointless pain points.
: 
: (I think powerpc still makes it an option to do sparsemem without
: vmemmap, but it *is* an option there too)

The nth_page() removal series then ensures hugetlb and dax are OK via changes
like [3]. After that, the series iterates over all pages in a hugetlb folio
with page++, e.g. [4][5].

[2] https://lore.kernel.org/all/CAHk-=wiYLcax-5THGofwk-SAWYZ1RsP08b+rozXOm0wZRCE9UQ@mail.gmail.com
[3] https://lore.kernel.org/kvm/20250901150359.867252-7-david@redhat.com
[4] https://lore.kernel.org/kvm/20250901150359.867252-14-david@redhat.com
[5] https://lore.kernel.org/kvm/20250901150359.867252-16-david@redhat.com

> So, why not just get the actual page for each index within the loop?
We need to invoke folio_page() to get the actual page.

In [6], the new folio_page() implementation is

static inline struct page *folio_page(struct folio *folio, unsigned long n)
{
	return &folio->page + n;
}

So, invoking folio_page() should be equivalent to page++ in our case.

[6] https://lore.kernel.org/kvm/20250901150359.867252-13-david@redhat.com

 
> [*]:
> https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#m49ba78f5f630b27fa6d3d0737271f047af599c60
