linux-kernel - Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEvNRgHtDJx52+KU3dZfhOMjvWxjX7eJ7WdX8y+kN+bNqpspeg@mail.gmail.com>
Date: Fri, 9 Jan 2026 10:29:47 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>, Dave Hansen <dave.hansen@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	kas@...nel.org, tabba@...gle.com, michael.roth@....com, david@...nel.org, 
	vannapurve@...gle.com, sagis@...gle.com, vbabka@...e.cz, 
	thomas.lendacky@....com, nik.borisov@...e.com, pgonda@...gle.com, 
	fan.du@...el.com, jun.miao@...el.com, francescolavra.fl@...il.com, 
	jgross@...e.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, kai.huang@...el.com, binbin.wu@...ux.intel.com, 
	chao.p.peng@...el.com, chao.gao@...el.com
Subject: Re: [PATCH v3 01/24] x86/tdx: Enhance tdh_mem_page_aug() to support
 huge pages

Yan Zhao <yan.y.zhao@...el.com> writes:

> On Wed, Jan 07, 2026 at 08:39:55AM -0800, Dave Hansen wrote:
>> On 1/7/26 01:12, Yan Zhao wrote:
>> ...
>> > However, my understanding is that it's better for functions expecting huge pages
>> > to explicitly receive "folio" instead of "page". This way, people can tell from
>> > a function's declaration what the function expects. Is this understanding
>> > correct?
>>
>> In a perfect world, maybe.
>>
>> But, in practice, a 'struct page' can still represent huge pages and
>> *does* represent huge pages all over the kernel. There's no need to cram
>> a folio in here just because a huge page is involved.
> Ok. I can modify the param "struct page *page" to "struct page *base_page",
> explaining that it may belong to a huge folio but is not necessarily the
> head page of the folio.
>
>> > Passing "start_idx" along with "folio" is due to the requirement of mapping only
>> > a sub-range of a huge folio. e.g., we allow creating a 2MB mapping starting from
>> > the nth idx of a 1GB folio.
>> >
>> > On the other hand, if we instead pass "page" to tdh_mem_page_aug() for huge
>> > pages and have tdh_mem_page_aug() internally convert it to "folio" and
>> > "start_idx", it makes me wonder if we could have previously just passed "pfn" to
>> > tdh_mem_page_aug() and had tdh_mem_page_aug() convert it to "page".
>>
>> As a general pattern, I discourage folks from using pfns and physical
>> addresses when passing around references to physical memory. They have
>> zero type safety.
>>
>> It's also not just about type safety. A 'struct page' also *means*
>> something. It means that the kernel is, on some level, aware of and
>> managing that memory. It's not MMIO. It doesn't represent the physical
>> address of the APIC page. It's not SGX memory. It doesn't have a
>> Shared/Private bit.
>>
>> All of those properties are important and they're *GONE* if you use a
>> pfn. It's even worse if you use a raw physical address.
>>
>> Please don't go back to raw integers (pfns or paddrs).
> I understood and fully accept it.
>
> I previously wondered if we could allow KVM to pass in pfn and let the SEAMCALL
> wrapper do the pfn_to_page() conversion.
> But it was just out of curiosity. I actually prefer "struct page" too.
>
>
>> >>> -	tdx_clflush_page(page);
>> >>> +	if (start_idx + npages > folio_nr_pages(folio))
>> >>> +		return TDX_OPERAND_INVALID;
>> >>
>> >> Why is this necessary? Would it be a bug if this happens?
>> > This sanity check is due to the requirement in KVM that mapping size should be
>> > no larger than the backend folio size, which ensures the mapping pages are
>> > physically contiguous with homogeneous page attributes. (See the discussion
>> > about "EPT mapping size and folio size" in thread [1]).
>> >
>> > Failure of the sanity check could only be due to bugs in the caller (KVM). I
>> > didn't convert the sanity check to an assertion because there's already a
>> > TDX_BUG_ON_2() on error following the invocation of tdh_mem_page_aug() in KVM.
>>
>> We generally don't protect against bugs in callers. Otherwise, we'd have
>> a trillion NULL checks in every function in the kernel.
>>
>> The only reason to add caller sanity checks is to make things easier to
>> debug, and those almost always include some kind of spew:
>> WARN_ON_ONCE(), pr_warn(), etc...
>
> Would it be better if I use WARN_ON_ONCE()? like this:
>
> u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *base_page,
>                      u64 *ext_err1, u64 *ext_err2)
> {
>         unsigned long npages = tdx_sept_level_to_npages(level);
>         struct tdx_module_args args = {
>                 .rcx = gpa | level,
>                 .rdx = tdx_tdr_pa(td),
>                 .r8 = page_to_phys(base_page),
>         };
>         u64 ret;
>
>         WARN_ON_ONCE(page_folio(base_page) != page_folio(base_page + npages - 1));

This WARNs if the first and last folios are not the same folio, which
still assumes something about how pages are grouped into folios. I feel
that this is still stretching TDX code over to make assumptions about
how the kernel manages memory metadata, which is more than TDX actually
cares about.

>
>         for (int i = 0; i < npages; i++)
>                 tdx_clflush_page(base_page + i);
>
>         ret = seamcall_ret(TDH_MEM_PAGE_AUG, &args);
>
>         *ext_err1 = args.rcx;
>         *ext_err2 = args.rdx;
>
>         return ret;
> }
>
> The WARN_ON_ONCE() serves 2 purposes:
> 1. Loudly warn of subtle KVM bugs.
> 2. Ensure "page_to_pfn(base_page + i) == (page_to_pfn(base_page) + i)".
>

I disagree with checking within TDX code, but if you would still like to
check, 2. that you suggested is less dependent on the concept of how the
kernel groups pages in folios, how about:

  WARN_ON_ONCE(page_to_pfn(base_page + npages - 1) !=
               page_to_pfn(base_page) + npages - 1);

The full contiguity check will scan every page, but I think this doesn't
take too many CPU cycles, and would probably catch what you're looking
to catch in most cases.

I still don't think TDX code should check. The caller should check or
know the right thing to do.

> If you don't like using "base_page + i" (as the discussion in v2 [1]), we can
> invoke folio_page() for the ith page instead.
>
> [1] https://lore.kernel.org/all/01731a9a0346b08577fad75ae560c650145c7f39.camel@intel.com/
>
>> >>> +	for (int i = 0; i < npages; i++)
>> >>> +		tdx_clflush_page(folio_page(folio, start_idx + i));
>> >>
>> >> All of the page<->folio conversions are kinda hurting my brain. I think
>> >> we need to decide what the canonical type for these things is in TDX, do
>> >> the conversion once, and stick with it.
>> > Got it!
>> >
>> > Since passing in base "page" or base "pfn" may still require the
>> > wrappers/helpers to internally convert them to "folio" for sanity checks, could
>> > we decide that "folio" and "start_idx" are the canonical params for functions
>> > expecting huge pages? Or do you prefer KVM to do the sanity check by itself?
>>
>> I'm not convinced the sanity check is a good idea in the first place. It
>> just adds complexity.
> I'm worried about subtle bugs introduced by careless coding that might be
> silently ignored otherwise, like the one in thread [2].
>
> [2] https://lore.kernel.org/kvm/aV2A39fXgzuM4Toa@google.com/