Message-ID: <77b79116-951a-7ff9-c19b-73af2af98ce9@intel.com>
Date:   Mon, 7 Nov 2022 15:30:36 -0800
From:   Dave Hansen <dave.hansen@...el.com>
To:     Erdem Aktas <erdemaktas@...gle.com>
Cc:     "Nakajima, Jun" <jun.nakajima@...el.com>,
        Guorui Yu <GuoRui.Yu@...ux.alibaba.com>,
        kirill.shutemov@...ux.intel.com, ak@...ux.intel.com, bp@...en8.de,
        dan.j.williams@...el.com, david@...hat.com,
        elena.reshetova@...el.com, hpa@...or.com,
        linux-kernel@...r.kernel.org, luto@...nel.org, mingo@...hat.com,
        peterz@...radead.org, sathyanarayanan.kuppuswamy@...ux.intel.com,
        seanjc@...gle.com, tglx@...utronix.de, thomas.lendacky@....com,
        x86@...nel.org
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the
 private memory

On 11/7/22 14:53, Erdem Aktas wrote:
> On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <dave.hansen@...el.com> wrote:
>> Could you please elaborate a bit on what you think the distinction is
>> between:
>>
>>         * Accept on first use
>> and
>>         * Accept on allocation
>>
>> Surely, for the vast majority of memory, it's allocated and then used
>> pretty quickly.  As in, most allocations are __GFP_ZERO so they're
>> allocated and "used" before they even leave the allocator.  So, in
>> practice, they're *VERY* close to equivalent.
>>
>> Where do you see them diverging?  Why does it matter?
> 
> For a VM with a very large memory size, let's say close to 800G of
> memory, it might take a really long time to finish the initialization.
> If all allocations are __GFP_ZERO, then I agree it would not matter,
> but -- I need to run some benchmarks to validate -- what I remember
> is that this was not what we were observing. Let me run a few tests
> to provide more input on this, but meanwhile, if you have already run
> some benchmarks, that would be great.
> 
> What I see in the code is that the "accept_page" function will zero
> all the unaccepted pages even if the __GFP_ZERO flag is not set, and
> if __GFP_ZERO is set, we will zero all those pages again. I see a lot
> of concerning comments like "Page acceptance can be very slow."

I'm not following you at all here.  Yeah, page acceptance is very slow.
But, the slowest part is probably the cache coherency dance that the
TDX module has to do, flushing and zeroing all the memory to initialize
the new integrity metadata.  Second to that is the cost of the TDCALL.
Third is the cost of the #VE.

Here's what Kirill is proposing, in some pseudocode:

	alloc_page(order=0, __GFP_ZERO) {
		TD.accept(size=4M) {
			// TDX Module clflushes/zeroes 4M of memory
		}
		memset(4k);
		// leave 1023 accepted 4k pages in the allocator
	}

To accept 4M of memory, you do one TDCALL.  You do zero #VE's.  Using
the #VE handler, you do:

	alloc_page(order=0, __GFP_ZERO) {
		memset(4k) {
			-> #VE handler
			TD.accept(size=4k); // flush/zero 4k
		}
		// only 4k was accepted
	}
	... Take 1023 more #VE's later on for each 4k page

You do 1024 #VE's and 1024 TDCALLs.  So, let's summarize.  To do 4M
worth of 4k pages, here's how the two approaches break down if
__GFP_ZERO is in play:

                   #VE   Accept-in-allocator
#VE's:            1024                     0
TDCALLs:          1024                     1
clflushes:   4k x 1024             4k x 1024
memset()s:   4k x 1024             4k x 1024

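If it helps to see that bookkeeping spelled out, here's a small,
self-contained tally of the two paths for 4M worth of 4k pages with
__GFP_ZERO.  The names and the 4M chunk size are just for illustration;
it models nothing beyond the counts in the table above -- no real TDX
module and no real page allocator:

	/*
	 * Stand-alone tally of the two acceptance strategies for 4M of
	 * 4k pages with __GFP_ZERO.  No real TDX calls are made; this
	 * only reproduces the bookkeeping from the table above.
	 */
	#include <stdio.h>

	#define PAGE_SIZE	(4 * 1024)
	#define ACCEPT_CHUNK	(4 * 1024 * 1024)	/* one TD.accept covers this */
	#define NR_PAGES	(ACCEPT_CHUNK / PAGE_SIZE)	/* 1024 */

	struct tally {
		unsigned long ve, tdcall, flushed, zeroed;
	};

	/* #VE path: first touch of each 4k page faults and accepts that page. */
	static struct tally accept_on_first_use(void)
	{
		struct tally t = { 0 };

		for (int i = 0; i < NR_PAGES; i++) {
			t.ve++;				/* #VE on first touch */
			t.tdcall++;			/* TD.accept(size=4k) */
			t.flushed += PAGE_SIZE;		/* TDX module flush/zero */
			t.zeroed += PAGE_SIZE;		/* __GFP_ZERO memset() */
		}
		return t;
	}

	/* Accept-in-allocator: one TD.accept covers the whole 4M chunk up front. */
	static struct tally accept_on_allocation(void)
	{
		struct tally t = { 0 };

		t.tdcall++;				/* one TD.accept(size=4M) */
		t.flushed += ACCEPT_CHUNK;		/* same total flush/zero work */
		for (int i = 0; i < NR_PAGES; i++)
			t.zeroed += PAGE_SIZE;		/* still one 4k memset() per page */
		return t;
	}

	int main(void)
	{
		struct tally ve = accept_on_first_use();
		struct tally alloc = accept_on_allocation();

		printf("                 #VE   Accept-in-allocator\n");
		printf("#VE's:      %8lu   %8lu\n", ve.ve, alloc.ve);
		printf("TDCALLs:    %8lu   %8lu\n", ve.tdcall, alloc.tdcall);
		printf("clflushed:  %7luk   %7luk\n", ve.flushed / 1024, alloc.flushed / 1024);
		printf("memset:     %7luk   %7luk\n", ve.zeroed / 1024, alloc.zeroed / 1024);
		return 0;
	}

Either way the flush/zero and memset() totals come out identical; the
only columns that move are the 1024-vs-0 #VE's and the 1024-vs-1
TDCALLs.
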
The *ONLY* downside of accept-at-allocate as implemented is that it does
4M at a time, so the TDCALL is long compared to a 4k one.  But, this is
a classic bandwidth versus latency compromise.  In this case, we choose
bandwidth.

*Both* cases need to memset() the same amount of memory.  Both cases
only memset() 4k at a time.

The *ONLY* way the #VE approach is better is if you allocate 4k and then
never touch the rest of the 4M page.  That might happen, maybe *ONE*
time per zone.  But the rest of the time, the amortization of the TDCALL
cost is going to win.

I'll be shocked if any benchmarking turns up another result.
