linux-kernel - Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory

Message-ID: <ac179589-bd60-a47c-2d3b-78992b6cbea2@intel.com>
Date:   Fri, 4 Nov 2022 15:50:13 -0700
From:   Dave Hansen <dave.hansen@...el.com>
To:     Erdem Aktas <erdemaktas@...gle.com>,
        "Nakajima, Jun" <jun.nakajima@...el.com>
Cc:     Guorui Yu <GuoRui.Yu@...ux.alibaba.com>,
        kirill.shutemov@...ux.intel.com, ak@...ux.intel.com, bp@...en8.de,
        dan.j.williams@...el.com, david@...hat.com,
        elena.reshetova@...el.com, hpa@...or.com,
        linux-kernel@...r.kernel.org, luto@...nel.org, mingo@...hat.com,
        peterz@...radead.org, sathyanarayanan.kuppuswamy@...ux.intel.com,
        seanjc@...gle.com, tglx@...utronix.de, thomas.lendacky@....com,
        x86@...nel.org
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the
 private memory

On 11/4/22 15:36, Erdem Aktas wrote:
> On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> <kirill.shutemov@...ux.intel.com> wrote:
>> +        *
>> +        * Kernel has no legitimate use-cases for #VE on private memory. It is
>> +        * either a guest kernel bug (like access of unaccepted memory) or
>> +        * malicious/buggy VMM that removes guest page that is still in use.
>> +        *
> 
> I think this statement is too strong, and I have a few concerns about
> this approach. I understand that there is an issue with handling #VEs
> on private pages, but it seems like we are just hiding the problem with
> this approach instead of fixing it (I do not have a fix in mind).
> First, there is a feature of injecting #VE to handle unaccepted pages
> at runtime and accept them on demand. Now the statement is saying this
> was an unnecessary feature (why is it there at all, then?) as there is
> no legitimate use case.

We're doing on-demand page acceptance.  We just don't need a #VE to
drive it.  Why is it in the TDX module then?  Inertia?  Because it got
too far along in the process before anyone asked me or some of the other
x86 kernel folks to look at it hard.
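
As a rough illustration of on-demand acceptance that is driven by the
allocator rather than by a fault, here is a minimal user-space sketch.
It is a toy model, not kernel or TDX module code: every name in it
(toy_unaccepted, toy_accept_block(), toy_alloc_block(), and so on) is
hypothetical, and toy_accept_block() only flips a flag where a real
guest would issue the acceptance call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_BLOCKS 8

/* Hypothetical model state: true = block not yet accepted by the guest. */
static bool toy_unaccepted[TOY_NR_BLOCKS];
static int toy_accept_calls;	/* counts simulated acceptance calls */
static size_t toy_next_free;	/* cursor of a trivial bump allocator */

/* Reset the model: every block starts out unaccepted. */
void toy_reset(void)
{
	for (size_t i = 0; i < TOY_NR_BLOCKS; i++)
		toy_unaccepted[i] = true;
	toy_accept_calls = 0;
	toy_next_free = 0;
}

/* Stand-in for the real acceptance operation; it only updates the model. */
void toy_accept_block(size_t block)
{
	assert(block < TOY_NR_BLOCKS);
	if (toy_unaccepted[block]) {
		toy_unaccepted[block] = false;
		toy_accept_calls++;
	}
}

/*
 * Hand out the next block, accepting it before the caller ever sees it.
 * Returns the block index, or -1 when the toy pool is exhausted.
 */
int toy_alloc_block(void)
{
	if (toy_next_free >= TOY_NR_BLOCKS)
		return -1;

	size_t block = toy_next_free++;

	toy_accept_block(block);
	return (int)block;
}

/* In this model an access is legal only on an accepted block. */
bool toy_access_ok(size_t block)
{
	return block < TOY_NR_BLOCKS && !toy_unaccepted[block];
}
```

In this model, acceptance is still lazy (blocks that are never
allocated are never accepted), yet no access to an allocated block can
hit unaccepted memory, so no exception handler is involved.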

> I wonder if this will limit how we can implement the lazy TDACCEPT.
> There are multiple ideas floating around now.
> https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> has 3 implementation choices, of which the "Accept a block of memory on
> the first use." option is implemented.  Actually, it says "Accept a block
> of memory on the first use." but it is implemented as "Accept a block
> of memory on the first allocation".  The comments in this code also
> raise concerns about performance.
> 
> As of now, we do not know which one of those ideas will provide
> acceptable performance for booting large VMs. If the performance
> overhead is high, we can always implement lazy TDACCEPT so that the
> first time a guest accesses unaccepted memory, the #VE handler does
> the TDACCEPT.

Could you please elaborate a bit on what you think the distinction is
between:

	* Accept on first use
and
	* Accept on allocation

Surely, for the vast majority of memory, it's allocated and then used
pretty quickly.  As in, most allocations are __GFP_ZERO so they're
allocated and "used" before they even leave the allocator.  So, in
practice, they're *VERY* close to equivalent.

Where do you see them diverging?  Why does it matter?
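
To make the "allocated and then used pretty quickly" point concrete,
here is a small user-space sketch. It is only a model under stated
assumptions: struct sim_page, sim_alloc(), sim_write() and the logical
clock are all hypothetical, and the "zero" flag merely stands in for
__GFP_ZERO. It records when a page is allocated and when it is first
touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 64
#define SIM_NR_PAGES  4

struct sim_page {
	bool accepted;		/* model state: accepted by the guest? */
	bool allocated;
	int alloc_tick;		/* logical time of allocation */
	int first_use_tick;	/* logical time of first access, -1 if none */
	unsigned char data[SIM_PAGE_SIZE];
};

static struct sim_page sim_pages[SIM_NR_PAGES];
static int sim_clock;		/* logical clock, advanced per operation */

void sim_reset(void)
{
	memset(sim_pages, 0, sizeof(sim_pages));
	for (size_t i = 0; i < SIM_NR_PAGES; i++)
		sim_pages[i].first_use_tick = -1;
	sim_clock = 0;
}

/* Record an access; the assert models "no access to unaccepted memory". */
void sim_touch(struct sim_page *p)
{
	assert(p->accepted);
	if (p->first_use_tick < 0)
		p->first_use_tick = sim_clock;
}

/* Accept-on-allocation: the page is accepted before it is handed out. */
struct sim_page *sim_alloc(bool zero)
{
	for (size_t i = 0; i < SIM_NR_PAGES; i++) {
		struct sim_page *p = &sim_pages[i];

		if (p->allocated)
			continue;
		sim_clock++;
		p->allocated = true;
		p->accepted = true;
		p->alloc_tick = sim_clock;
		if (zero) {
			/* Zeroing is itself the first use of the page. */
			memset(p->data, 0, sizeof(p->data));
			sim_touch(p);
		}
		return p;
	}
	return NULL;
}

/* A later write by the caller of the allocator. */
void sim_write(struct sim_page *p, unsigned char value)
{
	sim_clock++;
	sim_touch(p);
	p->data[0] = value;
}
```

For a zeroed allocation, first_use_tick equals alloc_tick in this
model, so "accept on first use" and "accept on allocation" pick the
same moment. They only differ for non-zeroed pages, and then only by
however long the caller waits before its first access.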

> I am not trying to solve the lazy TDACCEPT problem here. All I am
> trying to say is that there might be legitimate use cases for #VE on
> private memory, and this patch limits any future improvement we might
> need to make to the lazy TDACCEPT implementation.

The kernel can't take exceptions on arbitrary memory accesses.  I have
*ZERO* idea how to handle page acceptance on an access to a per-cpu
variable referenced in syscall entry, or the NMI stack when we've
interrupted kernel code with a user GSBASE value.

So, we either find *ALL* the kernel memory that needs to be pre-accepted
at allocation time (like kernel stacks) or we just say that all
allocated memory needs to be accepted before we let it be allocated.

One of those is really easy.  The other means a boatload of code audits.
I know.  I had to do that kind of exercise to get KPTI to work.  I
don't want to do it again.  It was worth it for KPTI when the world was
on fire.  TDX isn't that important IMNHO.  There's an easier way.