Date:   Mon, 7 Nov 2022 14:53:37 -0800
From:   Erdem Aktas <erdemaktas@...gle.com>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     "Nakajima, Jun" <jun.nakajima@...el.com>,
        Guorui Yu <GuoRui.Yu@...ux.alibaba.com>,
        kirill.shutemov@...ux.intel.com, ak@...ux.intel.com, bp@...en8.de,
        dan.j.williams@...el.com, david@...hat.com,
        elena.reshetova@...el.com, hpa@...or.com,
        linux-kernel@...r.kernel.org, luto@...nel.org, mingo@...hat.com,
        peterz@...radead.org, sathyanarayanan.kuppuswamy@...ux.intel.com,
        seanjc@...gle.com, tglx@...utronix.de, thomas.lendacky@....com,
        x86@...nel.org
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the
 private memory

On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 11/4/22 15:36, Erdem Aktas wrote:
> > On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov
> > <kirill.shutemov@...ux.intel.com> wrote:
> >> +        *
> >> +        * Kernel has no legitimate use-cases for #VE on private memory. It is
> >> +        * either a guest kernel bug (like access of unaccepted memory) or
> >> +        * malicious/buggy VMM that removes guest page that is still in use.
> >> +        *
> >
> > I think this statement is too strong, and I have a few concerns
> > about this approach. I understand that there is an issue with
> > handling #VEs on private pages, but it seems like we are just
> > hiding the problem with this approach instead of fixing it -- I do
> > not have any fix in mind.
> > First, there is a feature for injecting #VE to handle unaccepted
> > pages at runtime and accept them on demand; now the statement says
> > this was an unnecessary feature (why is it there at all, then?)
> > because there is no legitimate use case.
>
> We're doing on-demand page acceptance.  We just don't need a #VE to
> drive it.  Why is it in the TDX module then?  Inertia?  Because it got
> too far along in the process before anyone asked me or some of the other
> x86 kernel folks to look at it hard.
>
> > I wonder if this will limit how we can implement lazy TDACCEPT.
> > There are multiple ideas floating around now.
> > https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> > lists three implementation choices and implements the "Accept a
> > block of memory on the first use." option -- though despite the
> > name, it is actually implemented as "accept a block of memory on
> > the first allocation". The comments in this code also raise
> > concerns about the performance.
> >
> > As of now, we do not know which of those ideas will provide
> > acceptable performance for booting large VMs. If the performance
> > overhead is high, we can always implement lazy TDACCEPT so that the
> > first time a guest accesses unaccepted memory, the #VE handler does
> > the TDACCEPT.
>
> Could you please elaborate a bit on what you think the distinction is
> between:
>
>         * Accept on first use
> and
>         * Accept on allocation
>
> Surely, for the vast majority of memory, it's allocated and then used
> pretty quickly.  As in, most allocations are __GFP_ZERO so they're
> allocated and "used" before they even leave the allocator.  So, in
> practice, they're *VERY* close to equivalent.
>
> Where do you see them diverging?  Why does it matter?
>

For a VM with a very large amount of memory, say close to 800G, it
might take a really long time to finish initialization. If all
allocations were __GFP_ZERO, then I agree it would not matter, but --
I need to run some benchmarks to validate -- what I remember is that
this was not what we were observing. Let me run a few tests to provide
more input on this; meanwhile, if you have already run some
benchmarks, that would be great.

What I see in the code is that the accept_page() function will zero
all the unaccepted pages even if the __GFP_ZERO flag is not set, and
if __GFP_ZERO is set, we will zero all those pages again. I also see a
lot of concerning comments like "Page acceptance can be very slow."

What I mean by "accept on first use" is leaving the memory allocation
path as it is and using the #VE handler to accept pages the first
time they are accessed.

Let me come back with some numbers on this, which might take some time
to collect.

> > I am not trying to solve the lazy TDACCEPT problem here but all I am
> > trying to say is that, there might be legitimate use cases for #VE on
> > private memory and this patch limits any future improvement we might
> > need to do on lazy TDACCEPT implementation.
>
> The kernel can't take exceptions on arbitrary memory accesses.  I have
> *ZERO* idea how to handle page acceptance on an access to a per-cpu
> variable referenced in syscall entry, or the NMI stack when we've
> interrupted kernel code with a user GSBASE value.
>
> So, we either find *ALL* the kernel memory that needs to be pre-accepted
> at allocation time (like kernel stacks) or we just say that all
> allocated memory needs to be accepted before we let it be allocated.
>
> One of those is really easy.  The other means a boatload of code audits.
>  I know.  I had to do that kind of exercise to get KPTI to work.  I
> don't want to do it again.  It was worth it for KPTI when the world was
> on fire.  TDX isn't that important IMNHO.  There's an easier way.
