linux-kernel - Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

Message-ID: <20220317173354.rqymufl37lcrtmjh@black.fi.intel.com>
Date:   Thu, 17 Mar 2022 20:33:54 +0300
From:   "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     mingo@...hat.com, bp@...en8.de, dave.hansen@...el.com,
        luto@...nel.org, peterz@...radead.org,
        sathyanarayanan.kuppuswamy@...ux.intel.com, aarcange@...hat.com,
        ak@...ux.intel.com, dan.j.williams@...el.com, david@...hat.com,
        hpa@...or.com, jgross@...e.com, jmattson@...gle.com,
        joro@...tes.org, jpoimboe@...hat.com, knsathya@...nel.org,
        pbonzini@...hat.com, sdeep@...are.com, seanjc@...gle.com,
        tony.luck@...el.com, vkuznets@...hat.com, wanpengli@...cent.com,
        thomas.lendacky@....com, brijesh.singh@....com, x86@...nel.org,
        linux-kernel@...r.kernel.org,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>
Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest

On Thu, Mar 17, 2022 at 01:48:54AM +0100, Thomas Gleixner wrote:
> On Wed, Mar 16 2022 at 05:08, Kirill A. Shutemov wrote:
> Hmm?

Does the changed version below address your concerns?

	void tdx_get_ve_info(struct ve_info *ve)
	{
		struct tdx_module_output out;

		/*
		 * Called during #VE handling to retrieve the #VE info from the
		 * TDX module.
		 *
		 * This has to be called early in #VE handling.  A "nested" #VE which
		 * occurs before this will raise a #DF and is not recoverable.
		 *
		 * The call retrieves the #VE info from the TDX module, which also
		 * clears the "#VE valid" flag. This must be done before anything else
		 * because any #VE that occurs while the valid flag is set will lead to
		 * #DF.
		 *
		 * Note, the TDX module treats virtual NMIs as inhibited if the #VE
		 * valid flag is set. It means that NMI=>#VE will not result in a #DF.
		 */
		tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);

		/* Transfer the output parameters */
		ve->exit_reason = out.rcx;
		ve->exit_qual   = out.rdx;
		ve->gla         = out.r8;
		ve->gpa         = out.r9;
		ve->instr_len   = lower_32_bits(out.r10);
		ve->instr_info  = upper_32_bits(out.r10);
	}
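
For illustration, the split of R10 into the two 32-bit fields can be sketched in user-space C. This is only a sketch under stated assumptions, not the kernel code: lower_32_bits() and upper_32_bits() are reimplemented locally as simplified stand-ins for the kernel helpers, and the struct ve_len_info and unpack_r10() names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's lower_32_bits()/upper_32_bits(). */
static inline uint32_t lower_32_bits(uint64_t v)
{
	return (uint32_t)(v & 0xffffffffULL);
}

static inline uint32_t upper_32_bits(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

/* Hypothetical subset of struct ve_info, for illustration only. */
struct ve_len_info {
	uint32_t instr_len;	/* low half of R10 */
	uint32_t instr_info;	/* high half of R10 */
};

/* Hypothetical helper: split one 64-bit register value into both fields. */
static struct ve_len_info unpack_r10(uint64_t r10)
{
	struct ve_len_info ve;

	ve.instr_len  = lower_32_bits(r10);
	ve.instr_info = upper_32_bits(r10);

	/* Recombining the halves must give back the original value. */
	assert((((uint64_t)ve.instr_info << 32) | ve.instr_len) == r10);
	return ve;
}
```

With an input of 0x0000000500000003, instr_len is 3 and instr_info is 5.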

> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + *  * Specific instructions (WBINVD, for example)
> > + *  * Specific MSR accesses
> > + *  * Specific CPUID leaf accesses
> > + *  * Access to specific guest physical addresses
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
> > + *
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard to debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * the NMI entry code. Entry code paths do not access TD-shared memory,
> > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> > + * that might generate #VE.
> 
> I asked that before:
> 
>   "How is that enforced or validated? What checks for a violation of that
>    assumption?"
> 
> This is still exactly the same comment which is based on testing which
> did not yet explode in your face, right?

[ Disclaimer: I have a limited understanding of the entry code's
  complexity and may be missing some crucial details, but I am trying my
  best. ]

Yes, it is the same comment, but it is based on code audit, not only on
testing.

I claim that the kernel does not do anything that can possibly trigger #VE
where the kernel cannot deal with it:

 - in syscall entry code before the kernel stack is set up (a few
   instructions at the beginning of entry_SYSCALL_64())

 - in NMI entry code (asm_exc_nmi()) before NMI nesting is safe:
   + for NMI from user mode, before the switch to the thread stack
   + for NMI from kernel mode, up to end_repeat_nmi

After those points, #VE is safe.
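
To make the claim above more concrete, the windows can be thought of as instruction-pointer ranges in which #VE must never be raised. Below is a minimal user-space sketch of such a range check. It is not kernel code and not part of this patch: struct critical_window, ip_in_critical_window() and every address are hypothetical placeholders, and real ranges would have to come from the entry-code symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open [start, end) range of instruction addresses. */
struct critical_window {
	uint64_t start;
	uint64_t end;
};

/* Placeholder addresses, chosen only for illustration. */
static const struct critical_window windows[] = {
	{ 0x1000, 0x1010 },	/* stands in for the syscall gap */
	{ 0x2000, 0x2080 },	/* stands in for early NMI entry */
};

/* Return true if ip lies inside any of the listed windows. */
static bool ip_in_critical_window(uint64_t ip)
{
	for (size_t i = 0; i < sizeof(windows) / sizeof(windows[0]); i++) {
		/* Each range must be non-empty and well ordered. */
		assert(windows[i].start < windows[i].end);
		if (ip >= windows[i].start && ip < windows[i].end)
			return true;
	}
	return false;
}
```

A check of this shape only reports that a #VE landed inside a window. It does nothing to make such a #VE recoverable.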

> So what's the point of this blurb? Create expectations which are not
> accountable?

I don't have such intentions.

> The point is that any #VE in such a code path is fatal and you better
> come up with some reasonable explanation why this is not the case in
> those code pathes and how a potential violation of that assumption might
> be detected especially in rarely used corner cases. If such a violation
> is not detectable by audit, CI, static code analysis or whatever then
> document the consequences instead of pretending that the problem does
> not exist and the kernel is perfect today and forever.

It is detectable by audit. The critical windows are very limited and are
located in the highly scrutinized entry code. But, yes, I cannot guarantee
that this code will be perfect forever.

Consequences of #VE in these critical windows are mentioned in the
comment:

	Any exception in this window leads to hard to debug issues and can
	be exploited for privilege escalation. 

I have a hard time understanding what I have to change here. Do you want
the details of the audit to be documented? Should the consequences of #VE
at the wrong point be made more prominent in the comment?

-- 
 Kirill A. Shutemov