Message-ID: <20220317173354.rqymufl37lcrtmjh@black.fi.intel.com>
Date: Thu, 17 Mar 2022 20:33:54 +0300
From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: mingo@...hat.com, bp@...en8.de, dave.hansen@...el.com,
luto@...nel.org, peterz@...radead.org,
sathyanarayanan.kuppuswamy@...ux.intel.com, aarcange@...hat.com,
ak@...ux.intel.com, dan.j.williams@...el.com, david@...hat.com,
hpa@...or.com, jgross@...e.com, jmattson@...gle.com,
joro@...tes.org, jpoimboe@...hat.com, knsathya@...nel.org,
pbonzini@...hat.com, sdeep@...are.com, seanjc@...gle.com,
tony.luck@...el.com, vkuznets@...hat.com, wanpengli@...cent.com,
thomas.lendacky@....com, brijesh.singh@....com, x86@...nel.org,
linux-kernel@...r.kernel.org,
Sean Christopherson <sean.j.christopherson@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>
Subject: Re: [PATCHv6 07/30] x86/traps: Add #VE support for TDX guest
On Thu, Mar 17, 2022 at 01:48:54AM +0100, Thomas Gleixner wrote:
> On Wed, Mar 16 2022 at 05:08, Kirill A. Shutemov wrote:
> Hmm?
Does the changed version below address your concerns?
void tdx_get_ve_info(struct ve_info *ve)
{
	struct tdx_module_output out;

	/*
	 * Called during #VE handling to retrieve the #VE info from the
	 * TDX module.
	 *
	 * This has to be called early in #VE handling. A "nested" #VE which
	 * occurs before this will raise a #DF and is not recoverable.
	 *
	 * The call retrieves the #VE info from the TDX module, which also
	 * clears the "#VE valid" flag. This must be done before anything else
	 * because any #VE that occurs while the valid flag is set will lead to
	 * #DF.
	 *
	 * Note, the TDX module treats virtual NMIs as inhibited if the #VE
	 * valid flag is set. It means that NMI=>#VE will not result in a #DF.
	 */
	tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);

	/* Transfer the output parameters */
	ve->exit_reason = out.rcx;
	ve->exit_qual   = out.rdx;
	ve->gla         = out.r8;
	ve->gpa         = out.r9;
	ve->instr_len   = lower_32_bits(out.r10);
	ve->instr_info  = upper_32_bits(out.r10);
}
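To make the "#VE valid" flag semantics from the comment concrete, here is a
minimal userspace sketch. It is only a model, not kernel or TDX module code:
struct mock_tdx_module, struct mock_ve_info, mock_raise_ve() and
mock_get_ve_info() are hypothetical names invented for illustration, and the
two *_32_bits() helpers are stand-ins for the kernel macros of the same name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's lower_32_bits()/upper_32_bits(). */
static inline uint32_t lower_32_bits(uint64_t v) { return (uint32_t)v; }
static inline uint32_t upper_32_bits(uint64_t v) { return (uint32_t)(v >> 32); }

/* Hypothetical model of the #VE state the TDX module keeps for a vCPU. */
struct mock_tdx_module {
	bool ve_valid;		/* models the "#VE valid" flag */
	uint64_t r10;		/* packed instruction length and info */
};

/* Hypothetical subset of struct ve_info, enough to show the unpacking. */
struct mock_ve_info {
	uint32_t instr_len;
	uint32_t instr_info;
};

enum mock_result { MOCK_VE_DELIVERED, MOCK_DOUBLE_FAULT };

/*
 * Model of #VE delivery: a #VE that arrives while the valid flag is
 * still set turns into a #DF and the stored info is left untouched.
 */
static enum mock_result mock_raise_ve(struct mock_tdx_module *m, uint64_t r10)
{
	if (m->ve_valid)
		return MOCK_DOUBLE_FAULT;

	m->ve_valid = true;
	m->r10 = r10;
	return MOCK_VE_DELIVERED;
}

/*
 * Model of the TDX_GET_VEINFO call: copy the info out, split r10 into
 * its two 32-bit halves, and clear the valid flag so that a later #VE
 * can be delivered again.
 */
static void mock_get_ve_info(struct mock_tdx_module *m, struct mock_ve_info *ve)
{
	assert(m->ve_valid);

	ve->instr_len  = lower_32_bits(m->r10);
	ve->instr_info = upper_32_bits(m->r10);
	m->ve_valid = false;
}
```

In this model, calling mock_raise_ve() twice in a row without an intervening
mock_get_ve_info() returns MOCK_DOUBLE_FAULT on the second call, which is the
reason the real handler has to fetch the #VE info before doing anything that
could fault.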
> > +/*
> > + * Virtualization Exceptions (#VE) are delivered to TDX guests due to
> > + * specific guest actions which may happen in either user space or the
> > + * kernel:
> > + *
> > + * * Specific instructions (WBINVD, for example)
> > + * * Specific MSR accesses
> > + * * Specific CPUID leaf accesses
> > + * * Access to specific guest physical addresses
> > + *
> > + * In the settings that Linux will run in, virtualization exceptions are
> > + * never generated on accesses to normal, TD-private memory that has been
> > + * accepted.
> > + *
> > + * Syscall entry code has a critical window where the kernel stack is not
> > + * yet set up. Any exception in this window leads to hard to debug issues
> > + * and can be exploited for privilege escalation. Exceptions in the NMI
> > + * entry code also cause issues. Returning from the exception handler with
> > + * IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
> > + *
> > + * For these reasons, the kernel avoids #VEs during the syscall gap and
> > + * the NMI entry code. Entry code paths do not access TD-shared memory,
> > + * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
> > + * that might generate #VE.
>
> I asked that before:
>
> "How is that enforced or validated? What checks for a violation of that
> assumption?"
>
> This is still exactly the same comment which is based on testing which
> did not yet explode in your face, right?
[ Disclaimer: I have a limited understanding of the entry code complexity
and may miss some crucial details. But I am trying my best. ]
Yes, it is the same comment, but it is based on a code audit, not only on
testing.
I claim that the kernel does not do anything that can possibly trigger #VE
where the kernel cannot deal with it:
  - in syscall entry code before the kernel stack is set up (a few
    instructions at the beginning of entry_SYSCALL_64())
  - in NMI entry code (asm_exc_nmi()) before NMI nesting is safe:
    + for NMI from user mode, before switching to the thread stack
    + for NMI from kernel, up to end_repeat_nmi
After those points, #VE is safe.
> So what's the point of this blurb? Create expectations which are not
> accountable?
I don't have such intentions.
> The point is that any #VE in such a code path is fatal and you better
> come up with some reasonable explanation why this is not the case in
> those code pathes and how a potential violation of that assumption might
> be detected especially in rarely used corner cases. If such a violation
> is not detectable by audit, CI, static code analysis or whatever then
> document the consequences instead of pretending that the problem does
> not exist and the kernel is perfect today and forever.
It is detectable by audit. The critical windows are very limited and
located in the highly scrutinized entry code. But, yes, I cannot guarantee
that this code will be perfect forever.
The consequences of a #VE in these critical windows are mentioned in the
comment:
Any exception in this window leads to hard to debug issues and can
be exploited for privilege escalation.
I have a hard time understanding what I have to change here. Do you want
the details of the audit to be documented? Should the consequences of a #VE
at the wrong point be made more prominent in the comment?
--
Kirill A. Shutemov