Message-ID: <20231205142517.GBZW8yzVDEKIVTthSx@fat_crate.local>
Date: Tue, 5 Dec 2023 15:25:17 +0100
From: Borislav Petkov <bp@...en8.de>
To: Kai Huang <kai.huang@...el.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
dave.hansen@...el.com, kirill.shutemov@...ux.intel.com,
peterz@...radead.org, tony.luck@...el.com, tglx@...utronix.de,
mingo@...hat.com, hpa@...or.com, seanjc@...gle.com,
pbonzini@...hat.com, rafael@...nel.org, david@...hat.com,
dan.j.williams@...el.com, len.brown@...el.com, ak@...ux.intel.com,
isaku.yamahata@...el.com, ying.huang@...el.com, chao.gao@...el.com,
sathyanarayanan.kuppuswamy@...ux.intel.com, nik.borisov@...e.com,
bagasdotme@...il.com, sagis@...gle.com, imammedo@...hat.com
Subject: Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX
 #MC due to erratum

On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> +static const char *mce_memory_info(struct mce *m)
> +{
> +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> +		return NULL;
> +
> +	/*
> +	 * Certain initial generations of TDX-capable CPUs have an
> +	 * erratum. A kernel non-temporal partial write to TDX private
> +	 * memory poisons that memory, and a subsequent read of that
> +	 * memory triggers #MC.
> +	 *
> +	 * However such #MC caused by software cannot be distinguished
> +	 * from the real hardware #MC. Just print additional message
> +	 * to show such #MC may be result of the CPU erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return NULL;
> +
> +	return !tdx_is_private_mem(m->addr) ? NULL :
> +		"TDX private memory error. Possible kernel bug.";
> +}
> +
>  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  {
>  	struct llist_node *pending;
>  	struct mce_evt_llist *l;
>  	int apei_err = 0;
> +	const char *memmsg;
>
>  	/*
>  	 * Allow instrumentation around external facilities usage. Not that it
> @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	}
>  	if (exp)
>  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> +	/*
> +	 * Confidential computing platforms such as TDX platforms
> +	 * may occur MCE due to incorrect access to confidential
> +	 * memory. Print additional information for such error.
> +	 */
> +	memmsg = mce_memory_info(final);
> +	if (memmsg)
> +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> +

No, this is not how this is done. First of all, this function should be
called something like

	mce_dump_aux_info()

or so to state that it is dumping some auxiliary info.

Then, it does:

	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return tdx_get_mce_info();

or so and you put that tdx_get_mce_info() function in TDX code and there
you do all your picking apart of things, what needs to be dumped or what
not, checking whether it is a memory error and so on.
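
To make the suggested split concrete, here is a minimal userspace sketch,
not kernel code. Everything in it is an assumption made for illustration:
the trimmed-down struct mce, the two boolean flags standing in for
cpu_feature_enabled() and boot_cpu_has_bug(), the stubbed
tdx_is_private_mem() with its made-up address range, and the signature of
tdx_get_mce_info() taking the struct mce pointer. The checks themselves are
the ones from the quoted patch, moved behind the TDX helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct mce. */
struct mce {
	uint64_t addr;
	bool is_memory_error;	/* stands in for mce_is_memory_error(m) */
	bool addr_usable;	/* stands in for mce_usable_address(m) */
};

/* Assumed flags replacing cpu_feature_enabled() and boot_cpu_has_bug(). */
bool tdx_feature_enabled = true;
bool cpu_has_tdx_pw_mce_bug = true;

/* Stub: treat [0x1000, 0x2000) as TDX private memory (made-up range). */
bool tdx_is_private_mem(uint64_t addr)
{
	return addr >= 0x1000 && addr < 0x2000;
}

/*
 * Would live in TDX code: all the picking apart happens here, so the
 * MCE core does not need to know about the erratum.
 */
const char *tdx_get_mce_info(const struct mce *m)
{
	if (!m || !m->is_memory_error || !m->addr_usable)
		return NULL;

	if (!cpu_has_tdx_pw_mce_bug)
		return NULL;

	if (!tdx_is_private_mem(m->addr))
		return NULL;

	return "TDX private memory error. Possible kernel bug.";
}

/* Would live in MCE code: only decides whom to ask for auxiliary info. */
const char *mce_dump_aux_info(const struct mce *m)
{
	if (tdx_feature_enabled)
		return tdx_get_mce_info(m);

	return NULL;
}
```

With a memory error at an address inside the stubbed private range,
mce_dump_aux_info() returns the message string; for any other address, or
for a NULL record, it returns NULL, so the caller in mce_panic() only has
to test the result before printing it.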

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette