Message-ID: <807ff02d-7af0-419d-8d14-a4d6c5d5420d@intel.com>
Date: Wed, 30 Jul 2025 13:54:11 +0300
From: Adrian Hunter <adrian.hunter@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, "Luck, Tony" <tony.luck@...el.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>
CC: Borislav Petkov <bp@...en8.de>, Thomas Gleixner <tglx@...utronix.de>,
"Ingo Molnar" <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, H Peter Anvin <hpa@...or.com>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Edgecombe, Rick P"
<rick.p.edgecombe@...el.com>, "kirill.shutemov@...ux.intel.com"
<kirill.shutemov@...ux.intel.com>, "Huang, Kai" <kai.huang@...el.com>,
"Chatre, Reinette" <reinette.chatre@...el.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "tony.lindgren@...ux.intel.com"
<tony.lindgren@...ux.intel.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>, "Gao, Chao" <chao.gao@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
<seanjc@...gle.com>
Subject: Re: [PATCH 1/2] x86/mce: Fix missing address mask in recovery for
errors in TDX/SEAM non-root mode
On 27/06/2025 19:33, Dave Hansen wrote:
> On 6/27/25 09:24, Luck, Tony wrote:
>> We've been sending a combined key+address in the "mce->addr" to
>> user space for a while. Has anyone built infrastructure on top of that?
>
> I'm not sure they can do anything useful with an address that has the
> KeyID in the first place. The partitioning scheme is in an MSR, so
> they'd need to be doing silly gymnastics to even decode the address.
>
> Userspace can deal with the KeyID not being in the address. It's been
> the default for ages. So, if we take it back out, I'd expect it fixes
> more things than it breaks.
>
> So, yeah, we should carefully consider it. But it still 100% looks like
> the right thing to me to detangle the KeyID and physical address in the ABI.

Coming back to this after a bit of a break.

It feels unlikely to me that any users are expecting KeyID in mce->addr.
Looking at user space programs like mcelog and rasdaemon gives the
impression that mce->addr contains only an address.

The UAPI header file describes addr as "Bank's MCi_ADDR MSR", but what
mce_read_aux() does tends to contradict that, especially for AMD
SMCA.

But there are also additional places where it seems like MCI_ADDR_PHYSADDR
is missing:

	tdx_dump_mce_info()
		paddr_is_tdx_private()
			__seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args)
				TDH_PHYMEM_PAGE_RDMD expects KeyID bits to be zero

	skx_mce_output_error()
		edac_mc_handle_error()
			expects page_frame_number, so without KeyID

The KeyID is probably only useful for potentially identifying the TD, but
given that the TD incurs a FATAL error, that may be obvious anyway.

So removing the KeyID from mce->addr looks like the right thing to do.
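
To make the masking concrete, here is a minimal user-space sketch, not
kernel code. It assumes the KeyID occupies the bits above the usable
physical address width, and that a mask like MCI_ADDR_PHYSADDR covers
only the low phys_bits bits. The helper names physaddr_mask() and
strip_keyid() are hypothetical, as is any particular value of phys_bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build a mask with the low `phys_bits` bits set.
 * In this sketch, phys_bits is the usable physical address width,
 * i.e. assumed to already exclude the bits given to KeyIDs.
 */
static inline uint64_t physaddr_mask(unsigned int phys_bits)
{
	/* Shifting a 64-bit value by 64 is undefined in C, so special-case it. */
	return phys_bits >= 64 ? ~0ULL : (1ULL << phys_bits) - 1;
}

/*
 * Hypothetical helper: drop everything above the physical address bits,
 * which removes a KeyID stored in the upper bits of an MCi_ADDR value.
 */
static inline uint64_t strip_keyid(uint64_t mci_addr, unsigned int phys_bits)
{
	return mci_addr & physaddr_mask(phys_bits);
}
```

For example, with an assumed phys_bits of 46 and KeyID 5 placed at bit 46,
strip_keyid(((uint64_t)5 << 46) | 0x123456000, 46) gives back 0x123456000.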

Note AFAICT there are 3 kernel APIs that deal with the MCE address:

	Device /dev/mcelog which outputs struct mce
	Tracepoint mce:mce_record which outputs members from struct mce
	Tracepoint ras:mc_event where the kernel constructs the address
		from page_frame_number implying that KeyID should not be present
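
To illustrate the ras:mc_event point, here is a small user-space sketch
of splitting an address into page_frame_number and offset and putting it
back together. The SKETCH_* macros and helper names are hypothetical,
and 4 KiB pages are assumed. The point is only that KeyID bits left in
the address carry through into the page frame number.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_MASK	(~((1ULL << SKETCH_PAGE_SHIFT) - 1))

/* Page frame number: the address with the in-page offset shifted out. */
static inline uint64_t addr_to_pfn(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

/* Offset within the page: the low SKETCH_PAGE_SHIFT bits of the address. */
static inline uint64_t addr_to_offset(uint64_t addr)
{
	return addr & ~SKETCH_PAGE_MASK;
}

/* Rebuild an address from a page frame number and an in-page offset. */
static inline uint64_t pfn_to_addr(uint64_t pfn, uint64_t offset)
{
	return (pfn << SKETCH_PAGE_SHIFT) | offset;
}
```

If the input still has a KeyID at, say, bit 46, addr_to_pfn() returns a
page frame number with those bits set at bit 34 and up, so it no longer
names the page that actually took the error.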

I guess it would be sensible to ask what customers think.

Vishal, do you know anyone at Google who deals with handling machine
check information, and who might have an opinion on this?