Message-ID: <20250819172846.GA578379@yaz-khff2.amd.com>
Date: Tue, 19 Aug 2025 13:28:46 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>,
Tony Luck <tony.luck@...el.com>, pbonzini@...hat.com,
seanjc@...gle.com, vannapurve@...gle.com,
Borislav Petkov <bp@...en8.de>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, x86@...nel.org,
H Peter Anvin <hpa@...or.com>, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
rick.p.edgecombe@...el.com, kai.huang@...el.com,
reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...ux.intel.com, binbin.wu@...ux.intel.com,
ira.weiny@...el.com, isaku.yamahata@...el.com,
Fan Du <fan.du@...el.com>, yan.y.zhao@...el.com, chao.gao@...el.com
Subject: Re: [PATCH RESEND V2 1/2] x86/mce: Fix missing address mask in
recovery for errors in TDX/SEAM non-root mode

On Tue, Aug 19, 2025 at 07:24:34PM +0300, Adrian Hunter wrote:
> Commit 8a01ec97dc066 ("x86/mce: Mask out non-address bits from machine
> check bank") introduced a new #define MCI_ADDR_PHYSADDR for the mask of
> valid physical address bits within the machine check bank address register.
>
> This is particularly needed in the case of errors in TDX/SEAM non-root mode
> because the reported address contains the TDX KeyID. Refer to TDX and
> TME-MK documentation for more information about KeyIDs.
>
> Commit 7911f145de5fe ("x86/mce: Implement recovery for errors in TDX/SEAM
> non-root mode") uses the address to mark the affected page as poisoned, but
> omits to use the aforementioned mask.
>
> Investigation of user space expectations has concluded it would be more
> correct for the address to contain only address bits in the first place.
> Refer https://lore.kernel.org/r/807ff02d-7af0-419d-8d14-a4d6c5d5420d@intel.com
>
> Mask the address when it is read from the machine check bank address
> register. Do not use MCI_ADDR_PHYSADDR because that will be removed in a
> later patch.
>
> It is assumed __log_error() in arch/x86/kernel/cpu/mce/amd.c does not need
> similar treatment.
>
> Amend struct mce addr member description slightly to reflect that it is
> not, and never has been, an exact copy of the bank's MCi_ADDR MSR.
>
I think it would be more accurate to say that the MCi_ADDR MSR is not,
and never has been, guaranteed to be a system physical address.

We could introduce a new field that represents the system physical
address, if one exists for the error type. This way we can operate on a
value without assumption or additional checks. And we can keep the raw
MCi_ADDR MSR value in case it is of value to debug folks or hardware
designers. In my experience, they seem to appreciate having the full,
unfiltered data. We don't give them that today, but we can work towards
that goal.

I have some old work in this area:
https://github.com/AMDESE/linux/commit/76732c67cbf96c14f55ed1061804db9ff1505ea3
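
As a rough illustration of the two-field idea described above (this is
not the code from the linked commit, and not the kernel's struct mce),
a minimal user-space sketch in C could look like the following. Every
name here (mce_addr_sketch, phys_addr_mask, record_addr, phys_valid) is
hypothetical, and the physical address width is passed in as a plain
parameter rather than read from the CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record for illustration; not the kernel's struct mce. */
struct mce_addr_sketch {
	uint64_t raw_addr;   /* MCi_ADDR value exactly as read from hardware */
	uint64_t phys_addr;  /* system physical address, when one exists */
	bool     phys_valid; /* false if the error type has no such address */
};

/* Mask covering bits [phys_bits-1:0]; assumes phys_bits is 0..64. */
static uint64_t phys_addr_mask(unsigned int phys_bits)
{
	return phys_bits >= 64 ? UINT64_MAX : (UINT64_C(1) << phys_bits) - 1;
}

/*
 * Keep the raw value untouched and derive the physical address only
 * when the caller says the raw value actually holds one (is_phys).
 */
static struct mce_addr_sketch
record_addr(uint64_t raw_addr, unsigned int phys_bits, bool is_phys)
{
	struct mce_addr_sketch m = {
		.raw_addr   = raw_addr,
		.phys_addr  = 0,
		.phys_valid = is_phys,
	};

	if (is_phys)
		m.phys_addr = raw_addr & phys_addr_mask(phys_bits);

	return m;
}
```

With a made-up width of 46 physical address bits, a raw value whose
upper bits carry a KeyID would leave raw_addr unchanged while phys_addr
holds only the low 46 bits. Consumers that need to act on a page would
check phys_valid and use phys_addr, and debug tooling could still see
raw_addr.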

This isn't a quick fix, so maybe we can come back to it if folks are
happy with your current solution.

But I do think there's value in sharing the data as given to us by
hardware, and in providing new interfaces to users if we need to modify
something for them to take action.

Thanks,
Yazen