[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <76fe899b-73ea-4f6b-9821-84240d89b0cb@amd.com>
Date: Wed, 14 Feb 2024 09:28:54 -0500
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: yazen.ghannam@....com, tony.luck@...el.com, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, avadhut.naik@....com, john.allen@....com,
muralidhara.mk@....com, naveenkrishna.chatradhi@....com,
sathyapriya.k@....com
Subject: Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager
On 2/14/2024 4:28 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +config RAS_FMPM
>> + tristate "FRU Memory Poison Manager"
>> + default m
>> + depends on X86_MCE
>
> I know this is generic-ish but it needs to be enabled only on AMD for
> now. Whoever wants it somewhere else, then whoever needs to test it
> there first and then enable it there.
>
Ack.
>> + imply AMD_ATL
>> + help
>> + Support saving and restoring memory error information across reboot
>> + cycles using ACPI ERST as persistent storage. Error information is
>
> s/cycles//
>
Ack.
>> + saved with the UEFI CPER "FRU Memory Poison" section format.
>> +
>> + Memory may be retired during boot time and run time depending on
>
> s/may/is/
>
> Please check all your text - too many "may"s for something which is not
> a vendor doc. :)
>
Ack.
>> + platform-specific policies.
>> +
>> endif
>> diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
>> index 3fac80f58005..11f95d59d397 100644
>> --- a/drivers/ras/Makefile
>> +++ b/drivers/ras/Makefile
>> @@ -3,4 +3,5 @@ obj-$(CONFIG_RAS) += ras.o
>> obj-$(CONFIG_DEBUG_FS) += debugfs.o
>> obj-$(CONFIG_RAS_CEC) += cec.o
>>
>> +obj-$(CONFIG_RAS_FMPM) += amd/fmpm.o
>> obj-y += amd/atl/
>> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
>> new file mode 100644
>> index 000000000000..077d9f35cc7d
>> --- /dev/null
>> +++ b/drivers/ras/amd/fmpm.c
>> @@ -0,0 +1,776 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * FRU (Field-Replaceable Unit) Memory Poison Manager
>> + *
>> + * Copyright (c) 2024, Advanced Micro Devices, Inc.
>> + * All Rights Reserved.
>> + *
>> + * Authors:
>> + * Naveen Krishna Chatradhi <naveenkrishna.chatradhi@....com>
>> + * Muralidhara M K <muralidhara.mk@....com>
>> + * Yazen Ghannam <Yazen.Ghannam@....com>
>> + *
>> + * Implementation notes, assumptions, and limitations:
>> + *
>> + * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
>> + * included in the UEFI specification. So they are defined here. Afterwards, they
>> + * may be moved to linux/cper.h, if appropriate.
>> + *
>> + * - Platforms based on AMD MI300 systems will be the first to use these structures.
>> + * There are a number of assumptions made here that will need to be generalized
>> + * to support other platforms.
>> + *
>> + * AMD MI300-based platform(s) assumptions:
>> + * - Memory errors are reported through x86 MCA.
>> + * - The entire DRAM row containing a memory error should be retired.
>> + * - There will be (1) FRU Memory Poison Section per CPER.
>> + * - The FRU will be the CPU Package (Processor Socket).
>> + * - The default number of Memory Poison Descriptor entries should be (8).
>> + * - The Platform will use ACPI ERST for persistent storage.
>> + * - All FRU records should be saved to persistent storage. Module init will
>> + * fail if any FRU record is not successfully written.
>
> Please drop all that capitalized spelling.
>
For which parts? The acronyms, structure names, or general things like package/socket?
Or all the above?
>> + * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
>> + * to use this module for other vendors.
>
> This is not needed.
>
Ack.
>> + * - Boot time memory retirement may occur later than ideal due to dependencies
>> + * on other libraries and drivers. This leaves a gap where bad memory may be
>> + * accessed during early boot stages.
>> + *
>> + * - Enough memory should be pre-allocated for each FRU record to be able to hold
>> + * the expected number of descriptor entries. This, mostly empty, record is
>> + * written to storage during init time. Subsequent writes to the same record
>> + * should allow the Platform to update the stored record in-place. Otherwise,
>> + * if the record is extended, then the Platform may need to perform costly memory
>> + * management operations on the storage. For example, the Platform may spend time
>> + * in Firmware copying and invalidating memory on a relatively slow SPI ROM.
>
> That's a good thing to have here.
>
Okay.
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/cper.h>
>> +#include <linux/ras.h>
>> +
>> +#include <acpi/apei.h>
>> +
>> +#include <asm/cpu_device_id.h>
>> +#include <asm/mce.h>
>> +
>> +#pragma pack(1)
>
> Is that some ugly thing to avoid adding __packed annotation to the
> structure definitions below?
>
Yes, but __packed could be used too.
> "GCC supports several types of pragmas, primarily in order to compile
> code originally written for other compilers. Note that in general we do
> not recommend the use of pragmas; See Declaring Attributes of Functions,
> for further explanation. "
>
> Oh, that 1 is something else:
>
> -fpack-struct[=n]
>
> Without a value specified, pack all structure members together
> without holes. When a value is specified (which must be a small
> power of two), pack structure members according to this value,
> representing the maximum alignment (that is, objects with default
> alignment requirements larger than this are output potentially
> unaligned at the next fitting location.
>
> So do I understand it correctly that struct members should be aligned to
> 2^1 bytes?
>
Yes, no padding and no reordering too, I think.
> Grepping the tree, this looks like something BIOS does...
>
The BIOS does do this on its side. We need to make sure to do it on the kernel
side.
See <linux/cper.h> and <acpi/actbl1.h> for examples.
Thanks,
Yazen
Powered by blists - more mailing lists