Message-ID: <76fe899b-73ea-4f6b-9821-84240d89b0cb@amd.com>
Date: Wed, 14 Feb 2024 09:28:54 -0500
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: yazen.ghannam@....com, tony.luck@...el.com, linux-edac@...r.kernel.org,
 linux-kernel@...r.kernel.org, avadhut.naik@....com, john.allen@....com,
 muralidhara.mk@....com, naveenkrishna.chatradhi@....com,
 sathyapriya.k@....com
Subject: Re: [PATCH 2/2] RAS: Introduce the FRU Memory Poison Manager

On 2/14/2024 4:28 AM, Borislav Petkov wrote:
> On Tue, Feb 13, 2024 at 09:35:16PM -0600, Yazen Ghannam wrote:
>> +config RAS_FMPM
>> +	tristate "FRU Memory Poison Manager"
>> +	default m
>> +	depends on X86_MCE
> 
> I know this is generic-ish but it needs to be enabled only on AMD for
> now. Whoever wants it somewhere else, then whoever needs to test it
> there first and then enable it there.
>

Ack.
  
>> +	imply AMD_ATL
>> +	help
>> +	  Support saving and restoring memory error information across reboot
>> +	  cycles using ACPI ERST as persistent storage. Error information is
> 
> s/cycles//
>

Ack.
  
>> +	  saved with the UEFI CPER "FRU Memory Poison" section format.
>> +
>> +	  Memory may be retired during boot time and run time depending on
> 
> s/may/is/
> 
> Please check all your text - too many "may"s for something which is not
> a vendor doc. :)
>

Ack.
  
>> +	  platform-specific policies.
>> +
>>   endif
>> diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
>> index 3fac80f58005..11f95d59d397 100644
>> --- a/drivers/ras/Makefile
>> +++ b/drivers/ras/Makefile
>> @@ -3,4 +3,5 @@ obj-$(CONFIG_RAS)	+= ras.o
>>   obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
>>   obj-$(CONFIG_RAS_CEC)	+= cec.o
>>   
>> +obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
>>   obj-y			+= amd/atl/
>> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
>> new file mode 100644
>> index 000000000000..077d9f35cc7d
>> --- /dev/null
>> +++ b/drivers/ras/amd/fmpm.c
>> @@ -0,0 +1,776 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * FRU (Field-Replaceable Unit) Memory Poison Manager
>> + *
>> + * Copyright (c) 2024, Advanced Micro Devices, Inc.
>> + * All Rights Reserved.
>> + *
>> + * Authors:
>> + *	Naveen Krishna Chatradhi <naveenkrishna.chatradhi@....com>
>> + *	Muralidhara M K <muralidhara.mk@....com>
>> + *	Yazen Ghannam <Yazen.Ghannam@....com>
>> + *
>> + * Implementation notes, assumptions, and limitations:
>> + *
>> + * - FRU Memory Poison Section and Memory Poison Descriptor definitions are not yet
>> + *   included in the UEFI specification. So they are defined here. Afterwards, they
>> + *   may be moved to linux/cper.h, if appropriate.
>> + *
>> + * - Platforms based on AMD MI300 systems will be the first to use these structures.
>> + *   There are a number of assumptions made here that will need to be generalized
>> + *   to support other platforms.
>> + *
>> + *   AMD MI300-based platform(s) assumptions:
>> + *   - Memory errors are reported through x86 MCA.
>> + *   - The entire DRAM row containing a memory error should be retired.
>> + *   - There will be (1) FRU Memory Poison Section per CPER.
>> + *   - The FRU will be the CPU Package (Processor Socket).
>> + *   - The default number of Memory Poison Descriptor entries should be (8).
>> + *   - The Platform will use ACPI ERST for persistent storage.
>> + *   - All FRU records should be saved to persistent storage. Module init will
>> + *     fail if any FRU record is not successfully written.
> 
> Please drop all that capitalized spelling.
>

Which parts? The acronyms, the structure names, or general terms like package/socket?
Or all of the above?
  
>> + * - Source code will be under 'drivers/ras/amd/' unless and until there is interest
>> + *   to use this module for other vendors.
> 
> This is not needed.
>

Ack.
  
>> + * - Boot time memory retirement may occur later than ideal due to dependencies
>> + *   on other libraries and drivers. This leaves a gap where bad memory may be
>> + *   accessed during early boot stages.
>> + *
>> + * - Enough memory should be pre-allocated for each FRU record to be able to hold
>> + *   the expected number of descriptor entries. This, mostly empty, record is
>> + *   written to storage during init time. Subsequent writes to the same record
>> + *   should allow the Platform to update the stored record in-place. Otherwise,
>> + *   if the record is extended, then the Platform may need to perform costly memory
>> + *   management operations on the storage. For example, the Platform may spend time
>> + *   in Firmware copying and invalidating memory on a relatively slow SPI ROM.
> 
> That's a good thing to have here.
>

Okay.
  
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/cper.h>
>> +#include <linux/ras.h>
>> +
>> +#include <acpi/apei.h>
>> +
>> +#include <asm/cpu_device_id.h>
>> +#include <asm/mce.h>
>> +
>> +#pragma pack(1)
> 
> Is that some ugly thing to avoid adding __packed annotation to the
> structure definitions below?
> 

Yes, but __packed could be used too.

> "GCC supports several types of pragmas, primarily in order to compile
> code originally written for other compilers. Note that in general we do
> not recommend the use of pragmas; See Declaring Attributes of Functions,
> for further explanation. "
> 
> Oh, that 1 is something else:
> 
> -fpack-struct[=n]
> 
>      Without a value specified, pack all structure members together
>      without holes. When a value is specified (which must be a small
>      power of two), pack structure members according to this value,
>      representing the maximum alignment (that is, objects with default
>      alignment requirements larger than this are output potentially
>      unaligned at the next fitting location.
> 
> So do I understand it correctly that struct members should be aligned to
> 2^1 bytes?
>

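Not quite. The 1 is the maximum member alignment in bytes, so members are
byte-aligned with no padding between them. Member order is unchanged either way.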
  
> Grepping the tree, this looks like something BIOS does...
>

The BIOS does do this on its side. We need to make sure to do it on the kernel
side.

See <linux/cper.h> and <acpi/actbl1.h> for examples.
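
To illustrate the layout point above, here is a minimal userspace sketch. It is
not taken from the patch: the struct names and fields are made up for
illustration, and it assumes GCC or Clang. It shows that "#pragma pack(1)" caps
member alignment at 1 byte (so no padding is inserted), and that the packed
attribute, which the kernel's __packed expands to, gives the same layout. C keeps
struct members in declaration order in every case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with natural alignment: the compiler pads after 'tag'. */
struct rec_natural {
	uint8_t  tag;
	uint64_t addr;
};

/* Same fields under pack(1): maximum member alignment is 1 byte, so no padding. */
#pragma pack(push, 1)
struct rec_pragma {
	uint8_t  tag;
	uint64_t addr;
};
#pragma pack(pop)

/* Same fields with the attribute form; the kernel's __packed expands to this. */
struct rec_attr {
	uint8_t  tag;
	uint64_t addr;
} __attribute__((packed));

/* Both packed variants are exactly 1 + 8 bytes, with 'addr' at offset 1. */
_Static_assert(sizeof(struct rec_pragma) == 9, "pragma pack(1) removes padding");
_Static_assert(sizeof(struct rec_attr) == 9, "packed attribute removes padding");
_Static_assert(offsetof(struct rec_pragma, addr) == 1, "addr follows tag directly");
_Static_assert(offsetof(struct rec_attr, addr) == 1, "addr follows tag directly");

/* The natural layout is larger because of the padding after 'tag'. */
_Static_assert(sizeof(struct rec_natural) > 9, "natural layout contains padding");
```

This compiles as-is as a translation unit; the static assertions fail the build
if the layout differs. One practical difference: the pragma applies to every
structure defined until it is popped or reset, while the attribute is scoped to
the one definition it annotates.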
  
Thanks,
Yazen
