Date: Fri, 9 Feb 2024 13:52:21 -0600
From: "Naik, Avadhut" <avadnaik@....com>
To: Sohil Mehta <sohil.mehta@...el.com>, x86@...nel.org,
 linux-edac@...r.kernel.org
Cc: bp@...en8.de, tony.luck@...el.com, linux-kernel@...r.kernel.org,
 yazen.ghannam@....com, Avadhut Naik <avadhut.naik@....com>
Subject: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

Hi,

On 2/8/2024 15:09, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
> 
>> Extend the size of the MCE Records pool to better serve modern systems. The
>> increase in size depends on the CPU count of the system. Currently, since the
>> size of struct mce is 124 bytes, each logical CPU of the system will have
>> space for at least 2 MCE records available in the pool. To get around
>> allocation woes during early boot, the extension is undertaken through
>> late_initcall().
>>
> 
> I guess making this proportional to the number of CPUs is probably fine
> assuming CPUs and memory capacity *would* generally increase in sync.
> 
> But is there some logic to having 2 MCE records per logical CPU, or is
> it just a heuristic? In practice, the pool is shared amongst all MCE
> sources and can be filled by anyone, right?
> 
Yes, the pool is shared among all MCE sources, but the logic behind 256 is
that the genpool was set to 2 pages, i.e. 8192 bytes, back in 2015. Around
that time, AFAIK, the maximum number of logical CPUs on a system was 32.
So, in the maximum case, each CPU would have around 256 bytes (8192/32) in
the pool, which equates to approximately 2 MCE records, since
sizeof(struct mce) back then was 88 bytes.
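
Spelled out, the original sizing works out to (historical numbers, to the
best of my recollection):

    2 pages              = 2 * 4096 bytes = 8192 bytes
    8192 bytes / 32 CPUs = 256 bytes per CPU
    256 bytes / 88 bytes = 2 whole records per CPU
                           (sizeof(struct mce) was 88 bytes in 2015)

That per-CPU budget is what CPU_GEN_MEMSZ = 256 in the patch preserves.
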
>> Signed-off-by: Avadhut Naik <avadhut.naik@....com>
>> ---
>>  arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>  arch/x86/kernel/cpu/mce/genpool.c  | 22 ++++++++++++++++++++++
>>  arch/x86/kernel/cpu/mce/internal.h |  1 +
>>  3 files changed, 26 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index b5cc557cfc37..5d6d7994d549 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
>>  	if (mca_cfg.recovery)
>>  		enable_copy_mc_fragile();
>>  
>> +	if (mce_gen_pool_extend())
>> +		pr_info("Couldn't extend MCE records pool!\n");
>> +
> 
> Why do this unconditionally? For a vast majority of low core-count, low
> memory systems the default 2 pages would be good enough.
> 
> Should there be a threshold beyond which the extension becomes active?
> Let's say, for example, a check for num_present_cpus() > 32 (roughly
> based on the 8KB pool and a 124b * 2 estimate per logical CPU).
> 
> Whatever you choose, a comment above the code would be helpful
> describing when the extension is expected to be useful.
> 
I put it in unconditionally because, IMO, the increase in memory even for
low-core-count systems didn't seem substantial: just one additional page
for systems with fewer than 16 CPUs.

But I do get your point. Will add a check in mcheck_late_init() for the
number of CPUs present. Something like below:

@@ -2901,7 +2901,7 @@ static int __init mcheck_late_init(void)
    if (mca_cfg.recovery)
        enable_copy_mc_fragile();

-   if (mce_gen_pool_extend())
+   if ((num_present_cpus() > 32) && mce_gen_pool_extend())
        pr_info("Couldn't extend MCE records pool!\n");

Does this look good? The genpool extension would then be undertaken only
on systems with more than 32 CPUs. Will explain this in a comment (see
the sketch below).
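
For instance, the gated call plus its comment could look roughly like this
(a sketch only; the final comment wording may differ):

	/*
	 * Extend the MCE records pool on systems with more than 32 CPUs.
	 * Smaller systems are adequately served by the default 2-page
	 * pool allocated in mce_gen_pool_create().
	 */
	if ((num_present_cpus() > 32) && mce_gen_pool_extend())
		pr_info("Couldn't extend MCE records pool!\n");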

>>  	mcheck_debugfs_init();
>>  
>>  	/*
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..aed01612d342 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -20,6 +20,7 @@
>>   * 2 pages to save MCE events for now (~80 MCE records at most).
>>   */
>>  #define MCE_POOLSZ	(2 * PAGE_SIZE)
>> +#define CPU_GEN_MEMSZ	256
>>  
> 
> The comment above MCE_POOLSZ probably needs a complete re-write. Right
> now, it reads as follows:
> 
> * This memory pool is only to be used to save MCE records in MCE context.
> * MCE events are rare, so a fixed size memory pool should be enough. Use
> * 2 pages to save MCE events for now (~80 MCE records at most).
> 
> Apart from the numbers being incorrect since sizeof(struct mce) has
> increased, this patch is based on the assumption that the current MCE
> memory pool is no longer enough in certain cases.
> 
Yes, will change the comment to something like below:

 * This memory pool is only to be used to save MCE records in MCE context.
 * Though MCE events are rare, their frequency typically scales with the
 * system's memory and CPU count.
 * Allocate 2 pages for the MCE records pool during early boot, with the
 * option to extend the pool, as needed, through the command line, for
 * systems with a CPU count of more than 32.
 * By default, each logical CPU can then have around 2 MCE records in the
 * pool at the same time.

Sounds good?

>>  static struct gen_pool *mce_evt_pool;
>>  static LLIST_HEAD(mce_event_llist);
>> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
>>  	return 0;
>>  }
>>  
>> +int mce_gen_pool_extend(void)
>> +{
>> +	unsigned long addr, len;
> 
> s/len/size/
> 
Noted.
>> +	int ret = -ENOMEM;
>> +	u32 num_threads;
>> +
>> +	num_threads = num_present_cpus();
>> +	len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
> 
> Nit: Can the use of the num_threads variable be avoided?
> How about:
> 
> 	size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);
> 
Will do.
> 
> 
>> +	addr = (unsigned long)kzalloc(len, GFP_KERNEL);
> 
> Also, shouldn't the new allocation be incremental to the 2 pages already
> present?
> 
> Let's say, for example, that you have a 40-CPU system. The calculated
> size in this case comes out to 40 * 2 * 124b = 9920 bytes, i.e. 3 pages.
> You only need to allocate 1 additional page to add to mce_evt_pool
> instead of the 3 pages that the current code does.
> 
Will make the allocation incremental when the genpool extension is
undertaken through the default means. Something like below:

@@ -129,6 +134,7 @@ int mce_gen_pool_extend(void)
    } else {
        num_threads = num_present_cpus();
        len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+       len -= MCE_POOLSZ;

Does this sound good?
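
Putting both of your suggestions together, mce_gen_pool_extend() would
then look roughly like this (a sketch of where I am heading, not the
final patch):

	int mce_gen_pool_extend(void)
	{
		unsigned long addr, size;
		int ret = -ENOMEM;

		/* Budget ~2 records (CPU_GEN_MEMSZ bytes) per present CPU... */
		size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);

		/* ...minus the 2 pages mce_gen_pool_create() already set up. */
		size -= MCE_POOLSZ;

		addr = (unsigned long)kzalloc(size, GFP_KERNEL);
		if (!addr)
			return ret;

		ret = gen_pool_add(mce_evt_pool, addr, size, -1);
		if (ret)
			kfree((void *)addr);

		return ret;
	}

With the num_present_cpus() > 32 gate in mcheck_late_init(), size is
guaranteed to be at least one page here: 33 CPUs * 256 bytes page-aligns
to 3 pages, minus the existing 2.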

-- 
Thanks,
Avadhut Naik

> Sohil
> 
>> +
>> +	if (!addr)
>> +		goto out;
>> +
>> +	ret = gen_pool_add(mce_evt_pool, addr, len, -1);
>> +	if (ret)
>> +		kfree((void *)addr);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>>  static int mce_gen_pool_create(void)
>>  {
>>  	struct gen_pool *tmpp;
> 