linux-kernel - Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ec7dc7e7-5808-58f8-cbe9-d8fdd2de4c35@amd.com>
Date:   Tue, 5 Apr 2022 12:24:31 -0500
From:   Carlos Bilbao <carlos.bilbao@....com>
To:     Yazen Ghannam <yazen.ghannam@....com>
Cc:     bp@...en8.de, tglx@...utronix.de, mingo@...hat.com,
        dave.hansen@...ux.intel.com, x86@...nel.org,
        linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
        bilbao@...edu
Subject: Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with
 new types of errors

On 4/5/2022 12:18 PM, Yazen Ghannam wrote:
> On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
>> The MCE handler needs to understand the severity of the machine errors to
>> act accordingly. In the case of AMD, very few errors are covered in the
>> grading logic.
>>
>> Extend the MCEs severity grading of AMD to cover new types of machine
>> errors.
>>
> 
> This patch does not add new types of machine errors. Please update the commit
> message (and cover letter) to be consistent with changes made between patch
> revisions.
>  

I'm thinking "cover error cases not previously considered".

>> Signed-off-by: Carlos Bilbao <carlos.bilbao@....com>
>> ---
>>  arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
>>  1 file changed, 37 insertions(+), 67 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>> index 1add86935349..4d52eef21230 100644
>> --- a/arch/x86/kernel/cpu/mce/severity.c
>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>> @@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>  	}
>>  }
>>  
>> -static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
>> -{
>> -	u64 mcx_cfg;
>> -
>> -	/*
>> -	 * We need to look at the following bits:
>> -	 * - "succor" bit (data poisoning support), and
>> -	 * - TCC bit (Task Context Corrupt)
>> -	 * in MCi_STATUS to determine error severity.
>> -	 */
>> -	if (!mce_flags.succor)
>> -		return MCE_PANIC_SEVERITY;
>> -
>> -	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
>> -
>> -	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
>> -	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
>> -	    (m->status & MCI_STATUS_TCC) &&
>> -	    (err_ctx == IN_KERNEL))
>> -		return MCE_PANIC_SEVERITY;
>> -
>> -	 /* ...otherwise invoke hwpoison handler. */
>> -	return MCE_AR_SEVERITY;
>> -}
>> -
>>  /*
>> - * See AMD Error Scope Hierarchy table in a newer BKDG. For example
>> - * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
>> + * See AMD PPR(s) section 3.1 Machine Check Architecture
> 
> I don't know that section numbers will be consistent between different PPR
> versions, so having the section name is a good idea. The "Machine Check Error
> Handling" section is what the severity grading function is based on.
> 

Ack

>>   */
>>  static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
>>  {
>> -	enum context ctx = error_context(m, regs);
>> +	int ret;
>> +
>> +	/*
>> +	 * Default return value: Action required, the error must be handled
>> +	 * immediately.
>> +	 */
>> +	ret = MCE_AR_SEVERITY;
>>  
>>  	/* Processor Context Corrupt, no need to fumble too much, die! */
>> -	if (m->status & MCI_STATUS_PCC)
>> -		return MCE_PANIC_SEVERITY;
>> +	if (m->status & MCI_STATUS_PCC) {
>> +		ret = MCE_PANIC_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -	if (m->status & MCI_STATUS_UC) {
>> +	/*
>> +	 * Evaluate the severity of deferred errors for AMD systems, for which only
>> +	 * scrub error is interesting to notify an action requirement. The poll
>> +	 * handler catches deferred errors and adds to mce_ring so memorty-failure
>> +	 * can take recovery actions.
>> +	 */
> 
> I think this whole comment can be dropped. The "scrub error" part is not
> correct. The polling function may find deferred errors, but they are most
> likely to be see by the deferred error interrupt handler on modern AMD
> systems. The "mce_ring" was removed a long time ago (in v4.3).
> 

Ack

>> +	if (m->status & MCI_STATUS_DEFERRED) {
>> +		ret = MCE_DEFERRED_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -		if (ctx == IN_KERNEL)
>> -			return MCE_PANIC_SEVERITY;
>> +	/* If the UC bit is not set, the error has been corrected */
> 
> This comment is not true. Deferred errors are an example of an uncorrectable
> error where UC is not set.
> 

Ack

>> +	if (!(m->status & MCI_STATUS_UC)) {
>> +		ret = MCE_KEEP_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -		/*
>> -		 * On older systems where overflow_recov flag is not present, we
>> -		 * should simply panic if an error overflow occurs. If
>> -		 * overflow_recov flag is present and set, then software can try
>> -		 * to at least kill process to prolong system operation.
>> -		 */
>> -		if (mce_flags.overflow_recov) {
>> -			if (mce_flags.smca)
>> -				return mce_severity_amd_smca(m, ctx);
>> -
>> -			/* kill current process */
>> -			return MCE_AR_SEVERITY;
>> -		} else {
>> -			/* at least one error was not logged */
>> -			if (m->status & MCI_STATUS_OVER)
>> -				return MCE_PANIC_SEVERITY;
>> -		}
>> -
>> -		/*
>> -		 * For any other case, return MCE_UC_SEVERITY so that we log the
>> -		 * error and exit #MC handler.
>> -		 */
>> -		return MCE_UC_SEVERITY;
>> +	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
>> +	     || !mce_flags.succor) {
> 
> I appreciate merged two cases together that have the same result. But I feel
> keeping them separate may be easier to follow. They can also each have their
> own code comments. Or keep them together and explain each within the same
> comment block.
> 

I will divide these two cases.

> Also, there's a checkpatch "CHECK" here. You'll see it when using the
> "--strict" flag with checkpatch.
> 
>> +		ret = MCE_PANIC_SEVERITY;
>> +		goto amd_severity;
>>  	}
>>  
>> -	/*
>> -	 * deferred error: poll handler catches these and adds to mce_ring so
>> -	 * memory-failure can take recovery actions.
>> -	 */
>> -	if (m->status & MCI_STATUS_DEFERRED)
>> -		return MCE_DEFERRED_SEVERITY;
>> +	if (error_context(m, regs) == IN_KERNEL) {
>> +		ret = MCE_PANIC_SEVERITY;
>> +	}
> 
> Braces aren't needed here. The previous comment about braces was for when
> there's a block of "if/else-if/else" statements. A single "if" statement with
> a single line doesn't need braces.
> 

Ack

>>  
>> -	/*
>> -	 * corrected error: poll handler catches these and passes responsibility
>> -	 * of decoding the error to EDAC
>> -	 */
>> -	return MCE_KEEP_SEVERITY;
>> +amd_severity:
> 
> This label doesn't look right to me. Maybe I'm too used to seeing "out" and
> "err" labels.
> 
> Please see "Documentation/process/coding-style.rst" section (7) "Centralized
> exiting of functions".
> 
> Maybe something like "out_ret_severity" to indicate the code is going to exit
> and return the severity. Or maybe just use "out"? Maybe others have thoughts
> on this.
> 

"out_amd_severity" sounds good to me.

> Thanks,
> Yazen

Will send updated pachset. 

Thanks,
Carlos