Message-ID: <20210326190242.GI25229@zn.tnic>
Date: Fri, 26 Mar 2021 20:02:42 +0100
From: Borislav Petkov <bp@...en8.de>
To: William Roche <william.roche@...cle.com>
Cc: linux-kernel@...r.kernel.org, Tony Luck <tony.luck@...el.com>,
linux-edac@...r.kernel.org
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event
filtering
On Fri, Mar 26, 2021 at 02:30:29PM -0400, William Roche wrote:
> From: William Roche <william.roche@...cle.com>
>
> The Corrected Error events collected by the cec_add_elem() have to be
> consistently filtered out.
> We fix the case where the value of find_elem() to find the slot of a pfn
> was mistakenly used as the return value of the function.
> Now the MCE notifiers chain relying on MCE_HANDLED_CEC would only report
> filtered corrected errors that reached the action threshold.
>
> Signed-off-by: William Roche <william.roche@...cle.com>
> ---
>
> Notes:
> Some machines are reporting Corrected Errors events without any
> information about a PFN Soft-offlining or Invalid pfn (report given by
> the EDAC module or the mcelog daemon).
>
> Investigation showed that this reflects the first occurrence of a CE
> on the system, which should have been filtered by the RAS_CEC component.
> We also noticed that if 2 PFNs are impacted by CE errors, the PFN
> on the non-zero slot gets its CE errors reported every time instead of
> being filtered out.
>
> This problem has appeared with the introduction of commit
> de0e0624d86ff9fc512dedb297f8978698abf21a where the filtering logic has
> been modified.
>
> Could you please review this small suggested fix?
>
> Thanks in advance for any feedback you could have.
> William.
>
> drivers/ras/cec.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
AFAIU, you want something like the untested hunk below:

It sets ret to 0 in the branch where find_elem() cannot find an element
and the pfn gets inserted. The "ret = 1" we can remove because callers
don't care about the offlining threshold - the only caller that looks at
the retval wants to know whether the VA was added successfully, so that
it can note that it handled the error.

Makes sense?
---
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25b5dd4..a29994d726d8 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -341,6 +341,8 @@ static int cec_add_elem(u64 pfn)
 		ca->array[to] = pfn << PAGE_SHIFT;
 		ca->n++;
+
+		ret = 0;
 	}
 	/* Add/refresh element generation and increment count */
@@ -363,12 +365,6 @@ static int cec_add_elem(u64 pfn)
 		del_elem(ca, to);
-		/*
-		 * Return a >0 value to callers, to denote that we've reached
-		 * the offlining threshold.
-		 */
-		ret = 1;
-
 		goto unlock;
 	}
---
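To make the return-value contract above concrete, here is a minimal
userspace sketch. It is not the kernel code: every name in it (toy_cec,
toy_find_elem, toy_add_elem, toy_notifier_handled, MAX_ELEMS,
ACTION_THRESHOLD) is hypothetical, and it assumes, as the patch description
suggests, that the lookup helper returns the slot index on a hit. The add
function returns 0 whenever the pfn was recorded, so neither the slot index
nor the threshold state leaks out to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_ELEMS        8	/* hypothetical capacity for this sketch */
#define ACTION_THRESHOLD 3	/* hypothetical fixed threshold */

/* Toy stand-in for the CEC array: pfns kept sorted, one counter each. */
struct toy_cec {
	unsigned long pfn[MAX_ELEMS];
	unsigned int count[MAX_ELEMS];
	size_t n;
	unsigned int offlined;
};

/*
 * Binary search. Returns the slot index on a hit, -ENOENT on a miss.
 * *to is always set to the slot where pfn is, or should be inserted.
 */
static int toy_find_elem(const struct toy_cec *ca, unsigned long pfn, size_t *to)
{
	size_t lo = 0, hi = ca->n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ca->pfn[mid] < pfn)
			lo = mid + 1;
		else
			hi = mid;
	}
	*to = lo;
	return (lo < ca->n && ca->pfn[lo] == pfn) ? (int)lo : -ENOENT;
}

/* Returns 0 if the error was recorded, <0 otherwise. Never a slot index. */
static int toy_add_elem(struct toy_cec *ca, unsigned long pfn)
{
	size_t to;

	if (toy_find_elem(ca, pfn, &to) < 0) {
		if (ca->n == MAX_ELEMS)
			return -ENOSPC;	/* sketch only: no eviction */
		assert(to <= ca->n);
		/* Shift [to, n) up by one to make room for the new pfn. */
		memmove(&ca->pfn[to + 1], &ca->pfn[to],
			(ca->n - to) * sizeof(ca->pfn[0]));
		memmove(&ca->count[to + 1], &ca->count[to],
			(ca->n - to) * sizeof(ca->count[0]));
		ca->pfn[to] = pfn;
		ca->count[to] = 0;
		ca->n++;
	}

	ca->count[to]++;
	if (ca->count[to] >= ACTION_THRESHOLD) {
		/* Stand-in for soft-offlining: drop the element. */
		memmove(&ca->pfn[to], &ca->pfn[to + 1],
			(ca->n - to - 1) * sizeof(ca->pfn[0]));
		memmove(&ca->count[to], &ca->count[to + 1],
			(ca->n - to - 1) * sizeof(ca->count[0]));
		ca->n--;
		ca->offlined++;
	}
	return 0;
}

/* Caller side: the error counts as handled only when add returned 0. */
static int toy_notifier_handled(struct toy_cec *ca, unsigned long pfn)
{
	return toy_add_elem(ca, pfn) == 0;
}
```

With this shape, a first occurrence and a hit on a non-zero slot both come
back as 0, which are the two cases the patch description says were being
reported instead of filtered.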
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette