Message-ID: <20200616192952.GO13515@zn.tnic>
Date: Tue, 16 Jun 2020 21:29:52 +0200
From: Borislav Petkov <bp@...en8.de>
To: Tony Luck <tony.luck@...el.com>
Cc: Youquan Song <youquan.song@...el.com>, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
On Mon, Jun 15, 2020 at 11:40:56AM -0700, Tony Luck wrote:
> From: Youquan Song <youquan.song@...el.com>
>
> Skylake has a mode where the system administrator can use a BIOS setup
> option to request that the memory controller report uncorrected errors
> found by the patrol scrubber as corrected. This results in them being
> signalled using CMCI, which is less disruptive than a machine check.
>
> Add a quirk to detect that a "corrected" error is actually a downgraded
> uncorrected error with model specific checks for the "MSCOD" signature in
> MCi_STATUS and that the error was reported from a memory controller bank.
>
> Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take
> the affected page offline.
>
> [Tony: Wordsmith commit comment]
>
> Signed-off-by: Youquan Song <youquan.song@...el.com>
> Signed-off-by: Tony Luck <tony.luck@...el.com>
> ---
> arch/x86/kernel/cpu/mce/core.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index e9265e2f28c9..0dbd0a21a0bf 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -123,6 +123,8 @@ static struct irq_work mce_irq_work;
>
> static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
>
> +static void no_adjust_mce_log(struct mce *m) {};
> +static void (*adjust_mce_log)(struct mce *m) = no_adjust_mce_log;
> /*
> * CPU/chipset specific EDAC code can register a notifier call here to print
> * MCE errors in a human-readable form.
> @@ -772,6 +774,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> if (mca_cfg.dont_log_ce && !mce_usable_address(&m))
> goto clear_it;
>
> + adjust_mce_log(&m);
> mce_log(&m);
Two things: can that error type be detected when #MC gets raised, i.e., in
do_machine_check() as part of scanning all banks?
If so, then the adjusting needs to happen inside mce_log().
Also, that assignment to the function pointer doesn't make much sense to
me. I think you should do the vendor/family/model checking directly
in a function adjust_mce_log() which gets called by whoever needs it...
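A rough userspace sketch of the suggested shape, for illustration only: the vendor/family/model, MSCOD and bank checks sit in one plain adjust_mce_log() that mce_log() calls, so there is no function pointer to assign. Everything here is hypothetical. struct cpu_info, the VENDOR_*, QUIRK_*, MSCOD_PATROL_SCRUB and MC_BANK_* constants, and the severity field are placeholders, not the real Skylake signature or the kernel's struct mce. Real kernel code would presumably consult boot_cpu_data instead of taking a CPU-info parameter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for kernel CPU info and struct mce. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD };

struct cpu_info {
	enum cpu_vendor vendor;
	unsigned int family;
	unsigned int model;
};

struct mce {
	uint64_t status;	/* MCi_STATUS value */
	int bank;		/* reporting bank */
	int severity;		/* hypothetical: severity to act on */
};

/* Placeholder values, NOT the real Skylake signature or bank range. */
#define QUIRK_FAMILY		6u
#define QUIRK_MODEL		0x55u
#define MSCOD_MASK		0xffff0000ull	/* MSCOD is bits 31:16 */
#define MSCOD_PATROL_SCRUB	0x00100000ull
#define MC_BANK_FIRST		13
#define MC_BANK_LAST		18
#define SEV_KEEP		0
#define SEV_AO			1

/* All quirk checks in one plain function: nothing to assign at boot. */
static void adjust_mce_log(const struct cpu_info *c, struct mce *m)
{
	assert(c != NULL && m != NULL);

	/* Only the affected CPU model gets the quirk. */
	if (c->vendor != VENDOR_INTEL || c->family != QUIRK_FAMILY ||
	    c->model != QUIRK_MODEL)
		return;

	/* MSCOD must match the (placeholder) downgraded-UC signature. */
	if ((m->status & MSCOD_MASK) != MSCOD_PATROL_SCRUB)
		return;

	/* Only (placeholder) memory controller banks are affected. */
	if (m->bank < MC_BANK_FIRST || m->bank > MC_BANK_LAST)
		return;

	/* Upgrade to action-optional so the page can be offlined. */
	m->severity = SEV_AO;
}

/* Adjusting in the common logging path covers every caller of mce_log(). */
static void mce_log(const struct cpu_info *c, struct mce *m)
{
	adjust_mce_log(c, m);
	printf("bank %d: status 0x%016llx severity %d\n",
	       m->bank, (unsigned long long)m->status, m->severity);
}
```

Calling the check directly keeps the quirk visible at the call site, and on non-matching CPUs it costs only a few early-return comparisons.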
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette