Message-ID: <20250519085130.GFaCrxEnZvaoETKrao@fat_crate.local>
Date: Mon, 19 May 2025 10:51:30 +0200
From: Borislav Petkov <bp@...en8.de>
To: Vijay Balakrishna <vijayb@...ux.microsoft.com>
Cc: Tony Luck <tony.luck@...el.com>, Rob Herring <robh@...nel.org>,
Krzysztof Kozlowski <krzk+dt@...nel.org>,
Conor Dooley <conor+dt@...nel.org>,
James Morse <james.morse@....com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Robert Richter <rric@...nel.org>, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, Tyler Hicks <code@...icks.com>,
Marc Zyngier <maz@...nel.org>,
Sascha Hauer <s.hauer@...gutronix.de>,
Lorenzo Pieralisi <lpieralisi@...nel.org>,
devicetree@...r.kernel.org
Subject: Re: [PATCH 1/3] drivers/edac: Add L1 and L2 error detection for A72
On Thu, May 15, 2025 at 05:06:11PM -0700, Vijay Balakrishna wrote:
> Subject: Re: [PATCH 1/3] drivers/edac: Add L1 and L2 error detection for A72
Do
$ git log drivers/edac/
to get inspired about proper commit titles and prefixes.
> From: Sascha Hauer <s.hauer@...gutronix.de>
>
> The Cortex A72 cores have error detection capabilities for
> the L1/L2 Caches, this patch adds a driver for them. The selected errors
Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.
Also, do
$ git grep 'This patch' Documentation/process
for more details.
> to detect/report are by reading CPU/L2 memory error syndrome registers.
>
> Unfortunately there is no robust way to inject errors into the caches,
> so this driver doesn't contain any code to actually test it. It has
> been tested though with code taken from an older version [1] of this
> driver. For reasons stated in thread [1], the error injection code is
> not suitable for mainline, so it is removed from the driver.
>
> [1] https://lore.kernel.org/all/1521073067-24348-1-git-send-email-york.sun@nxp.com/#t
>
> Signed-off-by: Sascha Hauer <s.hauer@...gutronix.de>
> Co-developed-by: Vijay Balakrishna <vijayb@...ux.microsoft.com>
> Signed-off-by: Vijay Balakrishna <vijayb@...ux.microsoft.com>
> ---
> drivers/edac/Kconfig | 8 ++
> drivers/edac/Makefile | 1 +
> drivers/edac/edac_a72.c | 233 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 242 insertions(+)
> create mode 100644 drivers/edac/edac_a72.c
>
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 19ad3c3b675d..7c99bb04b0c4 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -576,4 +576,12 @@ config EDAC_LOONGSON
> errors (CE) only. Loongson-3A5000/3C5000/3D5000/3A6000/3C6000
> are compatible.
>
> +config EDAC_CORTEX_A72
> + tristate "ARM Cortex A72"
> + depends on ARM64
> + help
> + Support for L1/L2 cache error detection for ARM Cortex A72 processor.
> + The detected and reported erros are from reading CPU/L2 memory error
+ The detected and reported erros are from reading CPU/L2 memory error
Unknown word [erros] in Kconfig help text.
Suggestions: ['errors', 'Eros', 'errs', 'euros'...
> + syndrome registers.
> +
> endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index a8f2d8f6c894..835539b5d5af 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -88,3 +88,4 @@ obj-$(CONFIG_EDAC_NPCM) += npcm_edac.o
> obj-$(CONFIG_EDAC_ZYNQMP) += zynqmp_edac.o
> obj-$(CONFIG_EDAC_VERSAL) += versal_edac.o
> obj-$(CONFIG_EDAC_LOONGSON) += loongson_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_A72) += edac_a72.o
I don't know what tree you are preparing your patches against - it should be
this one:
https://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git/log/?h=edac-for-next
but the indentation level here is wrong:
obj-$(CONFIG_EDAC_ZYNQMP)^I^I+= zynqmp_edac.o$
obj-$(CONFIG_EDAC_VERSAL)^I^I+= versal_edac.o$
obj-$(CONFIG_EDAC_LOONGSON)^I^I+= loongson_edac.o$
obj-$(CONFIG_EDAC_CORTEX_A72)^I+= edac_a72.o$
^^^
after I apply your patch.
> diff --git a/drivers/edac/edac_a72.c b/drivers/edac/edac_a72.c
> new file mode 100644
> index 000000000000..13acd7e7cef0
> --- /dev/null
> +++ b/drivers/edac/edac_a72.c
> @@ -0,0 +1,233 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Cortex A72 EDAC L1 and L2 cache error detection
> + *
> + * Copyright (c) 2020 Pengutronix, Sascha Hauer <s.hauer@...gutronix.de>
> + *
> + * Based on Code from:
> + * Copyright (c) 2018, NXP Semiconductor
> + * Author: York Sun <york.sun@....com>
> + *
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/bitfield.h>
> +#include <asm/smp_plat.h>
> +
> +#include "edac_module.h"
> +
> +#define DRVNAME "edac-a72"
> +
> +#define CPUMERRSR_EL1_RAMID GENMASK(30, 24)
> +
> +#define CPUMERRSR_EL1_VALID BIT(31)
> +#define CPUMERRSR_EL1_FATAL BIT(63)
> +
> +#define L1_I_TAG_RAM 0x00
> +#define L1_I_DATA_RAM 0x01
> +#define L1_D_TAG_RAM 0x08
> +#define L1_D_DATA_RAM 0x09
> +#define TLB_RAM 0x18
> +
> +#define L2MERRSR_EL1_CPUID_WAY GENMASK(21, 18)
> +
> +#define L2MERRSR_EL1_VALID BIT(31)
> +#define L2MERRSR_EL1_FATAL BIT(63)
> +
> +struct merrsr {
> + u64 cpumerr;
> + u64 l2merr;
> +};
That struct needs a more understandable name. "merrsr" doesn't tell me
anything.
> +
> +#define MESSAGE_SIZE 64
> +
> +#define SYS_CPUMERRSR_EL1 sys_reg(3, 1, 15, 2, 2)
> +#define SYS_L2MERRSR_EL1 sys_reg(3, 1, 15, 2, 3)
Please group all defines together, align them vertically and then put other
definitions below. Look at other drivers for inspiration.
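As a rough illustration of the grouping and alignment asked for above, here is a minimal sketch that builds outside the kernel. BIT_ULL() and GENMASK_ULL() are userspace stand-ins for the kernel macros of the same name, and the cpumerr_*() and l2merr_*() helpers are hypothetical names used only for this example; the bit positions are the ones from the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins (assumption) for the kernel's BIT_ULL()/GENMASK_ULL(). */
#define BIT_ULL(n)              (UINT64_C(1) << (n))
#define GENMASK_ULL(h, l)       ((~UINT64_C(0) >> (63 - (h))) & (~UINT64_C(0) << (l)))

/* All register and field defines in one block, values in one column. */
#define CPUMERRSR_EL1_RAMID     GENMASK_ULL(30, 24)
#define CPUMERRSR_EL1_VALID     BIT_ULL(31)
#define CPUMERRSR_EL1_FATAL     BIT_ULL(63)
#define L2MERRSR_EL1_CPUID_WAY  GENMASK_ULL(21, 18)
#define L2MERRSR_EL1_VALID      BIT_ULL(31)
#define L2MERRSR_EL1_FATAL      BIT_ULL(63)

#define L1_I_TAG_RAM            0x00
#define L1_I_DATA_RAM           0x01
#define L1_D_TAG_RAM            0x08
#define L1_D_DATA_RAM           0x09
#define TLB_RAM                 0x18

#define MESSAGE_SIZE            64

/* Other definitions (structs, helpers) follow the define block. */

/* Hypothetical helper: extract RAMID, as FIELD_GET() does in the patch. */
static inline unsigned int cpumerr_ramid(uint64_t cpumerr)
{
        return (unsigned int)((cpumerr & CPUMERRSR_EL1_RAMID) >> 24);
}

/* Hypothetical helper: true if the syndrome register holds a valid record. */
static inline bool cpumerr_valid(uint64_t cpumerr)
{
        return (cpumerr & CPUMERRSR_EL1_VALID) != 0;
}

/* Hypothetical helper: true if the recorded error is flagged fatal. */
static inline bool cpumerr_fatal(uint64_t cpumerr)
{
        return (cpumerr & CPUMERRSR_EL1_FATAL) != 0;
}

/* Hypothetical helper: extract the CPUID/WAY field of L2MERRSR_EL1. */
static inline unsigned int l2merr_cpuid_way(uint64_t l2merr)
{
        return (unsigned int)((l2merr & L2MERRSR_EL1_CPUID_WAY) >> 18);
}
```

In the driver itself the kernel's own BIT(), GENMASK() and FIELD_GET() would of course be used; the point here is only the layout.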
> +
> +static struct cpumask compat_mask;
> +
> +static void report_errors(struct edac_device_ctl_info *edac_ctl, int cpu,
> + struct merrsr *merrsr)
> +{
> + char msg[MESSAGE_SIZE];
> + u64 cpumerr = merrsr->cpumerr;
> + u64 l2merr = merrsr->l2merr;
The edac-tree preferred ordering of variable declarations at the
beginning of a function is reverse fir tree order::

  struct long_struct_name *descriptive_name;
  unsigned long foo, bar;
  unsigned int tmp;
  int ret;

The above is faster to parse than the reverse ordering::

  int ret;
  unsigned int tmp;
  unsigned long foo, bar;
  struct long_struct_name *descriptive_name;

And even more so than random ordering::

  unsigned long foo, bar;
  int ret;
  struct long_struct_name *descriptive_name;
  unsigned int tmp;

Please check all your functions.
> +
> + if (cpumerr & CPUMERRSR_EL1_VALID) {
> + const char *str;
> + bool fatal = cpumerr & CPUMERRSR_EL1_FATAL;
> +
> + switch (FIELD_GET(CPUMERRSR_EL1_RAMID, cpumerr)) {
> + case L1_I_TAG_RAM:
> + str = "L1-I Tag RAM";
> + break;
> + case L1_I_DATA_RAM:
> + str = "L1-I Data RAM";
> + break;
> + case L1_D_TAG_RAM:
> + str = "L1-D Tag RAM";
> + break;
> + case L1_D_DATA_RAM:
> + str = "L1-D Data RAM";
> + break;
> + case TLB_RAM:
> + str = "TLB RAM";
> + break;
> + default:
> + str = "Unspecified";
> + break;
> + }
> +
> + snprintf(msg, MESSAGE_SIZE, "%s %s error(s) on CPU %d",
> + str, fatal ? "fatal" : "correctable", cpu);
> +
> + if (fatal)
> + edac_device_handle_ue(edac_ctl, cpu, 0, msg);
> + else
> + edac_device_handle_ce(edac_ctl, cpu, 0, msg);
> + }
> +
> + if (l2merr & L2MERRSR_EL1_VALID) {
> + bool fatal = l2merr & L2MERRSR_EL1_FATAL;
> +
> + snprintf(msg, MESSAGE_SIZE, "L2 %s error(s) on CPU %d CPUID/WAY 0x%lx",
> + fatal ? "fatal" : "correctable", cpu,
> + FIELD_GET(L2MERRSR_EL1_CPUID_WAY, l2merr));
> + if (fatal)
> + edac_device_handle_ue(edac_ctl, cpu, 1, msg);
> + else
> + edac_device_handle_ce(edac_ctl, cpu, 1, msg);
> + }
> +}
> +
> +static void read_errors(void *data)
> +{
> + struct merrsr *merrsr = data;
> +
> + merrsr->cpumerr = read_sysreg_s(SYS_CPUMERRSR_EL1);
> + if (merrsr->cpumerr & CPUMERRSR_EL1_VALID) {
> + write_sysreg_s(0, SYS_CPUMERRSR_EL1);
> + isb();
> + }
> + merrsr->l2merr = read_sysreg_s(SYS_L2MERRSR_EL1);
> + if (merrsr->l2merr & L2MERRSR_EL1_VALID) {
> + write_sysreg_s(0, SYS_L2MERRSR_EL1);
> + isb();
> + }
> +}
> +
> +static void cortex_arm64_edac_check(struct edac_device_ctl_info *edac_ctl)
Static functions don't need the "cortex_arm64_" prefix.
> +{
> + struct merrsr merrsr;
> + int cpu;
I'd venture a guess you need to protect here against CPU hotplug...
> + for_each_cpu_and(cpu, cpu_online_mask, &compat_mask) {
> + smp_call_function_single(cpu, read_errors, &merrsr, true);
> + report_errors(edac_ctl, cpu, &merrsr);
> + }
> +}
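A minimal sketch of what the hotplug protection hinted at above might look like, assuming the check callback runs in a context that may sleep and that holding the CPU hotplug read lock via cpus_read_lock()/cpus_read_unlock() is sufficient here. struct merrsr, compat_mask, read_errors() and report_errors() are the ones from the quoted patch; the shortened function name edac_check() is only illustrative. This is a kernel fragment, not a standalone program:

```c
#include <linux/cpu.h>		/* cpus_read_lock(), cpus_read_unlock() */
#include <linux/smp.h>		/* smp_call_function_single() */

/* struct merrsr, compat_mask, read_errors(), report_errors(): as in the patch. */
static void edac_check(struct edac_device_ctl_info *edac_ctl)
{
	struct merrsr merrsr;
	int cpu;

	/* Assumption: keep CPUs from going offline while they are polled. */
	cpus_read_lock();
	for_each_cpu_and(cpu, cpu_online_mask, &compat_mask) {
		smp_call_function_single(cpu, read_errors, &merrsr, true);
		report_errors(edac_ctl, cpu, &merrsr);
	}
	cpus_read_unlock();
}
```

Whether this is the right level of protection depends on how the callback is invoked, so treat it as a starting point for discussion rather than the fix.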
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> + struct edac_device_ctl_info *edac_ctl;
> + struct device *dev = &pdev->dev;
> + int rc;
> +
> + edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> + num_possible_cpus(), "L", 2, 1,
> + edac_device_alloc_index());
> + if (!edac_ctl)
> + return -ENOMEM;
> +
> + edac_ctl->edac_check = cortex_arm64_edac_check;
> + edac_ctl->dev = dev;
> + edac_ctl->mod_name = dev_name(dev);
> + edac_ctl->dev_name = dev_name(dev);
> + edac_ctl->ctl_name = DRVNAME;
> + dev_set_drvdata(dev, edac_ctl);
> +
> + rc = edac_device_add_device(edac_ctl);
> + if (rc)
> + goto out_dev;
> +
> + return 0;
> +
> +out_dev:
> + edac_device_free_ctl_info(edac_ctl);
> +
> + return rc;
> +}
> +
> +static void cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> + struct edac_device_ctl_info *edac_ctl = dev_get_drvdata(&pdev->dev);
> +
> + edac_device_del_device(edac_ctl->dev);
> + edac_device_free_ctl_info(edac_ctl);
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> + { .compatible = "arm,cortex-a72" },
> + {}
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> + .probe = cortex_arm64_edac_probe,
> + .remove = cortex_arm64_edac_remove,
> + .driver = {
> + .name = DRVNAME,
> + },
> +};
> +
> +static int __init cortex_arm64_edac_driver_init(void)
> +{
> + struct device_node *np;
> + int cpu;
> + struct platform_device *pdev;
> + int err;
> +
> + for_each_possible_cpu(cpu) {
> + np = of_get_cpu_node(cpu, NULL);
> +
^ Superfluous newline.
> + if (!np) {
> + pr_warn("failed to find device node for cpu %d\n", cpu);
In visible strings s/cpu/CPU/g
> + continue;
> + }
> + if (!of_match_node(cortex_arm64_edac_of_match, np))
> + continue;
> + if (!of_property_read_bool(np, "edac-enabled"))
> + continue;
> + cpumask_set_cpu(cpu, &compat_mask);
> + of_node_put(np);
> + }
> +
> + if (cpumask_empty(&compat_mask))
> + return 0;
> +
> + err = platform_driver_register(&cortex_arm64_edac_driver);
> + if (err)
> + return err;
> +
> + pdev = platform_device_register_simple(DRVNAME, -1, NULL, 0);
> + if (IS_ERR(pdev)) {
> + pr_err("failed to register cortex arm64 edac device\n");
That driver is called edac_a72 now.
"cortex arm64 edac" is too broad.
> + platform_driver_unregister(&cortex_arm64_edac_driver);
> + return PTR_ERR(pdev);
> + }
> +
> + return 0;
> +}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette