linux-kernel - Re: [PATCH] Add support of NVDIMM memory error notification in ACPI 6.2

Message-ID: <CAPcyv4i7bLU17QEmdUBQrtWP3AZxPRyKK0NN105XrTj8K3nAAQ@mail.gmail.com>
Date:   Wed, 7 Jun 2017 12:09:38 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Toshi Kani <toshi.kani@....com>
Cc:     "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Vishal L Verma <vishal.l.verma@...el.com>,
        "linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
        Linux ACPI <linux-acpi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] Add support of NVDIMM memory error notification in ACPI 6.2

On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@....com> wrote:
> ACPI 6.2 defines a new ACPI notification value to NVDIMM Root Device
> in Table 5-169.
>
>  0x81 Unconsumed Uncorrectable Memory Error Detected
>       Used to pro-actively notify OSPM of uncorrectable memory errors
>       detected (for example a memory scrubbing engine that continuously
>       scans the NVDIMMs memory). This is an optional notification. Only
>       locations that were mapped in to SPA by the platform will generate
>       a notification.
>
> Add support of this notification value by initiating an ARS scan. This
> will find new error locations and add their badblocks information.
>
> Link: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> Signed-off-by: Toshi Kani <toshi.kani@....com>
> Cc: Dan Williams <dan.j.williams@...el.com>
> Cc: Rafael J. Wysocki <rjw@...ysocki.net>
> Cc: Vishal Verma <vishal.l.verma@...el.com>
> ---
>  drivers/acpi/nfit/core.c |   28 ++++++++++++++++++++++------
>  drivers/acpi/nfit/nfit.h |    1 +
>  2 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 656acb5..cc22778 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2967,7 +2967,7 @@ static int acpi_nfit_remove(struct acpi_device *adev)
>         return 0;
>  }
>
> -void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event)
> +static void acpi_nfit_update_notify(struct device *dev, acpi_handle handle)
>  {
>         struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);
>         struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL };
> @@ -2975,11 +2975,6 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event)
>         acpi_status status;
>         int ret;
>
> -       dev_dbg(dev, "%s: event: %d\n", __func__, event);
> -
> -       if (event != NFIT_NOTIFY_UPDATE)
> -               return;
> -
>         if (!dev->driver) {
>                 /* dev->driver may be null if we're being removed */
>                 dev_dbg(dev, "%s: no driver found for dev\n", __func__);
> @@ -3016,6 +3011,27 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event)
>                 dev_err(dev, "Invalid _FIT\n");
>         kfree(buf.pointer);
>  }
> +
> +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle)
> +{
> +       struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);
> +
> +       acpi_nfit_ars_rescan(acpi_desc);

I wonder if we should gate re-scanning on the same check that we use
in the mce notification case:

    if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON)

Maybe not, since without a rescan we get no indication of where the
error is. However, at a minimum I think we need to add support for the
new Start ARS flag ("If set to 1 the firmware shall return data from a
previous scrub, if any, without starting a new scrub") and use it for
this case.
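
To make the two options concrete, here is a rough userspace model, not
kernel code. Every ex_* name is a hypothetical stand-in rather than an
existing nfit symbol, and the bit chosen for the "return previous data"
flag is a placeholder that would have to be checked against the Start
ARS definition in ACPI 6.2. The gate is left as a parameter because it
is still an open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's scrub mode setting. */
enum ex_scrub_mode { EX_SCRUB_OFF, EX_SCRUB_ON };

/* Placeholder bit for "return data from a previous scrub, if any". */
#define EX_ARS_FLAG_RETURN_PREV (1u << 1)

/* Hypothetical, heavily reduced model of the descriptor. */
struct ex_desc {
	enum ex_scrub_mode scrub_mode;
	unsigned int last_ars_flags;	/* flags of the last ARS request */
	int ars_requests;		/* number of ARS requests issued */
};

/* Model of issuing a Start ARS request with the given flags. */
static void ex_ars_rescan(struct ex_desc *desc, unsigned int flags)
{
	desc->last_ars_flags = flags;
	desc->ars_requests++;
}

/*
 * Model of the 0x81 notification handler. Returns true if an ARS
 * request was issued. When gate_on_scrub_mode is set, the request is
 * skipped unless scrubbing on hardware errors is enabled.
 */
static bool ex_uc_error_notify(struct ex_desc *desc, bool gate_on_scrub_mode)
{
	assert(desc != NULL);

	if (gate_on_scrub_mode && desc->scrub_mode != EX_SCRUB_ON)
		return false;

	/* Ask firmware for results it already has, not a new scrub. */
	ex_ars_rescan(desc, EX_ARS_FLAG_RETURN_PREV);
	return true;
}
```

With the gate enabled and scrub mode off, the notification is dropped
and no error locations are learned, which is the downside noted above.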

Another thing that seems to be missing in both this and the mce case
is a notification to userspace that something changed. We already call
sysfs_notify_dirent() to signal scrub completion events and DIMM
health status change events; I think we need a similar notifier
mechanism for new uncorrectable errors.
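
For reference, this is roughly how a userspace consumer waits on an
attribute that the kernel signals through sysfs_notify_dirent(): read
the attribute once, then poll() for POLLPRI. The sketch below is only
an illustration. The helper name is made up, and which attribute would
carry the new-error notification is undecided, so the caller has to
supply the path.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a sysfs notification on the attribute at 'path'.
 * Returns 1 if notified, 0 on timeout, -1 on error.
 * ex_wait_for_sysfs_notify() is a hypothetical helper name.
 */
static int ex_wait_for_sysfs_notify(const char *path, int timeout_ms)
{
	char buf[64];
	struct pollfd pfd;
	int fd, rc;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* An initial read is needed before poll() will block. */
	if (read(fd, buf, sizeof(buf)) < 0) {
		close(fd);
		return -1;
	}

	pfd.fd = fd;
	pfd.events = POLLPRI | POLLERR;
	pfd.revents = 0;

	rc = poll(&pfd, 1, timeout_ms);
	if (rc > 0) {
		/* Rewind and re-read to pick up the new value. */
		if (lseek(fd, 0, SEEK_SET) < 0 ||
		    read(fd, buf, sizeof(buf)) < 0)
			rc = -1;
	}

	close(fd);
	if (rc < 0)
		return -1;
	return rc > 0 ? 1 : 0;
}
```

A monitoring daemon could call this in a loop and re-read the badblocks
list each time it returns 1.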