Message-ID: <d57d786c-f9cb-46ba-78d0-3675666272f2@arm.com>
Date: Fri, 4 Jun 2021 15:19:01 +0100
From: James Morse <james.morse@....com>
To: Xiaofei Tan <tanxiaofei@...wei.com>, rafael@...nel.org,
rjw@...ysocki.net, lenb@...nel.org, tony.luck@...el.com,
bp@...en8.de, akpm@...ux-foundation.org, jroedel@...e.de,
peterz@...radead.org
Cc: linux-acpi@...r.kernel.org, linux-kernel@...r.kernel.org,
linuxarm@...wei.com
Subject: Re: [PATCH v5] ACPI / APEI: fix the regression of synchronous
external aborts occur in user-mode
Hi Xiaofei Tan,
Sorry for the delayed response,
this still applies and builds to v5.13-rc4.
On 10/12/2020 12:09, Xiaofei Tan wrote:
> After the commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
> synchronise with APEI's irq work") applied, do_sea() return directly
> for user-mode if apei_claim_sea() handled any error record. Therefore,
> each error record reported by the user-mode SEA must be effectively
> processed in APEI GHES driver.
If you describe it the other way round, it would be clearer what the problem here is.
Something like:
| Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea() synchronise
| with APEI's irq work"), do_sea() would unconditionally signal the affected task
| from the arch code. Since that change, the GHES driver sends the signals.
| This exposes a problem as errors the GHES driver doesn't understand are silently
| ignored.
> Currently, GHES driver only processes Memory Error Section.(Ignore PCIe
> Error Section, as it has nothing to do with SEA).
(you're starting to confuse me! - I went and checked before I realised you were talking to
me, not describing the code...)
> It is not enough.
> Because ARM Processor Error could also be used for SEA in some hardware
> platforms, such as Kunpeng9xx series. We can't ask them to switch to
> use Memory Error Section for two reasons:
> 1)The server was delivered to customers, and it will introduce
> compatibility issue.
> 2)It make sense to use ARM Processor Error Section. Because either
> cache or memory errors could generate SEA when consumed by a processor.
I think you just need to say:
| Existing firmware on Kunpeng9xx systems reports cache errors with the 'ARM Processor
| Error' CPER records.
Could you add something about why the silent-ignore is a problem? Do the errors get taken
again? Does user-space get stuck in this loop?
> Do memory failure handling for ARM Processor Error Section just like
> for Memory Error Section.
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index fce7ade..0893968 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +{
> + struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> + struct cper_arm_err_info *err_info;
> + bool queued = false;
> + int sec_sev, i;
> +
> + log_arm_hw_error(err);
> +
> + sec_sev = ghes_severity(gdata->error_severity);
> + if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
> + return false;
> +
> + err_info = (struct cper_arm_err_info *) (err + 1);
> + for (i = 0; i < err->err_info_num; i++, err_info++) {
err_info has a version and a length, so it's expected to be made bigger at some point.
It would be better to use the length instead of 'err_info++', or at least to break out of
the loop if a length > sizeof(*err_info) is seen.
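As a rough illustration of walking by length, here is a standalone userspace sketch. The
struct demo_err_info layout and the demo_walk() helper are made up for this example; they
are not the kernel's struct cper_arm_err_info or anything in ghes.c. The point is only
that the cursor advances by each entry's own length field and stops on a length that
cannot be right.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified entry: only the fields this sketch needs. */
struct demo_err_info {
	uint8_t version;
	uint8_t length;		/* size of this entry in bytes, set by the producer */
	uint8_t type;
	uint8_t reserved;
	uint32_t payload;
};

/*
 * Walk up to 'count' entries packed into buf[0..buf_len), advancing by each
 * entry's own length field. Returns the number of entries visited and stops
 * early if a length is smaller than the known layout or runs past the buffer.
 */
static unsigned int demo_walk(const uint8_t *buf, size_t buf_len,
			      unsigned int count, uint32_t *payload_sum)
{
	size_t off = 0;
	unsigned int seen = 0;

	*payload_sum = 0;
	while (seen < count) {
		struct demo_err_info info;

		if (buf_len - off < sizeof(info))
			break;
		/* memcpy avoids unaligned reads from the byte buffer */
		memcpy(&info, buf + off, sizeof(info));
		if (info.length < sizeof(info) || info.length > buf_len - off)
			break;

		*payload_sum += info.payload;
		seen++;
		off += info.length;	/* not sizeof(info): newer entries may be longer */
	}
	return seen;
}
```

A newer, longer entry followed by an older, shorter one is then stepped over correctly,
which 'err_info++' would not do.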
With that:
Reviewed-by: James Morse <james.morse@....com>
The following nits would make this easier to read:
> + bool is_cache = (err_info->type == CPER_ARM_CACHE_ERROR);
> + bool has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
> + /*
> + * The field (err_info->error_info & BIT(26)) is fixed to set to
> + * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
> + * firmware won't mix corrected errors in an uncorrected section,
> + * and don't filter out 'corrected' error here.
> + */
(Nothing reads err_info->error_info, I guess this is a warning to the next person to touch
this)
> + if (!is_cache || !has_pa) {
> + pr_warn_ratelimited(FW_WARN GHES_PFX
> + "Unhandled processor error type %s\n",
> + err_info->type < ARRAY_SIZE(cper_proc_error_type_strs) ?
> + cper_proc_error_type_strs[err_info->type] : "unknown error");
> + continue;
This is hard to read. The convention is to indent the extra lines to the relevant '('.
e.g.:
| pr_warn_ratelimited(FW_WARN GHES_PFX
|                     "Unhandled processor error type %s\n",
You could make it shorter by working out the error_type string earlier
e.g.:
| const char *error_type = "unknown_error";
|
| if (err_info->type < ARRAY_SIZE(cper_proc_error_type_strs))
|         error_type = cper_proc_error_type_strs[err_info->type];
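For reference, the same bounds-checked lookup as a standalone userspace sketch. The table
demo_type_strs and the helper demo_type_name() are stand-ins invented for this example;
the kernel's cper_proc_error_type_strs is defined elsewhere and its contents are only
assumed here.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Illustrative table only; assumed contents, not copied from the kernel. */
static const char * const demo_type_strs[] = {
	"cache error",
	"TLB error",
	"bus error",
	"micro-architectural error",
};

/* Return a printable name for 'type', falling back when it is out of range. */
static const char *demo_type_name(unsigned int type)
{
	const char *name = "unknown error";

	if (type < ARRAY_SIZE(demo_type_strs))
		name = demo_type_strs[type];

	return name;
}
```

Resolving the string first keeps the later print call to a single short argument.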
> + }
> + if (ghes_do_memory_failure(err_info->physical_fault_addr, 0))
> + queued = true;
| if (it_returned_true())
|         queued = true;
Looks funny, and if you moved this earlier, your pr_warn_ratelimited() would have an extra
level of indentation to play with.
i.e.:
| if (is_cache && has_pa) {
|         queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
|         continue;
| }
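To show the resulting loop shape in one place, here is a standalone userspace sketch. All
names (struct demo_entry, demo_do_memory_failure(), demo_handle(), demo_warned) are
hypothetical stand-ins; the stub only pretends to queue work and is not how
ghes_do_memory_failure() behaves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one decoded error-info entry. */
struct demo_entry {
	bool is_cache;
	bool has_pa;
	uint64_t phys_addr;
};

/* Counts entries that fell through to the warning path. */
static unsigned int demo_warned;

/* Stub: pretend queueing succeeds for any non-zero address. */
static bool demo_do_memory_failure(uint64_t phys_addr)
{
	return phys_addr != 0;
}

static bool demo_handle(const struct demo_entry *e, size_t n)
{
	bool queued = false;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Handled case first, so the rest of the body needs no nesting. */
		if (e[i].is_cache && e[i].has_pa) {
			queued = demo_do_memory_failure(e[i].phys_addr);
			continue;
		}

		/* Only unhandled entries get here; the warning would go here. */
		demo_warned++;
	}

	return queued;
}
```

With the handled case out of the way early, the warning sits one indentation level
shallower than in the patch as posted.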
Thanks,
James