Message-ID: <1baa09bb-a42f-05bb-0523-4942d60c0619@codeaurora.org>
Date: Tue, 22 Nov 2016 10:13:19 -0700
From: "Baicar, Tyler" <tbaicar@...eaurora.org>
To: John Garry <john.garry@...wei.com>, marc.zyngier@....com,
pbonzini@...hat.com, rkrcmar@...hat.com, linux@...linux.org.uk,
catalin.marinas@....com, will.deacon@....com, rjw@...ysocki.net,
lenb@...nel.org, matt@...eblueprint.co.uk, robert.moore@...el.com,
lv.zheng@...el.com, nkaje@...eaurora.org, zjzhang@...eaurora.org,
mark.rutland@....com, james.morse@....com,
akpm@...ux-foundation.org, eun.taik.lee@...sung.com,
sandeepa.s.prabhu@...il.com, shijie.huang@....com,
rruigrok@...eaurora.org, paul.gortmaker@...driver.com,
tomasz.nowicki@...aro.org, fu.wei@...aro.org, rostedt@...dmis.org,
bristot@...hat.com, linux-arm-kernel@...ts.infradead.org,
kvmarm@...ts.cs.columbia.edu, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
linux-efi@...r.kernel.org, Suzuki.Poulose@....com,
punit.agrawal@....com, astone@...hat.com, harba@...eaurora.org,
hanjun.guo@...aro.org, Shiju Jose <shiju.jose@...wei.com>,
Linuxarm <linuxarm@...wei.com>, Anurup M <anurup.m@...wei.com>
Subject: Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
Thank you John! Let me know how it goes and if you have any questions :)
Tyler
On 11/22/2016 4:11 AM, John Garry wrote:
> +
>
> We'll try and test this on our platform.
>
> Cheers,
> John
>
> On 21/11/2016 22:35, Tyler Baicar wrote:
>> When a memory error, CPU error, PCIe error, or other type of hardware
>> error that's covered by RAS occurs, firmware should populate the shared
>> GHES memory location with the proper GHES structures to notify the OS of
>> the error. For example, platforms that implement firmware-first handling
>> may implement separate GHES sources for corrected errors and uncorrected
>> errors. If the error is an uncorrectable error, then the firmware will
>> notify the OS immediately since the error needs to be handled ASAP. The
>> OS will then be able to take the appropriate action needed, such as
>> offlining a page. If the error is a corrected error, then the firmware
>> will not interrupt the OS immediately. Instead, the OS will see and
>> report the error the next time its GHES timer expires. The kernel will
>> first parse the GHES structures and report the errors through the kernel
>> logs and then notify user space through RAS trace events. This allows
>> user space applications such as RAS Daemon to see the errors and report
>> them however the user desires. This patchset extends the kernel
>> functionality for RAS errors based on updates in the UEFI 2.6 and
>> ACPI 6.1 specifications.
>>
>> An example flow from firmware to user space could be:
>>
>>                      +---------------+
>>            +-------->|               |
>>            |         | GHES polling  |--+
>> +-------------+      |    source     |  |   +---------------+   +------------+
>> |             |      +---------------+  |   |  Kernel GHES  |   |            |
>> |  Firmware   |                         +-->|  CPER AER and |-->| RAS trace  |
>> |             |      +---------------+  |   |  EDAC drivers |   |   event    |
>> +-------------+      |               |  |   +---------------+   +------------+
>>            |         |   GHES sci    |--+
>>            +-------->|    source     |
>>                      +---------------+
>>
>> Add support for Generic Hardware Error Source (GHES) v2, which
>> introduces the capability for the OS to acknowledge the consumption of
>> the error record generated by the Reliability, Availability and
>> Serviceability (RAS) controller. This eliminates potential race
>> conditions between the OS and the RAS controller.
>>
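
To make the acknowledgment step concrete, here is a rough user-space
sketch of how the value written back to the Read Ack Register could be
computed from the preserve and write masks, including the bit offset
shift mentioned in the V4 notes. The struct and function names are
illustrative only and are not taken from the patches; the real code
reads and writes the register through the APEI accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GHESv2 read-ack fields (names are illustrative). */
struct read_ack_desc {
	uint64_t preserve;   /* mask of register bits to keep unchanged */
	uint64_t write;      /* bits to set to signal that the record was consumed */
	unsigned bit_offset; /* position of the field within the register */
};

/* Compute the value to write back, given the current register contents. */
static uint64_t read_ack_value(uint64_t reg, const struct read_ack_desc *d)
{
	assert(d->bit_offset < 64);
	reg &= d->preserve << d->bit_offset; /* keep only the preserved bits */
	reg |= d->write << d->bit_offset;    /* set the acknowledgment bits */
	return reg;
}
```

With preserve = 0xFE, write = 0x1 and bit_offset = 0, a register that
currently reads 0xF0 would be written back as 0xF1.
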
>> Add support for the timestamp field added to the Generic Error Data
>> Entry v3, allowing the OS to log the time that the error is generated by
>> the firmware, rather than the time the error is consumed. This improves
>> the correctness of event sequences when analyzing error logs. The
>> timestamp was added in ACPI 6.1; see Table 18-343, Generic Error Data
>> Entry.
>>
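
As an illustration of the timestamp handling, below is a small
user-space sketch that decodes an 8-byte BCD timestamp. It assumes the
byte layout seconds, minutes, hours, flags (bit 0 = precise), day,
month, year, century, and assumes the major version lives in the high
byte of the 16-bit revision field. Both are my reading of the ACPI 6.1
table and should be checked against the spec. All names here are made
up for the example.

```c
#include <assert.h>
#include <stdint.h>

struct err_time {
	unsigned year, month, day, hour, min, sec;
	int precise; /* nonzero if firmware marked the timestamp as precise */
};

/* Convert one packed BCD byte (two decimal digits) to binary. */
static unsigned bcd_to_bin(uint8_t v)
{
	assert((v >> 4) <= 9 && (v & 0x0f) <= 9);
	return (v >> 4) * 10 + (v & 0x0f);
}

/* Assumed layout: [0]=sec [1]=min [2]=hour [3]=flags [4]=day [5]=month [6]=year [7]=century */
static struct err_time decode_timestamp(const uint8_t ts[8])
{
	struct err_time t;

	t.sec = bcd_to_bin(ts[0]);
	t.min = bcd_to_bin(ts[1]);
	t.hour = bcd_to_bin(ts[2]);
	t.precise = ts[3] & 0x01;
	t.day = bcd_to_bin(ts[4]);
	t.month = bcd_to_bin(ts[5]);
	/* Year and century are separate BCD bytes, combined here. */
	t.year = bcd_to_bin(ts[7]) * 100 + bcd_to_bin(ts[6]);
	return t;
}

/* Assumption: only entries with major version >= 3 carry a timestamp. */
static int entry_has_timestamp(uint16_t revision)
{
	return (revision >> 8) >= 3;
}
```

A caller would check entry_has_timestamp() first and only then decode
the field, so that older entries are never read past their end.
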
>> Add support for ARMv8 Common Platform Error Record (CPER) per the UEFI
>> 2.6 specification. ARMv8 specific processor error information is
>> reported as part of the CPER records. This provides more detail for
>> processor error logs and can help describe ARMv8 cache, TLB, and bus
>> errors.
>>
>> Synchronous External Abort (SEA) represents a specific processor error
>> condition in ARM systems. A handler is added to recognize SEA errors,
>> and a notifier is added to parse and report the errors before the
>> process is killed. Refer to section N.2.1.1 in the Common Platform Error
>> Record appendix of the UEFI 2.6 specification.
>>
>> Currently the kernel ignores CPER records that are unrecognized.
>> On the other hand, the UEFI spec allows for non-standard (e.g. vendor
>> proprietary) error section types in CPER (Common Platform Error Record),
>> as defined in section N.2.3 of UEFI version 2.5. As a result, the user
>> is not able to see hardware error data from non-standard sections.
>>
>> If the Section Type field of a Generic Error Data Entry is unrecognized,
>> print the raw data to the dmesg buffer and add a tracepoint for
>> reporting such hardware errors.
>>
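
For readers unfamiliar with what "raw data" looks like here, the
following is a minimal user-space sketch of an offset-prefixed hex dump
of a section payload. It is only an illustration: format_raw_section()
is a hypothetical name, and the kernel has its own helpers for this
(print_hex_dump(), for example) which the sketch does not use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format len raw bytes as hex, 16 bytes per line, each line prefixed
 * by its offset. Returns the number of characters written (excluding
 * the terminating NUL), or -1 if the output buffer is too small.
 */
static int format_raw_section(const uint8_t *data, size_t len,
			      char *out, size_t out_sz)
{
	size_t used = 0;

	assert(out_sz > 0);
	out[0] = '\0';
	for (size_t i = 0; i < len; i++) {
		int n;

		if (i % 16 == 0) {
			/* Start a new line with the byte offset. */
			n = snprintf(out + used, out_sz - used, "%s%08zx:",
				     i ? "\n" : "", i);
			if (n < 0 || (size_t)n >= out_sz - used)
				return -1;
			used += (size_t)n;
		}
		n = snprintf(out + used, out_sz - used, " %02x",
			     (unsigned)data[i]);
		if (n < 0 || (size_t)n >= out_sz - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

Dumping the offset alongside the bytes makes it easier to line the
output up against a vendor's description of its section layout.
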
>> Currently, even if an error status block's severity is fatal, the kernel
>> does not honor the severity level by panicking. With the firmware-first
>> model, the platform could inform the OS about a fatal hardware error
>> through the non-NMI GHES notification type. The OS should panic when a
>> hardware error record is received with this severity.
>>
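
The intended policy can be sketched as a simple mapping from record
severity to action. The severity values below are assumed to match the
CPER definitions (recoverable = 0, fatal = 1, corrected = 2,
informational = 3). The action enum and function name are hypothetical
and exist only for this example; they are not how the patch is
structured.

```c
#include <assert.h>

/* Assumed to mirror the CPER severity encoding. */
enum cper_sev {
	CPER_SEV_RECOVERABLE = 0,
	CPER_SEV_FATAL = 1,
	CPER_SEV_CORRECTED = 2,
	CPER_SEV_INFORMATIONAL = 3,
};

/* Hypothetical actions, for illustration only. */
enum ghes_action { ACT_LOG, ACT_RECOVER, ACT_PANIC };

static enum ghes_action action_for_severity(unsigned sev)
{
	assert(sev <= CPER_SEV_INFORMATIONAL);
	switch (sev) {
	case CPER_SEV_FATAL:
		return ACT_PANIC;   /* honor fatal severity regardless of notification type */
	case CPER_SEV_RECOVERABLE:
		return ACT_RECOVER; /* e.g. try to offline the affected page */
	default:
		return ACT_LOG;     /* corrected or informational: report only */
	}
}
```

The point is that the decision depends on the severity in the record,
not on whether the notification arrived as an NMI.
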
>> Add support to handle SEAs that occur while a KVM guest kernel is
>> running. Currently these are unsupported by the guest abort handling.
>>
>> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
>> https://lkml.org/lkml/2016/8/10/231
>>
>> V5: Fix GHES goto logic for error conditions
>>     Change ghes_do_read_ack to ghes_ack_error
>>     Make sure data version check is >= 3
>>     Use CPER helper functions in print functions
>>     Make handle_guest_sea() dummy function static for arm
>>     Add arm to subject line for KVM patch
>>
>> V4: Add bit offset left shift to read_ack_write value
>>     Make HEST generic and generic_v2 structures a union in the ghes
>>       structure
>>     Move gdata v3 helper functions into ghes.h to avoid duplication
>>     Reorder the timestamp print and avoid memcpy
>>     Add helper functions for gdata size checking
>>     Rename the SEA functions
>>     Add helper function for GHES panics
>>     Set fru_id to NULL UUID at variable declaration
>>     Limit ARM trace event parameters to the needed structures
>>     Reorder the ARM trace event variables to save space
>>     Add comment for why we don't pass SEAs to the guest when it aborts
>>     Move ARM trace event call into GHES driver instead of CPER
>>
>> V3: Fix unmapped address to the read_ack_register in ghes.c
>>     Add helper function to get the proper payload based on generic
>>       data entry version
>>     Move timestamp print to avoid changing function calls in cper.c
>>     Remove patch "arm64: exception: handle instruction abort at
>>       current EL" since the el1_ia handler is already added in 4.8
>>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>>     Add a new trace event for ARM type errors
>>     Add support to handle KVM guest SEAs
>>
>> V2: Add PSCI state print for the ARMv8 error type.
>>     Separate timestamp year into year and century using BCD format.
>>     Rebase on top of ACPICA 20160318 release and remove header file
>>       changes in include/acpi/actbl1.h.
>>     Add panic OS with fatal error status block patch.
>>     Add processing of unrecognized CPER error section patches with
>>       updates from previous comments. Original patches:
>>       https://lkml.org/lkml/2015/9/8/646
>>
>> V1: https://lkml.org/lkml/2016/2/5/544
>>
>> Jonathan (Zhixiong) Zhang (1):
>>   acpi: apei: panic OS with fatal error status block
>>
>> Tyler Baicar (9):
>>   acpi: apei: read ack upon ghes record consumption
>>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>>   efi: parse ARMv8 processor error
>>   arm64: exception: handle Synchronous External Abort
>>   acpi: apei: handle SEA notification type for ARMv8
>>   efi: print unrecognized CPER section
>>   ras: acpi / apei: generate trace event for unrecognized CPER section
>>   trace, ras: add ARM processor error trace event
>>   arm/arm64: KVM: add guest SEA support
>>
>>  arch/arm/include/asm/kvm_arm.h       |   1 +
>>  arch/arm/include/asm/system_misc.h   |   5 +
>>  arch/arm/kvm/mmu.c                   |  18 ++-
>>  arch/arm64/Kconfig                   |   1 +
>>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>>  arch/arm64/include/asm/system_misc.h |  15 +++
>>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>>  drivers/acpi/apei/Kconfig            |  14 +++
>>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>>  drivers/acpi/apei/hest.c             |   7 +-
>>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>>  drivers/ras/ras.c                    |   2 +
>>  include/acpi/ghes.h                  |  15 ++-
>>  include/linux/cper.h                 |  84 ++++++++++++++
>>  include/ras/ras_event.h              | 100 +++++++++++++++++
>>  15 files changed, 688 insertions(+), 44 deletions(-)
>>
>
>
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.