Message-ID: <0038141b-713f-4024-9f8b-e5f748f5a6a9@linux.alibaba.com>
Date: Mon, 5 Jan 2026 17:12:25 +0800
From: Ruidong Tian <tianruidong@...ux.alibaba.com>
To: Borislav Petkov <bp@...en8.de>
Cc: catalin.marinas@....com, will@...nel.org, lpieralisi@...nel.org,
guohanjun@...wei.com, sudeep.holla@....com, xueshuai@...ux.alibaba.com,
linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, rafael@...nel.org, lenb@...nel.org,
tony.luck@...el.com, yazen.ghannam@....com, misono.tomohiro@...itsu.com,
fengwei_yin@...ux.alibaba.com
Subject: Re: [PATCH v5 00/17] ARM Error Source Table V2 Support
Thanks for the review and the comments.
On 2025/12/31 04:22, Borislav Petkov wrote:
> Some high-level notes first:
>
> On Tue, Dec 30, 2025 at 05:09:28PM +0800, Ruidong Tian wrote:
>> This series introduces support for the ARM Error Source Table (AEST), aligning
>> with version 2.0 of ACPI for Armv8 RAS Extensions [0].
>
> I'd like to hear from ARM folks what the strategy for this thing is...
That's a good point. I've CC'ed the ARM maintainers for their input.
>
>> AEST provides a critical mechanism for hardware to directly notify the
>> operating system kernel about RAS errors via interrupts, a concept known as
>> Kernel-first error handling. Compared to firmware-first error handling
>> (e.g., GHES), AEST offers a more lightweight approach. This efficiency allows
>> the OS to potentially report every Corrected Error (CE), enabling upper-layer
>> applications to leverage CE information for error prediction[1][2].
>>
>> This series is based on Tyler Baicar's preliminary patches [3], which have not
>> yet been sent to the mailing list as v2.
>
> I guess I'll wait for those first.
I tried to follow up with Tyler about his patches back in 2022 [0] but
got no reply. Since he no longer seems to be active on the Linux mailing
lists, I decided to pick this up and post the series myself so that the
work does not stall.
[0]:
https://lore.kernel.org/all/b365db02-b28c-1b22-2e87-c011cef848e2@linux.alibaba.com/
>
>> AEST Driver Architecture
>> ========================
>>
>> The AEST driver is structured into three primary components:
>> - AEST device: Responsible for handling interrupts, managing the lifecycle
>> of AEST nodes, and processing error records.
>> - AEST node: Corresponds directly to a RAS node in the hardware
>
> What is a "RAS node"?
A RAS node is the hardware interface for error reporting and control,
consisting of one or more register sets (a collection of RAS records).
It is responsible for error logging and interrupt signaling[0].
A single hardware component can feature multiple RAS nodes. For example,
a memory controller is treated as a "RAS device", where each memory
channel has its own RAS node. Interrupts generated by these nodes are
typically aggregated into a single interrupt line managed at the RAS
device level.
Comparison with x86 MCA:
RAS record ≈ MCA bank.
RAS node ≈ A set of MCA banks + CMCI on a core.
The key difference lies in uncore handling: x86 typically maps uncore
errors (like those from a memory controller) into core-based MCA banks.
In contrast, ARM requires uncore components to provide their own
standalone RAS nodes. When a component requires multiple such nodes,
they are grouped and managed as a "RAS device" in the AEST driver.
[0]: https://developer.arm.com/documentation/ihi0100/latest
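To make the record/node/device relationship concrete, here is a rough
sketch in plain C. All names (ras_record, ras_node, ras_device,
ras_device_count_errors) are hypothetical and chosen for illustration
only; they are not the structures used in this series, and the "valid"
flag merely stands in for whatever the hardware status register reports.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model for illustration; not the series' data structures. */
struct ras_record {
	bool valid;		/* an error is currently logged in this record */
	uint64_t addr;		/* address reported by the record, if any */
};

/* One RAS node: a group of records, e.g. one memory channel. */
struct ras_node {
	struct ras_record *records;
	size_t nr_records;
};

/* One RAS device: several nodes sharing a single interrupt line. */
struct ras_device {
	struct ras_node *nodes;
	size_t nr_nodes;
	int irq;		/* aggregated interrupt for all nodes */
};

/*
 * What a handler for the shared interrupt has to do conceptually:
 * walk every node behind the device and look at every record.
 * Returns the number of records that have an error logged.
 */
size_t ras_device_count_errors(const struct ras_device *dev)
{
	size_t count = 0;

	for (size_t i = 0; i < dev->nr_nodes; i++) {
		const struct ras_node *node = &dev->nodes[i];

		for (size_t j = 0; j < node->nr_records; j++)
			if (node->records[j].valid)
				count++;
	}
	return count;
}
```

Because the interrupt is aggregated at the device level, the handler
cannot tell which node fired and has to scan all of them, which is why
the grouping into a "RAS device" is useful.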
>
>> - AEST record: Represents a set of RAS registers associated with a specific
>> error source.
>
> ...
>
>> Address Translation
>> ===================
>>
>> As described in section 2.2 [0], error addresses reported by AEST records
>> may be "node-specific Logical Addresses" rather than the "System Physical
>> Addresses" (SPA) used by the kernel. Therefore, the driver needs to translate
>> these Logical Addresses (LA) to SPA. This translation mechanism is conceptually
>> similar to AMD's Address Translation Logic (ATL) [4], leading patch 0014 to
>> introduce a common translation function for both AMD and ARM architectures.
>
> Say what now?
>
> The ATL is very AMD-specific. What does "conceptually similar" mean exactly?
By "conceptually similar," I mean that both ARM and AMD share the same
functional requirement: translating between a System Physical Address
(SPA) and a device-specific address (like a DRAM address) for RAS purposes.
The goal here is not to share the hardware-specific translation logic,
but to provide a unified interface (an abstraction layer). The actual
implementation of the translation remains entirely architecture-specific.
> What happens if we have to change the ATL and break your use case in the
> process?
Since the implementations are decoupled, changes to the internals of
AMD's ATL will not break the ARM use case. ARM would have its own
backend implementation. They only share a common top-level API or
wrapper to allow generic RAS/EDAC code to invoke the translation.
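As a purely illustrative sketch of that split (every name below is
hypothetical; none of it is taken from patch 0014 or from the AMD ATL
code), the shared part could be as small as an ops table plus a wrapper,
with each architecture registering its own backend:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface for illustration; not the API in the series. */
struct ras_addr_xlate_ops {
	/* Returns 0 and fills *spa on success, a negative value on failure. */
	int (*la_to_spa)(uint64_t la, uint64_t *spa);
};

/* The single backend registered by the architecture-specific code. */
static const struct ras_addr_xlate_ops *xlate_ops;

void ras_register_xlate(const struct ras_addr_xlate_ops *ops)
{
	xlate_ops = ops;
}

/* Generic entry point: callers never see which backend is behind it. */
int ras_la_to_spa(uint64_t la, uint64_t *spa)
{
	if (!xlate_ops || !xlate_ops->la_to_spa)
		return -1;	/* no backend registered */
	return xlate_ops->la_to_spa(la, spa);
}

/*
 * Toy backend standing in for an architecture-specific implementation.
 * The fixed offset is made up; real translation logic is far more involved.
 */
static int toy_la_to_spa(uint64_t la, uint64_t *spa)
{
	*spa = la + 0x80000000ULL;
	return 0;
}

const struct ras_addr_xlate_ops toy_xlate_ops = {
	.la_to_spa = toy_la_to_spa,
};
```

With this shape, a change inside one backend stays behind its own
la_to_spa callback; only a change to the ops signature itself would
affect the other architecture.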
>
> What exact functionality from the ATL do you really need here?
>
> Thx.
>