[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240917141519.57766bb6@imammedo.users.ipa.redhat.com>
Date: Tue, 17 Sep 2024 14:15:19 +0200
From: Igor Mammedov <imammedo@...hat.com>
To: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@...wei.com>, Shiju Jose
<shiju.jose@...wei.com>, "Michael S. Tsirkin" <mst@...hat.com>, Ani Sinha
<anisinha@...hat.com>, Cleber Rosa <crosa@...hat.com>, Dongjiu Geng
<gengdongjiu1@...il.com>, Eric Blake <eblake@...hat.com>, John Snow
<jsnow@...hat.com>, Markus Armbruster <armbru@...hat.com>, Michael Roth
<michael.roth@....com>, Paolo Bonzini <pbonzini@...hat.com>, Peter Maydell
<peter.maydell@...aro.org>, Shannon Zhao <shannon.zhaosl@...il.com>,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, qemu-arm@...gnu.org,
qemu-devel@...gnu.org
Subject: Re: [PATCH v10 00/21] Add ACPI CPER firmware first error injection
on ARM emulation
On Sat, 14 Sep 2024 08:13:21 +0200
Mauro Carvalho Chehab <mchehab+huawei@...nel.org> wrote:
> This series add support for injecting generic CPER records. Such records
> are generated outside QEMU via a provided script.
>
> On this version, the patch reworking the way offsets are calculated were
> split on several other patches, to make one logical change per patch and
> make review easier.
>
> Despite the number of patches increased from 12 to 21, there is just one
> real new patch (as the other ones are a split from a big change):
>
> acpi/generic_event_device: Update GHES migration to cover hest addr
I'm done with this round of review.
Given that the series accumulated a bunch of cleanups,
I'd suggest to move all cleanups/renamings not related
to new HEST lookup and new src id mapping to the beginning
of the series, so once they reviewed they could be split up into
a separate series that could be merged while we are ironing down
the new functionality.
> ---
>
> v10:
> - Patch 1 split on several patches to make reviews easier;
> - Added a migration patch;
> - CPER QMP command was renamed;
> - Updated some comments to better reflect exact ACPI version;
> - Removed a code to reset acks when OSPM fails to read records;
> - Removed a duplicated config GHES_CPER symbol;
> - There is now an arch-independent namespace for GHES source IDs;
> - Fixed the size of hest_ghes_notify array when creating tables;
> - acpi-hest.json is now a section of ACPI;
> - QMP command renamed from @ghes-cper to inject-ghes-error.
>
> v9:
> - Patches reorganized to make easier for reviewers;
> - source ID is now guest-OS specific;
> - Some patches got a revision history since v8;
> - Several minor cleanups.
>
> v8:
> - Fix one of the BIOS links that were incorrect;
> - Changed mem error internal injection to use a common code;
> - No more hardcoded values for CPER: instead of using just the
> payload at the QAPI, it now has the full raw CPER there;
> - Error injection script now supports changing fields at the
> Generic Error Data section of the CPER;
> - Several minor cleanups.
>
> v7:
> - Change the way offsets are calculated and used on HEST table.
> Now, it is compatible with migrations as all offsets are relative
> to the HEST table;
> - GHES interface is now more generic: the entire CPER is sent via
> QMP, instead of just the payload;
> - Some code cleanups to make the code more robust;
> - The python script now uses QEMUMonitorProtocol class.
>
> v6:
> - PNP0C33 device creation moved to aml-build.c;
> - acpi_ghes record functions now use ACPI notify parameter,
> instead of source ID;
> - the number of source IDs is now automatically calculated;
> - some code cleanups and function/var renames;
> - some fixes and cleanups at the error injection script;
> - ghes cper stub now produces an error if cper JSON is not compiled;
> - Offset calculation logic for GHES was refactored;
> - Updated documentation to reflect the GHES allocated size;
> - Added a x-mpidr object for QOM usage;
> - Added a patch making usage of x-mpidr field at ARM injection
> script;
>
> v5:
> - CPER guid is now passing as string;
> - raw-data is now passed with base64 encode;
> - Removed several GPIO left-overs from arm/virt.c changes;
> - Lots of cleanups and improvements at the error injection script.
> It now better handles QMP dialog and doesn't print debug messages.
> Also, code was split on two modules, to make easier to add more
> error injection commands.
>
> v4:
> - CPER generation moved to happen outside QEMU;
> - One patch adding support for mpidr query was removed.
>
> v3:
> - patch 1 cleanups with some comment changes and adding another place where
> the poweroff GPIO define should be used. No changes on other patches (except
> due to conflict resolution).
>
> v2:
> - added a new patch using a define for GPIO power pin;
> - patch 2 changed to also use a define for generic error GPIO pin;
> - a couple cleanups at patch 2 removing uneeded else clauses.
>
> Example of generating a CPER record:
>
> $ scripts/ghes_inject.py -d arm -p 0xdeadbeef
> GUID: e19e3d16-bc11-11e4-9caa-c2051d5d46b0
> Generic Error Status Block (20 bytes):
> 00000000 01 00 00 00 00 00 00 00 00 00 00 00 90 00 00 00 ................
> 00000010 00 00 00 00 ....
>
> Generic Error Data Entry (72 bytes):
> 00000000 16 3d 9e e1 11 bc e4 11 9c aa c2 05 1d 5d 46 b0 .=...........]F.
> 00000010 00 00 00 00 00 03 00 00 48 00 00 00 00 00 00 00 ........H.......
> 00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000040 00 00 00 00 00 00 00 00 ........
>
> Payload (72 bytes):
> 00000000 05 00 00 00 01 00 00 00 48 00 00 00 00 00 00 00 ........H.......
> 00000010 00 00 00 80 00 00 00 00 10 05 0f 00 00 00 00 00 ................
> 00000020 00 00 00 00 00 00 00 00 00 20 14 00 02 01 00 03 ......... ......
> 00000030 0f 00 91 00 00 00 00 00 ef be ad de 00 00 00 00 ................
> 00000040 ef be ad de 00 00 00 00 ........
>
> Error injected.
>
> [ 9.358364] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 9.359027] {1}[Hardware Error]: event severity: recoverable
> [ 9.359586] {1}[Hardware Error]: Error 0, type: recoverable
> [ 9.360124] {1}[Hardware Error]: section_type: ARM processor error
> [ 9.360561] {1}[Hardware Error]: MIDR: 0x00000000000f0510
> [ 9.361160] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
> [ 9.361643] {1}[Hardware Error]: running state: 0x0
> [ 9.362142] {1}[Hardware Error]: Power State Coordination Interface state: 0
> [ 9.362682] {1}[Hardware Error]: Error info structure 0:
> [ 9.363030] {1}[Hardware Error]: num errors: 2
> [ 9.363656] {1}[Hardware Error]: error_type: 0x02: cache error
> [ 9.364163] {1}[Hardware Error]: error_info: 0x000000000091000f
> [ 9.364834] {1}[Hardware Error]: transaction type: Data Access
> [ 9.365599] {1}[Hardware Error]: cache error, operation type: Data write
> [ 9.366441] {1}[Hardware Error]: cache level: 2
> [ 9.367005] {1}[Hardware Error]: processor context not corrupted
> [ 9.367753] {1}[Hardware Error]: physical fault address: 0x00000000deadbeef
> [ 9.374267] Memory failure: 0xdeadb: recovery action for free buddy page: Recovered
>
> Such script currently supports arm processor error CPER, but can easily be
> extended to other GHES notification types.
>
>
> Mauro Carvalho Chehab (21):
> acpi/ghes: add a firmware file with HEST address
> acpi/generic_event_device: Update GHES migration to cover hest addr
> acpi/ghes: get rid of ACPI_HEST_SRC_ID_RESERVED
> acpi/ghes: simplify acpi_ghes_record_errors() code
> acpi/ghes: better handle source_id and notification
> acpi/ghes: Remove a duplicated out of bounds check
> acpi/ghes: rework the logic to handle HEST source ID
> acpi/ghes: Change the type for source_id
> acpi/ghes: Don't hardcode the number of sources on ghes
> acpi/ghes: make the GHES record generation more generic
> acpi/ghes: don't crash QEMU if ghes GED is not found
> acpi/ghes: rename etc/hardware_error file macros
> acpi/ghes: better name GHES memory error function
> acpi/ghes: add a notifier to notify when error data is ready
> acpi/generic_event_device: add an APEI error device
> arm/virt: Wire up a GED error device for ACPI / GHES
> qapi/acpi-hest: add an interface to do generic CPER error injection
> docs: acpi_hest_ghes: fix documentation for CPER size
> scripts/ghes_inject: add a script to generate GHES error inject
> target/arm: add an experimental mpidr arm cpu property object
> scripts/arm_processor_error.py: retrieve mpidr if not filled
>
> MAINTAINERS | 10 +
> docs/specs/acpi_hest_ghes.rst | 6 +-
> hw/acpi/Kconfig | 5 +
> hw/acpi/aml-build.c | 10 +
> hw/acpi/generic_event_device.c | 19 +-
> hw/acpi/ghes-stub.c | 2 +-
> hw/acpi/ghes.c | 312 +++++++----
> hw/acpi/ghes_cper.c | 32 ++
> hw/acpi/ghes_cper_stub.c | 19 +
> hw/acpi/meson.build | 2 +
> hw/arm/virt-acpi-build.c | 12 +-
> hw/arm/virt.c | 19 +-
> include/hw/acpi/acpi_dev_interface.h | 1 +
> include/hw/acpi/aml-build.h | 2 +
> include/hw/acpi/generic_event_device.h | 1 +
> include/hw/acpi/ghes.h | 37 +-
> include/hw/arm/virt.h | 2 +
> qapi/acpi-hest.json | 35 ++
> qapi/meson.build | 1 +
> qapi/qapi-schema.json | 1 +
> scripts/arm_processor_error.py | 388 ++++++++++++++
> scripts/ghes_inject.py | 51 ++
> scripts/qmp_helper.py | 702 +++++++++++++++++++++++++
> target/arm/cpu.c | 1 +
> target/arm/cpu.h | 1 +
> target/arm/helper.c | 10 +-
> target/arm/kvm.c | 3 +-
> 27 files changed, 1552 insertions(+), 132 deletions(-)
> create mode 100644 hw/acpi/ghes_cper.c
> create mode 100644 hw/acpi/ghes_cper_stub.c
> create mode 100644 qapi/acpi-hest.json
> create mode 100644 scripts/arm_processor_error.py
> create mode 100755 scripts/ghes_inject.py
> create mode 100644 scripts/qmp_helper.py
>
Powered by blists - more mailing lists