linux-kernel - Re: [PATCH v5 0/9] efi/cxl-cper: Report CPER CXL component events through trace events

Message-ID: <659cb684deb2d_127da22945a@dwillia2-xfh.jf.intel.com.notmuch>
Date: Mon, 8 Jan 2024 18:59:16 -0800
From: Dan Williams <dan.j.williams@...el.com>
To: Ira Weiny <ira.weiny@...el.com>, Dan Williams <dan.j.williams@...el.com>,
	Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>, Jonathan Cameron
	<Jonathan.Cameron@...wei.com>
CC: Dan Williams <dan.j.williams@...el.com>, Shiju Jose
	<shiju.jose@...wei.com>, Yazen Ghannam <yazen.ghannam@....com>, "Davidlohr
 Bueso" <dave@...olabs.net>, Dave Jiang <dave.jiang@...el.com>, "Alison
 Schofield" <alison.schofield@...el.com>, Vishal Verma
	<vishal.l.verma@...el.com>, Ard Biesheuvel <ardb@...nel.org>,
	<linux-efi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-cxl@...r.kernel.org>, "Rafael J. Wysocki" <rafael@...nel.org>, "Bjorn
 Helgaas" <bhelgaas@...gle.com>
Subject: Re: [PATCH v5 0/9] efi/cxl-cper: Report CPER CXL component events
 through trace events

Ira Weiny wrote:
> Dan Williams wrote:
> > Smita Koralahalli wrote:
> > > On 1/8/2024 8:58 AM, Jonathan Cameron wrote:
> > > > On Wed, 20 Dec 2023 16:17:27 -0800
> > > > Ira Weiny <ira.weiny@...el.com> wrote:
> > > > 
> > > >> Series status/background
> > > >> ========================
> > > >>
> > > >> Smita has been a great help with this series.  Thank you again!
> > > >>
> > > >> Smita's testing found that the GHES code ended up printing the events
> > > >> twice.  This version avoids the duplicate print by calling the callback
> > > >> from the GHES code instead of the EFI code as suggested by Dan.
> > > > 
> > > > I'm not sure this is working as intended.
> > > > 
> > > > There is nothing gating the call in ghes_proc() of ghes_print_estatus(),
> > > > and now that the EFI code that pretty-printed these records is gone we get
> > > > the horrible kernel logging for an unknown block instead.
> > > > 
> > > > So I think we need some minimal code in cper.c to match the guids then not
> > > > log them (on basis we are arguing there is no need for new cper records).
> > > > Otherwise we are in for some messy kernel logs
> > > > 
> > > > Something like:
> > > > 
> > > > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > {1}[Hardware Error]: event severity: recoverable
> > > > {1}[Hardware Error]:  Error 0, type: recoverable
> > > > {1}[Hardware Error]:   section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> > > > {1}[Hardware Error]:   section length: 0x90
> > > > {1}[Hardware Error]:   00000000: 00000090 00000007 00000000 0d938086  ................
> > > > {1}[Hardware Error]:   00000010: 00100000 00000000 00040000 00000000  ................
> > > > {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > > {1}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > > > {1}[Hardware Error]:   00000040: 00000000 00000000 00000000 00000000  ................
> > > > {1}[Hardware Error]:   00000050: 00000000 00000000 00000000 00000000  ................
> > > > {1}[Hardware Error]:   00000060: 00000000 00000000 00000000 00000000  ................
> > > > {1}[Hardware Error]:   00000070: 00000000 00000000 00000000 00000000  ................
> > > > {1}[Hardware Error]:   00000080: 00000000 00000000 00000000 00000000  ................
> > > > cxl_general_media: memdev=mem1 host=0000:10:00.0 serial=4 log=Informational : time=0 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=0 flags='' handle=0 related_handle=0 maint_op_class=0 : dpa=0 dpa_flags='' descriptor='' type='ECC Error' transaction_type='Unknown' channel=0 rank=0 device=0 comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 validity_flags=''
> > > > 
> > > > (I'm filling the record with 0s currently)
> > > 
> > > Yeah, when I tested this, I thought it's okay for the hexdump to be there
> > > in dmesg from EFI as the handling is done in trace events from GHES.
> > > 
> > > If we need to handle from EFI, then it would be a good reason to move
> > > the GUIDs out from GHES and place them in a common location for EFI/cper
> > > to share, similar to protocol errors.
> > 
> > Ah, yes, my expectation was more aligned with Jonathan's observation to
> > do the processing in GHES code *and* skip the processing in the CPER
> > code, something like:
> > 
> 
> Agreed, this was intended.  I did not realize the above.
> 
> > 
> > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> > index 35c37f667781..0a4eed470750 100644
> > --- a/drivers/firmware/efi/cper.c
> > +++ b/drivers/firmware/efi/cper.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/bcd.h>
> >  #include <acpi/ghes.h>
> >  #include <ras/ras_event.h>
> > +#include <linux/cxl-event.h>
> >  #include "cper_cxl.h"
> >  
> >  /*
> > @@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
> >  			cper_print_prot_err(newpfx, prot_err);
> >  		else
> >  			goto err_section_too_small;
> > +	} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> > +		printk("%ssection_type: CXL General Media Error\n", newpfx);
> 
> Do we want the printk's here?  I did not realize that a generic event
> would be printed.  So the intention was that nothing would be done on this path.

I think we do, otherwise the kernel will say:

    {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    {1}[Hardware Error]: event severity: recoverable
    {1}[Hardware Error]:  Error 0, type: recoverable
    ...

..leaving the user hanging vs:
 
    {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    {1}[Hardware Error]: event severity: recoverable
    {1}[Hardware Error]:  Error 0, type: recoverable
    {1}[Hardware Error]:   section type: General Media Error

..as an indicator to go follow up with rasdaemon or whatever else is
doing the detailed monitoring of CXL events.
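
To make the "label only, no hexdump" behavior concrete, here is a minimal
user-space sketch of the idea, not the actual cper.c change. Everything
prefixed sketch_ is hypothetical: struct sketch_guid stands in for the
kernel's guid_t, SKETCH_GUID_INIT() is assumed to mirror the byte layout of
GUID_INIT(), and sketch_print_section() stands in for the relevant branch of
cper_estatus_print_section(). The GUID value is the one shown in the log
earlier in the thread; the label string is the one from the proposed printk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's guid_t: 16 raw bytes. */
struct sketch_guid {
	uint8_t b[16];
};

/* Assumed to mirror GUID_INIT(): the first three fields are stored little-endian. */
#define SKETCH_GUID_INIT(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)		\
	{{ (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
	   (b) & 0xff, ((b) >> 8) & 0xff,					\
	   (c) & 0xff, ((c) >> 8) & 0xff,					\
	   (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }}

/* fbcd0a77-c260-417f-85a9-088b1621eba6, as seen in the log above. */
const struct sketch_guid sketch_gen_media_guid =
	SKETCH_GUID_INIT(0xfbcd0a77, 0xc260, 0x417f,
			 0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6);

/* Byte-wise comparison, analogous to guid_equal(). */
int sketch_guid_equal(const struct sketch_guid *l, const struct sketch_guid *r)
{
	return memcmp(l, r, sizeof(*l)) == 0;
}

/*
 * Write the "section type" line for sec_type into out.
 * Returns 1 when the section is recognized, meaning the caller should stop
 * after the label and leave the details to the trace event consumer.
 * Returns 0 for an unknown section, where the caller would fall back to the
 * raw hexdump.
 */
int sketch_print_section(char *out, size_t len, const char *pfx,
			 const struct sketch_guid *sec_type)
{
	assert(out != NULL && len > 0);

	if (sketch_guid_equal(sec_type, &sketch_gen_media_guid)) {
		snprintf(out, len, "%ssection_type: CXL General Media Error\n", pfx);
		return 1;
	}

	snprintf(out, len, "%ssection type: unknown\n", pfx);
	return 0;
}
```

A caller would pass the section's GUID and the current log prefix, print the
returned line, and only dump the payload when the return value is 0. In the
real patch the same effect comes from adding an else-if branch per CXL event
GUID ahead of the unknown-section fallback.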