linux-kernel - Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support


Message-ID: <CAJZ5v0g_DyQAnSuigBc-f0UNmW0mo=0yMadES+0NhphJs_k+cw@mail.gmail.com>
Date:   Mon, 21 Aug 2023 20:01:05 +0200
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     "Limonciello, Mario" <mario.limonciello@....com>
Cc:     Avadhut Naik <avadnaik@....com>,
        "Wilczynski, Michal" <michal.wilczynski@...el.com>,
        Avadhut Naik <avadhut.naik@....com>, lenb@...nel.org,
        linux-acpi@...r.kernel.org, yazen.ghannam@....com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] ACPI: PHAT: Add Platform Health Assessment Table support

On Mon, Aug 21, 2023 at 7:52 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
>
> On Mon, Aug 21, 2023 at 7:35 PM Limonciello, Mario
> <mario.limonciello@....com> wrote:
> >
> >
> >
> > On 8/21/2023 12:29 PM, Rafael J. Wysocki wrote:
> > > On Mon, Aug 21, 2023 at 7:17 PM Limonciello, Mario
> > > <mario.limonciello@....com> wrote:
> > >>
> > >> On 8/21/2023 12:12 PM, Rafael J. Wysocki wrote:
> > >> <snip>
> > >>>> I was just talking to some colleagues about PHAT recently as well.
> > >>>>
> > >>>> The use case that jumps out is "system randomly rebooted while I was
> > >>>> doing XYZ".  You don't know what happened, but you keep using your
> > >>>> system.  Then it happens again.
> > >>>>
> > >>>> If the reason for the random reboot is captured to dmesg you can cross
> > >>>> reference your journal from the next boot after any random reboot and
> > >>>> get the reason for it.  If a user reports this to a Gitlab issue tracker
> > >>>> or Bugzilla it can be helpful in establishing a pattern.
> > >>>>
> > >>>>>> The below location may be appropriate in that case:
> > >>>>>> /sys/firmware/acpi/
> > >>>>>
> > >>>>> Yes, it may.
> > >>>>>
> > >>>>>> We already have FPDT and BGRT being exported from there.
> > >>>>>
> > >>>>> In fact, all of the ACPI tables can be retrieved verbatim from
> > >>>>> /sys/firmware/acpi/tables/ already, so why exactly do you want the
> > >>>>> kernel to parse PHAT in particular?
> > >>>>>
> > >>>>
> > >>>> It's not to say that /sys/firmware/acpi/PHAT isn't useful, but having
> > >>>> something internal to the kernel "automatically" parsing it and saving
> > >>>> information to a place like the kernel log that is already captured by
> > >>>> existing userspace tools I think is "more" useful.
> > >>>
> > >>> What existing user space tools do you mean?  Is there anything already
> > >>> making use of the kernel's PHAT output?
> > >>>
> > >>
> > >> I meant that things like systemd already capture the kernel log
> > >> ring buffer.  If you save stuff like this into the kernel log, it's going
> > >> to be indexed and easier to grep for boots that had it.
> > >>
> > >>> And why can't user space simply parse PHAT by itself?
> > >>>
> > >>> There are multiple ACPI tables that could be dumped into the kernel
> > >>> log, but they aren't.  Guess why.
> > >>
> > >> Right; there's no reason it can't be done by userspace directly.
> > >>
> > >> Another way to approach this problem could be to modify tools that
> > >> excavate records from a reboot to also get PHAT.  For example
> > >> systemd-pstore will get any kernel panics from the previous boot from
> > >> the EFI pstore and put them into /var/lib/systemd/pstore.
> > >>
> > >> No reason that couldn't be done automatically for PHAT too.
> > >
> > > I'm not sure about the connection between the PHAT dump in the kernel
> > > log and pstore.
> > >
> > > The PHAT dump would be from the time before the failure, so it is
> > > unclear to me how useful it can be for diagnosing it.  However, after
> > > a reboot one should be able to retrieve PHAT data from the table
> > > directly and that may include some information regarding the failure.
> >
> > Right so the thought is that at bootup you get the last entry from PHAT
> > and save that into the log.
> >
> > Let's say you have 3 boots:
> > X - Triggered a random reboot
> > Y - Cleanly shut down
> > Z - Boot after a clean shut down
> >
> > So on boot Y you would have in your logs the reason that boot X rebooted.
>
> Yes, and the same can be retrieved from the PHAT directly from user
> space at that time, can't it?
>
> > On boot Z you would see something about boot Y's reason.
> >
> > >
> > > With pstore, the assumption is that there will be some information
> > > relevant for diagnosing the failure in the kernel buffer, but I'm not
> > > sure how the PHAT dump from before the failure can help here?
> >
> > Alone it's not useful.
> > I had figured if you can put it together with other data it's useful.
> > For example if you had some thermal data in the logs showing which
> > component overheated or if you looked at pstore and found a NULL pointer
> > dereference.
>
> IIUC, the current PHAT content can be useful.  The PHAT content from
> boot X (before the failure), which is what will be there in pstore
> after the random reboot, is of limited value AFAICS.

To be more precise, I don't see why the kernel needs to be made a
man-in-the-middle between the firmware which is the source of the
information and user space that consumes it.
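
For illustration only, here is a rough sketch of what parsing PHAT from
user space could look like. It is not taken from the patch under
discussion. It assumes the table shows up as
/sys/firmware/acpi/tables/PHAT (following the tables/ directory mentioned
above), the standard 36-byte ACPI table header, and the 5-byte PHAT record
header (2-byte type, 2-byte length, 1-byte revision) as I read the ACPI
specification. The names parse_phat and RECORD_NAMES are made up for this
example.

```python
import struct

ACPI_HEADER_LEN = 36                    # standard ACPI table header size
RECORD_HEADER = struct.Struct("<HHB")   # record type, record length, revision

# Record types as I understand them from the ACPI spec (assumption).
RECORD_NAMES = {0: "firmware version data", 1: "firmware health data"}


def parse_phat(table: bytes):
    """Return a list of (type, revision, payload) tuples, one per PHAT record."""
    if len(table) < ACPI_HEADER_LEN or table[:4] != b"PHAT":
        raise ValueError("not a PHAT table")

    # Bytes 4..7 of every ACPI table hold its total length (little endian).
    (total_len,) = struct.unpack_from("<I", table, 4)
    if total_len < ACPI_HEADER_LEN or total_len > len(table):
        raise ValueError("bad table length")

    # All bytes of a valid ACPI table sum to zero modulo 256.
    if sum(table[:total_len]) & 0xFF:
        raise ValueError("bad checksum")

    records = []
    offset = ACPI_HEADER_LEN
    while offset + RECORD_HEADER.size <= total_len:
        rtype, rlen, rev = RECORD_HEADER.unpack_from(table, offset)
        # The record length covers the 5-byte record header itself.
        if rlen < RECORD_HEADER.size or offset + rlen > total_len:
            raise ValueError("malformed record at offset %d" % offset)
        payload = table[offset + RECORD_HEADER.size:offset + rlen]
        records.append((rtype, rev, payload))
        offset += rlen
    return records
```

A tool run at boot could read the file (typically as root) with
open("/sys/firmware/acpi/tables/PHAT", "rb"), pass the bytes to
parse_phat(), and log each record using RECORD_NAMES.get(rtype, "unknown").
Decoding the record payloads is left out here on purpose.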