Message-ID: <Ya+LukojuewlomeF@yaz-ubuntu>
Date: Tue, 7 Dec 2021 16:28:42 +0000
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
tony.luck@...el.com, x86@...nel.org,
Smita.KoralahalliChannabasappa@....com, mukul.joshi@....com,
alexander.deucher@....com, william.roche@...cle.com
Subject: Re: [PATCH 1/3] x86/MCE/AMD: Provide an "Unknown" MCA bank type

On Fri, Dec 03, 2021 at 11:17:45PM +0100, Borislav Petkov wrote:
> On Fri, Dec 03, 2021 at 02:00:15AM +0000, Yazen Ghannam wrote:
> > The AMD MCA Thresholding sysfs interface populates directories for each
> > bank and thresholding block. The name used for each directory is looked
> > up in a table of known bank types. However, new bank types won't match
> > in this list and will return NULL for the name. This will cause the
> > machinecheck sysfs interface to fail to be populated.
> >
> > Set new and unknown MCA bank types to the "unknown" type. Also,
> > ensure that the bank's thresholding block directories have unique names.
> > This will ensure that the machinecheck sysfs interface can be
> > initialized.
>
> What is the advantage of having a sysfs directory structure headed with
> an "unknown" entry vs not having that structure at all when the kernel
> runs on a machine for which it has not been enabled yet?
>
> IOW, if those new banks would need additional enablement, what's the
> point of having "unknown" on older kernels which do not have any
> functionality?
>
> IOW, how does this:
>
> /sys/devices/system/machinecheck/machinecheck0/unknown/unknown/
> ├── error_count
> ├── interrupt_enable
> └── threshold_limit
>
> help a user?

Yeah, I see your point.

>
> Btw, looking at the current layout:
>
> ...
> ├── insn_fetch
> │   └── insn_fetch
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ├── l2_cache
> │   └── l2_cache
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ...
>
> we have those names repeated which looks wonky and useless too. I'd
> expect them to be:
>
> ...
> ├── insn_fetch
> │   ├── error_count
> │   ├── interrupt_enable
> │   └── threshold_limit
> ├── l2_cache
> │   ├── error_count
> │   ├── interrupt_enable
> │   └── threshold_limit
> ...
>
> Can we fix that too pls?
>

Sure thing. But I don't think removing the second directory will be okay. The
layout is "bank"/"block". If the "block" has special use like DRAM ECC, or L3
Cache on older systems, then it'll have a unique name. Otherwise, the block
will take the name of the bank.
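
To make the current naming rule concrete, here is a minimal userspace sketch
in Python. It is not kernel code: the function names and the table of special
block names are hypothetical and only model the "bank"/"block" rule described
above.

```python
# Hypothetical model of the current "bank"/"block" naming rule.
# Not kernel code; the table contents below are illustrative assumptions.

def block_dir_name(bank_name, block, special_names):
    """Return the directory name for a thresholding block.

    A block with a special use gets its own name; any other block
    reuses the name of its bank.
    """
    return special_names.get((bank_name, block), bank_name)


def block_path(bank_name, block, special_names):
    """Return the relative "bank"/"block" path for one block."""
    return f"{bank_name}/{block_dir_name(bank_name, block, special_names)}"


# Assumed example entry: block 0 of a "northbridge" bank tracks DRAM errors.
SPECIAL_NAMES = {("northbridge", 0): "dram"}

print(block_path("insn_fetch", 0, SPECIAL_NAMES))   # insn_fetch/insn_fetch
print(block_path("northbridge", 0, SPECIAL_NAMES))  # northbridge/dram
```

This is why the repeated names show up: a bank without special blocks ends up
with a single block directory named after the bank itself.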

I think the more robust solution is to drop the unique names and use generic
names like "bank"/"block". A new file called "type" can be introduced into the
directory structure, and this can return the name of the bank/block. New bank
types will return "<null>" for the "type", but the directory structure should
remain the same and functional.

I've seen this in other sysfs interfaces like cpuidle,
e.g. /sys/devices/system/cpu/cpu0/cpuidle/stateX

The "blockX/type" file is like the "stateX/desc" file. Or the "type" file can
be called "desc", since it's a description of what the bank or block
represents.

Here are a couple of examples:

/sys/devices/system/machinecheck/machinecheck0/
├── th_bank0
│   ├── type ("Instruction Fetch")
│   └── th_block0
│       ├── type ("All Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── th_bank1
│   ├── type ("Northbridge")
│   ├── th_block0
│   │   ├── type ("DRAM Errors")
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── th_block1
│       ├── type ("Link Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

OR

/sys/devices/system/machinecheck/machinecheck0/thresholding
├── bank0
│   ├── desc ("Instruction Fetch")
│   └── block0
│       ├── desc ("All Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── bank1
│   ├── desc ("Northbridge")
│   ├── block0
│   │   ├── desc ("DRAM Errors")
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── block1
│       ├── desc ("Link Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

I'm inclined toward the second option, since it keeps all the thresholding
functionality under a single directory.
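
As a rough illustration of how a userspace tool might consume the second
layout, here is a small Python sketch. The layout is only a proposal at this
point, so the directory and file names ("bankN", "blockN", "desc") are
assumptions taken from the example tree, and read_attr() and
read_thresholding() are hypothetical helper names.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of an attribute file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_thresholding(root):
    """Collect descriptions and counters from a proposed "thresholding" directory.

    Assumes the proposed layout: <root>/bankN/desc and
    <root>/bankN/blockM/{desc,error_count,interrupt_enable,threshold_limit}.
    Banks and blocks are visited in lexicographic name order.
    """
    result = {}
    for bank in sorted(Path(root).glob("bank*")):
        if not bank.is_dir():
            continue
        blocks = {}
        for block in sorted(bank.glob("block*")):
            if not block.is_dir():
                continue
            blocks[block.name] = {
                name: read_attr(block / name)
                for name in ("desc", "error_count",
                             "interrupt_enable", "threshold_limit")
            }
        result[bank.name] = {"desc": read_attr(bank / "desc"), "blocks": blocks}
    return result
```

A tool would call read_thresholding() on the per-CPU "thresholding" directory.
Since the directory names are generic, the walk itself does not need to know
the bank type in advance; only the "desc" contents vary.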

What do you think?

Thanks,
Yazen