Message-ID: <20210326224310.GL25229@zn.tnic>
Date:   Fri, 26 Mar 2021 23:43:10 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     William Roche <william.roche@...cle.com>
Cc:     linux-kernel@...r.kernel.org, Tony Luck <tony.luck@...el.com>,
        linux-edac@...r.kernel.org
Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event
 filtering

On Fri, Mar 26, 2021 at 11:24:43PM +0100, William Roche wrote:
> What we want is to make cec_add_elem() to return !0 value only
> when the given pfn triggered an action, so that its callers should
> log the error.

No, this is not what the CEC does - it collects those errors and when it
reaches the threshold for any pfn, it offlines the corresponding page. I
know, the comment above talks about:

  * That error event entry causes cec_add_elem() to return !0 value and thus
  * signal to its callers to log the error.

but it doesn't do that. Frankly, I don't see the point of logging the
error - it already says

	pr_err("Soft-offlining pfn: 0x%llx\n", pfn);

which pfn it has offlined. And that is probably only mildly interesting
to people - so what, 4K got offlined, servers have so much memory
nowadays.

The only moment one should start worrying is if one gets those pretty
often but then you're probably better off simply scheduling maintenance
and replacing the faulty DIMM - problem solved.
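
For readers following along, the collect-then-offline behaviour described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's implementation: every `sketch_*` name, the table size and the threshold value are hypothetical, the real CEC additionally keeps its array sorted and decays old counts, and the return convention below is the sketch's own rather than a statement about what cec_add_elem() returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MAX_ELEMS 8	/* hypothetical table size */
#define SKETCH_THRESHOLD 3	/* hypothetical action threshold */

struct sketch_elem {
	uint64_t pfn;
	unsigned int count;
};

struct sketch_table {
	struct sketch_elem e[SKETCH_MAX_ELEMS];
	size_t n;			/* number of slots in use */
	unsigned int offlined;		/* pages "offlined" so far */
};

/* Stand-in for the real soft-offline: it only prints and counts. */
void sketch_soft_offline(struct sketch_table *t, uint64_t pfn)
{
	printf("Soft-offlining pfn: 0x%llx\n", (unsigned long long)pfn);
	t->offlined++;
}

/*
 * Record one corrected error for @pfn.
 *
 * Sketch-only return convention:
 *   0  - error absorbed, threshold not reached
 *   1  - threshold reached, page handed to sketch_soft_offline()
 *  -1  - table full, error not recorded
 */
int sketch_add_elem(struct sketch_table *t, uint64_t pfn)
{
	size_t i;

	assert(t != NULL);

	/* Linear search for an existing entry (the sketch does not sort). */
	for (i = 0; i < t->n; i++)
		if (t->e[i].pfn == pfn)
			break;

	if (i == t->n) {
		if (t->n == SKETCH_MAX_ELEMS)
			return -1;
		t->e[i].pfn = pfn;
		t->e[i].count = 0;
		t->n++;
	}

	if (++t->e[i].count < SKETCH_THRESHOLD)
		return 0;

	/* Threshold hit: act on the page, then drop its entry. */
	sketch_soft_offline(t, pfn);
	t->e[i] = t->e[t->n - 1];
	t->n--;
	return 1;
}
```

Calling sketch_add_elem() repeatedly with the same pfn stays silent until the count reaches SKETCH_THRESHOLD; only then is the single "Soft-offlining" line printed and the entry dropped.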

> What I'm expecting from ras_cec is to "hide" CEs until they reach the
> action threshold where an action is tried against the impacted PFN,

That it does.

> and it's now the time to log the error with the entire notifiers
> chain.

And I'm not sure why we'd want to do that. It simply offlines the page.

But maybe you could explain what you're trying to achieve...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
