Message-ID: <20170323172030.GA31747@intel.com>
Date:   Thu, 23 Mar 2017 10:20:31 -0700
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     X86 ML <x86@...nel.org>, linux-edac <linux-edac@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

On Thu, Mar 23, 2017 at 04:22:28PM +0100, Borislav Petkov wrote:
> On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> > Lemme try to write a small script exercising exactly that scenario to
> > see whether I'm actually not talking crap here :-)
> 
> Ok, here's a snapshot from the CEC after letting it run for a couple of
> hours in a guest with a script running twice in parallel and injecting
> random PFNs. We have 0 offlined pages because a PFN number doesn't
> repeat frequently enough to cause an overflow.
> 
> When I force the occurrence of a single PFN for 1023 and more times and
> do that more than once, this happens:
> 
> [ 6629.091239] RAS: Soft-offlining pfn: 0x7fff
> [ 6629.093036] __get_any_page: 0x7fff free buddy page
> [ 6653.259476] RAS: Soft-offlining pfn: 0x7fff
> [ 6653.260100] soft offline: 0x7fff page already poisoned
> 
> ...
> 
> Stats:
> CEs: 32614
> offlined pages: 2
> ^^^^^^^^^^^^^^^^^
> 
> Flags: 0x0
> Timer interval: 86400 seconds
> Decays: 254
> Action threshold: 1023
> 
> The "already poisoned" thing shouldn't happen in real life because once
> the page frame is poisoned, it shouldn't generate MCEs.

It can happen if Linux didn't actually take the page offline
(because it was a kernel page). The CEC code only knows that
it queued this page to be taken offline ... and has no way
to know if that succeeded or not.

Some people have grumbled about mcelog(8) doing the same thing.

So is it worth keeping track of the page numbers that we
tried to offline?  If they show up again we shouldn't add
them back into the array.
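
A rough user-space sketch of that bookkeeping, in C, might look like the
following. This is not the CEC code: the names offlined_set,
offlined_remember(), should_count() and the OFFLINED_MAX capacity are all
hypothetical, and a real kernel version would need locking and would have
to fit in with the existing array handling. It only shows the idea:
remember the PFNs that were queued for offlining, and skip them if they
are reported again.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; the oldest entry is overwritten once it is full. */
#define OFFLINED_MAX 64

/* Small ring of PFNs that were queued for soft-offlining. */
struct offlined_set {
	uint64_t pfns[OFFLINED_MAX];
	size_t count;	/* number of valid entries */
	size_t next;	/* slot that the next new PFN goes into */
};

/* Linear search; fine for a small, rarely updated set. */
static bool offlined_contains(const struct offlined_set *s, uint64_t pfn)
{
	for (size_t i = 0; i < s->count; i++)
		if (s->pfns[i] == pfn)
			return true;
	return false;
}

/* Record a PFN at the point where it is queued for offlining. */
static void offlined_remember(struct offlined_set *s, uint64_t pfn)
{
	if (offlined_contains(s, pfn))
		return;

	s->pfns[s->next] = pfn;
	s->next = (s->next + 1) % OFFLINED_MAX;
	if (s->count < OFFLINED_MAX)
		s->count++;
}

/*
 * Returns true if a corrected error on this PFN should still be counted,
 * i.e. we have not already tried to offline the page.
 */
static bool should_count(const struct offlined_set *s, uint64_t pfn)
{
	return !offlined_contains(s, pfn);
}
```

With something like this, the path that logs a corrected error would call
should_count() before inserting the PFN, and the path that queues the
soft-offline would call offlined_remember(). It still does not tell us
whether the offline succeeded, only that we already asked for it.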

-Tony
