Message-ID: <20190420094120.GB29704@zn.tnic>
Date:   Sat, 20 Apr 2019 11:41:20 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     Cong Wang <xiyou.wangcong@...il.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote:
> Now there isn't really anything better that CEC can do in
> this situation. It won't help to have a bigger array. Taking
> pages offline wouldn't solve the problem (though if that
> did happen at least it would break the silence).
> 
> Same situation for other DRAM failure modes that affect a
> wide range of pages (rank, bank, perhaps row ... though all
> the errors from a single row failure might fit in the CEC array).
> 
> Allowing the user to bypass CEC (without a reboot ... cloud folks
> hate to reboot their systems) would allow the sysadmin to see
> what is happening (either via /dev/mcelog, or via EDAC driver).

Err, this all sounds to me like the storm detection code should
*automatically* disable the CEC in such cases, I'd say. Because I
don't see a cloud admin going into the debugfs and turning it off.
Rather, if the detection heuristic we use is smart enough, having it
disable the CEC automatically should be a much better serviceability action.
Hmmm?
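
To make the idea concrete, here is a rough userspace sketch of what such a
heuristic could look like. This is not existing kernel code: the struct, the
function names, the 60-second window and the threshold of 100 errors are all
hypothetical, and a real version would have to hook into the actual storm
detection and CEC paths and decide when to re-enable the CEC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables, chosen only for illustration. */
#define STORM_WINDOW_SECS	60
#define STORM_THRESHOLD		100

/* Hypothetical per-system state for the heuristic. */
struct ce_storm {
	uint64_t window_start;	/* start of the current window, in seconds */
	unsigned int count;	/* corrected errors seen in this window */
	bool cec_enabled;	/* false once a storm has been detected */
};

static void ce_storm_init(struct ce_storm *s, uint64_t now)
{
	s->window_start = now;
	s->count = 0;
	s->cec_enabled = true;
}

/*
 * Record one corrected error at time @now (seconds).
 *
 * Returns true if the CEC should still consume the error, false if it
 * should be bypassed so that the error is reported to the admin.
 * Once disabled, the CEC stays disabled; a re-enable policy is left out.
 */
static bool ce_storm_note_error(struct ce_storm *s, uint64_t now)
{
	/* Start a fresh window once the current one has expired. */
	if (now - s->window_start >= STORM_WINDOW_SECS) {
		s->window_start = now;
		s->count = 0;
	}

	/* Too many errors inside one window: treat it as a storm. */
	if (++s->count > STORM_THRESHOLD)
		s->cec_enabled = false;

	return s->cec_enabled;
}
```

With something along these lines, a slow trickle of corrected errors keeps
going through the CEC, while a burst wider than the threshold turns it off
without the admin having to touch debugfs.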
-- 
Regards/Gruss,
    Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.