Message-ID: <SJ1PR11MB60831F028E2FEB6B5A3390D9FC14A@SJ1PR11MB6083.namprd11.prod.outlook.com>
Date: Tue, 16 Sep 2025 15:20:49 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: "Meyer, Kyle" <kyle.meyer@....com>, Andrew Morton
<akpm@...ux-foundation.org>
CC: "corbet@....net" <corbet@....net>, "david@...hat.com" <david@...hat.com>,
"linmiaohe@...wei.com" <linmiaohe@...wei.com>, "shuah@...nel.org"
<shuah@...nel.org>, "jane.chu@...cle.com" <jane.chu@...cle.com>,
"jiaqiyan@...gle.com" <jiaqiyan@...gle.com>, "Liam.Howlett@...cle.com"
<Liam.Howlett@...cle.com>, "bp@...en8.de" <bp@...en8.de>,
"hannes@...xchg.org" <hannes@...xchg.org>, "jack@...e.cz" <jack@...e.cz>,
"joel.granados@...nel.org" <joel.granados@...nel.org>, "laoar.shao@...il.com"
<laoar.shao@...il.com>, "lorenzo.stoakes@...cle.com"
<lorenzo.stoakes@...cle.com>, "mclapinski@...gle.com"
<mclapinski@...gle.com>, "mhocko@...e.com" <mhocko@...e.com>,
"nao.horiguchi@...il.com" <nao.horiguchi@...il.com>, "osalvador@...e.de"
<osalvador@...e.de>, "Wysocki, Rafael J" <rafael.j.wysocki@...el.com>,
"rppt@...nel.org" <rppt@...nel.org>, "Anderson, Russ"
<russ.anderson@....com>, "Fan, Shawn" <shawn.fan@...el.com>,
"surenb@...gle.com" <surenb@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: RE: [PATCH v2] mm/memory-failure: Support disabling soft offline for
HugeTLB pages
>> > Reported-by: Shawn Fan <shawn.fan@...el.com>
>>
>> Interesting. What did Shawn report? (Closes:!).
>
> Tony or Shawn, could you please point me to the original report? Thanks!
The original report is internal to Intel, so there is no useful link for the
community (but I still wanted to give credit).
To recap the original problem: some BIOSes keep track of an error threshold
per rank and use this GHES mechanism to report that the threshold has been
exceeded on the rank.
Systems that stay up a long time can accumulate enough soft errors
to trigger this threshold. But taking a page offline isn't going to
help. For a 4K page this is merely annoying. For a 1G page it can
mess things up badly.
My original patch for this just skipped the GHES->offline process
for huge pages. But I wasn't aware of the sysctl control. That provides
a better solution.
-Tony