Message-ID: <aMkOCmGBhZKhKPrI@hpe.com>
Date: Tue, 16 Sep 2025 02:14:17 -0500
From: Kyle Meyer <kyle.meyer@....com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: corbet@....net, david@...hat.com, linmiaohe@...wei.com, shuah@...nel.org,
tony.luck@...el.com, jane.chu@...cle.com, jiaqiyan@...gle.com,
Liam.Howlett@...cle.com, bp@...en8.de, hannes@...xchg.org,
jack@...e.cz, joel.granados@...nel.org, laoar.shao@...il.com,
lorenzo.stoakes@...cle.com, mclapinski@...gle.com, mhocko@...e.com,
nao.horiguchi@...il.com, osalvador@...e.de, rafael.j.wysocki@...el.com,
rppt@...nel.org, russ.anderson@....com, shawn.fan@...el.com,
surenb@...gle.com, vbabka@...e.cz, linux-acpi@...r.kernel.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2] mm/memory-failure: Support disabling soft offline for
HugeTLB pages
On Mon, Sep 15, 2025 at 08:16:18PM -0700, Andrew Morton wrote:
> On Mon, 15 Sep 2025 19:27:41 -0500 Kyle Meyer <kyle.meyer@....com> wrote:
>
> > Soft offlining a HugeTLB page reduces the HugeTLB page pool.
> >
> > Commit 56374430c5dfc ("mm/memory-failure: userspace controls soft-offlining pages")
> > introduced the following sysctl interface to control soft offline:
> >
> > /proc/sys/vm/enable_soft_offline
> >
> > The interface does not distinguish between page types:
> >
> > 0 - Soft offline is disabled
> > 1 - Soft offline is enabled
> >
> > Convert enable_soft_offline to a bitmask and support disabling soft
> > offline for HugeTLB pages:
> >
> > Bits:
> >
> > 0 - Enable soft offline
> > 1 - Disable soft offline for HugeTLB pages
> >
> > Supported values:
> >
> > 0 - Soft offline is disabled
> > 1 - Soft offline is enabled
> > 3 - Soft offline is enabled (disabled for HugeTLB pages)
> >
> > Existing behavior is preserved.
>
> um, why? What benefit does this patch provide to our users?
> Use-cases, before-and-after scenarios, etc?
Thank you for the feedback.

Some BIOSes suppress ("cloak") corrected memory errors until a threshold
is reached. Once that threshold is reached, the BIOS reports a CPER with
the "error threshold exceeded" bit set via GHES, and the corresponding
page is soft offlined.

The BIOS does not know the page type of the corresponding page. If that
page happens to be a HugeTLB page, the HugeTLB page is dissolved,
permanently reducing the HugeTLB page pool. This can be problematic for
workloads that depend on a fixed number of HugeTLB pages.

Currently, the only way to prevent HugeTLB pages from being soft
offlined is to disable soft offline entirely, for all page types.

This patch provides a middle ground. Soft offline can be disabled for
HugeTLB pages while remaining enabled for non-HugeTLB pages, preserving
the benefits of soft offline without the risk of BIOS-triggered soft
offlining of HugeTLB pages.
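To make the proposed semantics concrete, here is a rough userspace sketch
of how the bitmask could be interpreted. It is not the code in the patch:
the macro and function names are hypothetical and chosen only for
illustration. The value 2 is not listed as a supported value, so the
sketch simply treats it like any other value with bit 0 clear.

```c
#include <stdbool.h>

/* Hypothetical names; the actual patch may use different identifiers. */
#define SOFT_OFFLINE_ENABLE          (1u << 0) /* bit 0: enable soft offline */
#define SOFT_OFFLINE_DISABLE_HUGETLB (1u << 1) /* bit 1: disable it for HugeTLB pages */

/*
 * Return true if a page would be soft offlined under the given
 * enable_soft_offline value, following the bit layout described above.
 */
bool soft_offline_allowed(unsigned int mask, bool is_hugetlb)
{
	/* Bit 0 clear: soft offline is disabled for every page type. */
	if (!(mask & SOFT_OFFLINE_ENABLE))
		return false;

	/* Bit 1 set: HugeTLB pages are skipped, other pages are not. */
	if (is_hugetlb && (mask & SOFT_OFFLINE_DISABLE_HUGETLB))
		return false;

	return true;
}
```

Under this reading, 0 and 1 behave exactly as they do today, and 3 keeps
soft offline for non-HugeTLB pages while leaving the HugeTLB pool intact.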
> > Update documentation and HugeTLB soft offline self tests.
> >
> > Reported-by: Shawn Fan <shawn.fan@...el.com>
>
> Interesting. What did Shawn report? (Closes:!).
Tony or Shawn, could you please point me to the original report? Thanks!
> > Suggested-by: Tony Luck <tony.luck@...el.com>
> > Signed-off-by: Kyle Meyer <kyle.meyer@....com>
> >
> > ...
> >
> > .../ABI/testing/sysfs-memory-page-offline | 3 ++
> > Documentation/admin-guide/sysctl/vm.rst | 28 ++++++++++++++++---
> > mm/memory-failure.c | 17 +++++++++--
> > .../selftests/mm/hugetlb-soft-offline.c | 19 ++++++++++---
> > 4 files changed, 56 insertions(+), 11 deletions(-)
>
> I'll add it because testing, but please do explain why I added it?
Thanks,
Kyle Meyer