Message-ID: <Ymk1BkSEs5uCgV6e@localhost.localdomain>
Date: Wed, 27 Apr 2022 14:20:22 +0200
From: Oscar Salvador <osalvador@...e.de>
To: David Hildenbrand <david@...hat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@...ux.dev>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
Miaohe Lin <linmiaohe@...wei.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Yang Shi <shy828301@...il.com>,
Muchun Song <songmuchun@...edance.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload
related to hugetlb and memory_hotplug
On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisened pages.
>
> 1) It overcomplicates the offlining code and seems to be partially
> broken
> 2) It happens rarely (ever?), so do we even care?
> 3) Once the memory is offline, we can re-online it and lost HWPoison.
> The memory can be happily used.
>
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me then just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).
>
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?
>
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.
I tend to agree with everything you said.
And to be honest, the mechanism of making a semi-broken DIMM healthy
again has always been a mystery to me.
One would think that:
1- you hot-remove the memory
2- you fix/remove it
3- you hotplug memory again
but I am not sure how often that has actually happened in practice.
There is also the problem of losing the hwpoison information across
offline<->online transitions, which makes the whole thing unreliable.
To fix that, we would have to add a bunch of code to keep track of
"offlined" pages that are hwpoisoned, so that we can flag them again
once they get onlined, and that means more room for errors.
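To make the bookkeeping cost concrete, here is a rough userspace toy
model of it, not kernel code. Every name in it (toy_mem, toy_offline,
toy_online, the "saved" array) is made up for illustration, and the
reset of the per-page flag on offline only models the information loss
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_mem {
	bool hwpoison[TOY_NR_PAGES];	/* per-page flag, lost on offline */
	bool saved[TOY_NR_PAGES];	/* extra state that survives offline */
	bool online;
};

/* Offline the range: remember poisoned pages, then drop per-page state. */
void toy_offline(struct toy_mem *m)
{
	assert(m != NULL);
	if (!m->online)
		return;
	for (size_t i = 0; i < TOY_NR_PAGES; i++) {
		m->saved[i] = m->hwpoison[i];
		m->hwpoison[i] = false;	/* models the lost information */
	}
	m->online = false;
}

/* Online the range; with restore == false the poison marks stay lost. */
void toy_online(struct toy_mem *m, bool restore)
{
	assert(m != NULL);
	if (m->online)
		return;
	for (size_t i = 0; i < TOY_NR_PAGES; i++) {
		if (restore)
			m->hwpoison[i] = m->saved[i];
	}
	m->online = true;
}
```

Even in this toy form, the "saved" state has to be kept consistent on
every offline and online path, which is where the room for errors comes
from.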
So, I would lean towards not allowing memory that contains such pages
to be offlined in the first place, unless that proves to be a no-go.
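As a rough illustration of that policy, here is a userspace toy model,
not kernel code. The types and the function toy_offline_block are
hypothetical names invented for this sketch. The offline request is
simply refused as soon as a hwpoisoned page is found in the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_page {
	bool hwpoison;
};

struct toy_block {
	struct toy_page *pages;
	size_t nr_pages;
	bool online;
};

/*
 * Try to offline a block. Returns 0 on success, or -1 if the block
 * contains a hwpoisoned page, in which case it is left online.
 */
int toy_offline_block(struct toy_block *blk)
{
	assert(blk != NULL);
	if (!blk->online)
		return 0;	/* already offline, nothing to do */

	for (size_t i = 0; i < blk->nr_pages; i++) {
		if (blk->pages[i].hwpoison)
			return -1;	/* refuse: poison info would be lost */
	}

	blk->online = false;
	return 0;
}
```

With this approach there is nothing to carry across the
offline<->online transition, at the cost of not being able to remove a
block that has a poisoned page.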
--
Oscar Salvador
SUSE Labs