lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Ymk1BkSEs5uCgV6e@localhost.localdomain>
Date:   Wed, 27 Apr 2022 14:20:22 +0200
From:   Oscar Salvador <osalvador@...e.de>
To:     David Hildenbrand <david@...hat.com>
Cc:     Naoya Horiguchi <naoya.horiguchi@...ux.dev>, linux-mm@...ck.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Yang Shi <shy828301@...il.com>,
        Muchun Song <songmuchun@...edance.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload
 related to hugetlb and memory_hotplug

On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisened pages.
> 
> 1) It overcomplicates the offlining code and seems to be partially
>    broken
> 2) It happens rarely (ever?), so do we even care?
> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>    The memory can be happily used.
> 
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me then just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).
> 
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?
> 
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.

I teend to agree with all you said.
And to be honest, the mechanism of making a semi-broken DIMM healthy
again has always been a mistery to me.

One would think that:

1- you hot-remove the memory
2- you fix/remove it
3- you hotplug memory again

but I am not sure how many times this came to be.

And there is also the thing about losing the hwpoison information
between offline<->online transitions, so, the thing is unreliable.

And for that to work, we would have to add a bunch of code
to keep track of "offlined" pages that are hwpoisoned, so we
flag them again once they get onlined, and that means more
room for errors.

So, I would lean towards the fact of not allowing to offline
memory that contain such pages in the first place, unless that
proves to be a no-go.


-- 
Oscar Salvador
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ