Message-ID: <8c4d8ebe-885e-40f0-a10e-7290067c7b96@redhat.com>
Date: Thu, 9 Oct 2025 20:57:21 +0200
From: David Hildenbrand <david@...hat.com>
To: Longlong Xia <xialonglong2025@....com>, linmiaohe@...wei.com,
 nao.horiguchi@...il.com
Cc: akpm@...ux-foundation.org, wangkefeng.wang@...wei.com,
 xu.xin16@....com.cn, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Longlong Xia <xialonglong@...inos.cn>
Subject: Re: [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory
 failures

On 09.10.25 09:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@...inos.cn>
> 
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
> 
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing page to another healthy duplicate within
> the same chain before resorting to killing processes.
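Just so I read the intent right: at the memory-failure end this would
amount to something like the below (purely a sketch; folio_test_ksm()
exists, but ksm_try_recover() is only a placeholder name for whatever
entry point the patch would export):

        /*
         * Conceptual only, not actual memory-failure.c code: if the
         * poisoned page is KSM-backed, try the recovery path first and
         * only fall back to the existing kill-every-mapper behavior
         * when recovery is impossible or fails.
         */
        if (folio_test_ksm(folio) && ksm_try_recover(folio) == 0)
                return 0;       /* mappings now use a healthy duplicate */
        /* otherwise: existing behavior, kill the processes mapping it */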

An alternative could be to allocate a new page and effectively migrate 
from the old (degraded) page to the new page by copying the page content 
from one of the healthy duplicates.

That would keep the #mappings per page in the chain balanced.
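As a rough sketch of that alternative (ksm_remap_all_mappings() is a
made-up placeholder for the rmap walk such a patch would need, and
locking/refcounting is glossed over; the other calls are the usual
helpers):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/mm.h>

/* struct ksm_stable_node is the dup/chain node private to mm/ksm.c. */
static int ksm_replace_degraded_dup(struct ksm_stable_node *dup,
                                    struct page *degraded,
                                    struct page *healthy)
{
        struct page *fresh;
        int err;

        /* Fresh page that takes over the degraded page's mappings. */
        fresh = alloc_page(GFP_HIGHUSER_MOVABLE);
        if (!fresh)
                return -ENOMEM;

        /* Content comes from a still-healthy duplicate, not the bad page. */
        copy_highpage(fresh, healthy);

        /*
         * Re-point every PTE that mapped the degraded page at the fresh
         * copy; the existing duplicates keep their mapping counts, which
         * is what keeps the chain balanced.
         */
        err = ksm_remap_all_mappings(dup, degraded, fresh);
        if (err)
                __free_page(fresh);
        return err;
}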

> 
> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, identify whether the
> failing node is part of a chain (has duplicates) (maybe add a dup_head
> field to struct stable_node to save head_node, avoiding a search of
> the whole stable tree, or some other way to find head_node)
> 2. Search for another healthy duplicate page within the same chain
> 3. For each process mapping the failing page:
> - Update the PTE to point to the healthy duplicate page (maybe reuse
> replace_page?, or split replace_page into smaller functions and use the
> common part)
> - Migrate the rmap_item to the new stable node
> 4. If all migrations succeed, remove the failing node from the chain
> 5. Only kill processes if recovery is impossible or fails
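Spelled out very roughly, the walk in steps 2-5 could look like this.
find_healthy_dup(), ksm_repoint_mapping() and remove_dup_from_chain()
are placeholder names, not existing helpers; ksm_repoint_mapping() would
be the part carved out of replace_page() that step 3 mentions. Locking
and the per-dup mapping counters are glossed over:

static int ksm_recover_hwpoisoned_dup(struct ksm_stable_node *chain,
                                      struct ksm_stable_node *failing,
                                      struct page *bad_page)
{
        struct ksm_rmap_item *rmap_item;
        struct hlist_node *tmp;
        struct ksm_stable_node *healthy;
        struct page *good_page;

        /* Step 2: find another duplicate in the chain that is not poisoned. */
        healthy = find_healthy_dup(chain, failing, &good_page);
        if (!healthy)
                return -EBUSY;          /* step 5: caller kills the mappers */

        /* Step 3: move every mapping of the failing duplicate. */
        hlist_for_each_entry_safe(rmap_item, tmp, &failing->hlist, hlist) {
                /* Re-point the PTE from bad_page to good_page... */
                if (ksm_repoint_mapping(rmap_item, bad_page, good_page))
                        return -EBUSY;  /* partial failure: fall back */
                /* ...and migrate the rmap_item to the healthy stable node. */
                hlist_del(&rmap_item->hlist);
                rmap_item->head = healthy;
                hlist_add_head(&rmap_item->hlist, &healthy->hlist);
        }

        /* Step 4: all mappings moved, drop the failing dup from the chain. */
        remove_dup_from_chain(chain, failing);
        return 0;
}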

Does not sound too crazy.

But how realistic do we consider that in practice? We need quite a bunch 
of processes deduplicating the same page before we end up with duplicates 
in the chain, IIRC.

So isn't this rather an improvement only for less likely scenarios in 
practice?

-- 
Cheers

David / dhildenb

