Message-ID: <8c4d8ebe-885e-40f0-a10e-7290067c7b96@redhat.com>
Date: Thu, 9 Oct 2025 20:57:21 +0200
From: David Hildenbrand <david@...hat.com>
To: Longlong Xia <xialonglong2025@....com>, linmiaohe@...wei.com,
nao.horiguchi@...il.com
Cc: akpm@...ux-foundation.org, wangkefeng.wang@...wei.com,
xu.xin16@....com.cn, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Longlong Xia <xialonglong@...inos.cn>
Subject: Re: [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory
failures
On 09.10.25 09:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@...inos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing page to another healthy duplicate within
> the same chain before resorting to killing processes.
An alternative could be to allocate a new page and effectively migrate
from the old (degraded) page to the new page by copying page content
from one of the healthy duplicates.
That would keep the #mappings per page in the chain balanced.
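Very roughly, something like this (pure sketch, not against any tree;
find_healthy_dup() and repoint_mappings() are made-up placeholders):

static struct page *ksm_copy_from_healthy_dup(struct page *poisoned,
					      struct stable_node *chain)
{
	struct page *healthy, *new;

	/* Pick any duplicate in the chain that is not the poisoned page. */
	healthy = find_healthy_dup(chain, poisoned);	/* placeholder */
	if (!healthy)
		return NULL;

	new = alloc_page(GFP_HIGHUSER_MOVABLE);
	if (!new)
		return NULL;

	/* Copy content from the healthy duplicate, never from the poisoned page. */
	copy_highpage(new, healthy);

	/* replace_page()-style switch of all PTEs of the poisoned page. */
	if (repoint_mappings(poisoned, new)) {		/* placeholder */
		__free_page(new);
		return NULL;
	}
	return new;
}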
>
> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, identify whether the
> failing node is part of a chain (has duplicates) (maybe add a dup_head
> item to struct stable_node to save the head_node, avoiding a search of
> the whole stable tree, or find the head_node some other way)
> 2. Search for another healthy duplicate page within the same chain
> 3. For each process mapping the failing page:
> - Update the PTE to point to the healthy duplicate page (maybe reuse
> replace_page?, or split replace_page into smaller functions and reuse
> the common part)
> - Migrate the rmap_item to the new stable node
> 4. If all migrations succeed, remove the failing node from the chain
> 5. Only kill processes if recovery is impossible or fails
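If I read the steps above correctly, the flow would be something like the
following (pure pseudocode, every helper name below is invented):

static bool ksm_try_recover(struct page *poisoned)
{
	struct stable_node *head, *healthy_dup;
	struct rmap_item *rmap_item;

	/* Step 1: find the chain head (dup_head shortcut or stable tree walk). */
	head = find_chain_head(poisoned);
	if (!head)
		return false;

	/* Step 2: pick another, still healthy duplicate from the chain. */
	healthy_dup = find_healthy_dup(head, poisoned);
	if (!healthy_dup)
		return false;

	/* Step 3: repoint every mapping and migrate its rmap_item. */
	for_each_rmap_item(rmap_item, poisoned) {
		if (repoint_mapping(rmap_item, poisoned, healthy_dup))
			return false;	/* Step 5: caller falls back to killing. */
		move_rmap_item(rmap_item, healthy_dup);
	}

	/* Step 4: drop the failing duplicate from the chain. */
	remove_dup_from_chain(head, poisoned);
	return true;
}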
Does not sound too crazy.
But how realistic do we consider that in practice? We need quite a bunch
of processes deduplicating the same page before we end up with duplicates
in the chain IIRC.
So isn't this rather an improvement only for less likely scenarios in
practice?
--
Cheers
David / dhildenb