linux-kernel - Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate

Message-ID: <64953691-ce0e-71d9-1ab3-7f8cefd6ace4@huawei.com>
Date: Thu, 30 Oct 2025 10:56:39 +0800
From: Miaohe Lin <linmiaohe@...wei.com>
To: Long long Xia <xialonglong2025@....com>
CC: <markus.elfring@....de>, <nao.horiguchi@...il.com>,
	<akpm@...ux-foundation.org>, <wangkefeng.wang@...wei.com>,
	<qiuxu.zhuo@...el.com>, <xu.xin16@....com.cn>,
	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>, Longlong Xia
	<xialonglong@...inos.cn>, <david@...hat.com>, <lance.yang@...ux.dev>
Subject: Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by
 migrating to healthy duplicate

On 2025/10/29 15:12, Long long Xia wrote:
> Thanks for the reply.
> 
> 
> On 2025/10/29 14:40, Miaohe Lin wrote:
>> On 2025/10/28 15:54, Long long Xia wrote:
>>> Thanks for the reply.
>>>
>>> On 2025/10/23 19:54, Miaohe Lin wrote:
>>>> On 2025/10/16 18:18, Longlong Xia wrote:
>>>>> From: Longlong Xia <xialonglong@...inos.cn>
>>>>>
>>>>> When a hardware memory error occurs on a KSM page, the current
>>>>> behavior is to kill all processes mapping that page. This can
>>>>> be overly aggressive when KSM has multiple duplicate pages in
>>>>> a chain where other duplicates are still healthy.
>>>>>
>>>>> This patch introduces a recovery mechanism that attempts to
>>>>> migrate mappings from the failing KSM page to a newly
>>>>> allocated KSM page or another healthy duplicate already
>>>>> present in the same chain, before falling back to the
>>>>> process-killing procedure.
>>>>>
>>>>> The recovery process works as follows:
>>>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>>>> 3. For each process mapping the failing page:
>>>>>      a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>>>         KSM page. If successful, migrate the mapping to this new KSM page.
>>>>>      b. If allocation fails, migrate the mapping to the existing healthy
>>>>>         duplicate KSM page.
>>>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>>>      error) does the kernel fall back to killing the affected processes.
>>>>>
>>>>> Signed-off-by: Longlong Xia <xialonglong@...inos.cn>
>>>> Thanks for your patch. Some comments below.
>>>>
>>>>> ---
>>>>>    mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>    1 file changed, 246 insertions(+)
>>>>>
>>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>>> index 160787bb121c..9099bad1ab35 100644
>>>>> --- a/mm/ksm.c
>>>>> +++ b/mm/ksm.c
>>>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>>>    }
>>>>>      #ifdef CONFIG_MEMORY_FAILURE
>>>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>>>> +{
>>>>> +    struct ksm_stable_node *stable_node, *dup;
>>>>> +    struct rb_node *node;
>>>>> +    int nid;
>>>>> +
>>>>> +    if (!is_stable_node_dup(dup_node))
>>>>> +        return NULL;
>>>>> +
>>>>> +    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>>>> +        node = rb_first(root_stable_tree + nid);
>>>>> +        for (; node; node = rb_next(node)) {
>>>>> +            stable_node = rb_entry(node,
>>>>> +                    struct ksm_stable_node,
>>>>> +                    node);
>>>>> +
>>>>> +            if (!is_stable_node_chain(stable_node))
>>>>> +                continue;
>>>>> +
>>>>> +            hlist_for_each_entry(dup, &stable_node->hlist,
>>>>> +                    hlist_dup) {
>>>>> +                if (dup == dup_node)
>>>>> +                    return stable_node;
>>>>> +            }
> May I add cond_resched(); here?
>>>>> +        }
>>>>> +    }
>>>> Would above multiple loops take a long time in some corner cases?
>>> Thanks for the concern.
>>>
>>> I ran some simple tests.
>>>
>>> Test 1: 10 Virtual Machines (Real-world Scenario)
>>> Environment: 10 VMs (256MB each) with KSM enabled
>>>
>>> KSM State:
>>> pages_sharing: 262,802 (≈1GB)
>>> pages_shared: 17,374 (≈68MB)
>>> pages_unshared: 124,057 (≈485MB)
>>> total ≈1.5GB
>>> chain_count = 9, not_chain_count = 17152
>>> Red-black tree nodes to traverse:
>>> 17,161 (9 chains + 17,152 non-chains)
>>>
>>> Performance:
>>> find_chain: 898 μs (0.9 ms)
>>> collect_procs_ksm: 4,409 μs (4.4 ms)
>>> Total memory failure handling: 6,135 μs (6.1 ms)
>>>
>>>
>>> Test 2: 10GB Single Process (Extreme Case)
>>> Environment: Single process with 10GB memory,
>>> 1,310,720 page pairs (each pair identical, different from others)
>>>
>>> KSM State:
>>> pages_sharing: 1,311,740 (≈5GB)
>>> pages_shared: 1,310,724 (≈5GB)
>>> pages_unshared: 0
>>> total ≈10GB
>>> Red-black tree nodes to traverse:
>>> 1,310,721 (1 chain + 1,310,720 non-chains)
>>>
>>> Performance:
>>> find_chain: 28,822 μs (28.8 ms)
>>> collect_procs_ksm: 45,944 μs (45.9 ms)
>>> Total memory failure handling: 46,594 μs (46.6 ms)
>> Thanks for your test.
>>
>>> Summary:
>>> The find_chain function scales roughly linearly with the number of red-black tree nodes.
>>> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
>>> and in Test 2 find_chain accounts for 62% of the total memory failure handling time (46.6 ms).
>>> However, since memory failures are rare events, this latency may be acceptable:
>>> it does not impact normal system performance and only affects error recovery paths.
>>>
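The ratios in the summary above can be rechecked from the raw measurements. The following is a minimal sketch, not part of the patch: the struct and function names are made up for illustration, and only the numbers are taken from the two tests quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Measurements copied from Test 1 and Test 2 above. */
struct ksm_sample {
    double rb_nodes;      /* stable-tree nodes traversed */
    double find_chain_us; /* time spent in find_chain, in microseconds */
    double total_us;      /* total memory failure handling time */
};

const struct ksm_sample test1 = { 17161.0, 898.0, 6135.0 };
const struct ksm_sample test2 = { 1310721.0, 28822.0, 46594.0 };

/* Plain quotient, used for growth factors and time shares. */
double ratio(double a, double b)
{
    assert(b != 0.0);
    return a / b;
}

/* Average find_chain cost per traversed tree node, in nanoseconds. */
double ns_per_node(const struct ksm_sample *s)
{
    return ratio(s->find_chain_us * 1000.0, s->rb_nodes);
}

void print_summary(void)
{
    printf("node growth:      %.1fx\n", ratio(test2.rb_nodes, test1.rb_nodes));
    printf("latency growth:   %.1fx\n",
           ratio(test2.find_chain_us, test1.find_chain_us));
    printf("find_chain share: %.0f%% (test 2)\n",
           100.0 * ratio(test2.find_chain_us, test2.total_us));
    printf("cost per node:    %.1f ns (test 1), %.1f ns (test 2)\n",
           ns_per_node(&test1), ns_per_node(&test2));
}
```

Calling print_summary() from a small main() reproduces the 76x, 32x and 62% figures. The per-node cost comes out lower in Test 2 than in Test 1, so the measured growth is somewhat below strictly linear.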
>> IMHO, a kernel function must not run for too long without any scheduling points.
>> Otherwise it may affect the normal scheduling of the system and lead to something like performance
>> fluctuation. Or am I missing something?
>>
>> Thanks.
>> .
> 
> I will add cond_resched() in the red-black tree loop to allow scheduling in find_chain(). Maybe that is enough?

That looks workable to me.

Thanks.
.
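For illustration, the shape of the change discussed above (a scheduling point in the outer tree loop, outside the inner walk over duplicates) can be sketched in userspace C. This is only an analogue under stated assumptions: the types, the flat array standing in for the red-black tree, the YIELD_INTERVAL batching and the yield_points counter are all hypothetical, and sched_yield() merely stands in for the kernel's cond_resched(). In the kernel, cond_resched() may sleep, so whether the real call site allows it depends on the locks held there, which this sketch does not model.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's KSM structures. */
struct dup_node {
    struct dup_node *next;  /* next duplicate in the same chain */
};

struct tree_node {
    struct dup_node *dups;  /* non-NULL only for chain heads */
};

/* Assumed batch size: yield once per this many tree nodes. */
#define YIELD_INTERVAL 1024

/* Counts how often the loop offered to give up the CPU. */
unsigned long yield_points;

/*
 * Linear search over all tree nodes for the chain containing target.
 * The yield sits between tree nodes, so the inner duplicate walk is
 * never interrupted halfway through a chain.
 */
struct tree_node *find_chain_head_sketch(struct tree_node *nodes, size_t n,
                                         const struct dup_node *target)
{
    assert(nodes != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        for (struct dup_node *d = nodes[i].dups; d; d = d->next) {
            if (d == target)
                return &nodes[i];
        }
        if ((i + 1) % YIELD_INTERVAL == 0) {
            sched_yield();  /* userspace stand-in for cond_resched() */
            yield_points++;
        }
    }
    return NULL;
}
```

The batching exists only because sched_yield() always enters the scheduler. In kernel code cond_resched() is normally called on every iteration, since it reschedules only when a reschedule is pending.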