linux-kernel - Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate

Message-ID: <394cb428-c37b-44c7-8367-4f76514a6322@163.com>
Date: Wed, 29 Oct 2025 15:12:45 +0800
From: Long long Xia <xialonglong2025@....com>
To: Miaohe Lin <linmiaohe@...wei.com>
Cc: markus.elfring@....de, nao.horiguchi@...il.com,
 akpm@...ux-foundation.org, wangkefeng.wang@...wei.com, qiuxu.zhuo@...el.com,
 xu.xin16@....com.cn, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Longlong Xia <xialonglong@...inos.cn>, david@...hat.com, lance.yang@...ux.dev
Subject: Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by
 migrating to healthy duplicate

Thanks for the reply.


On 2025/10/29 14:40, Miaohe Lin wrote:
> On 2025/10/28 15:54, Long long Xia wrote:
>> Thanks for the reply.
>>
>> On 2025/10/23 19:54, Miaohe Lin wrote:
>>> On 2025/10/16 18:18, Longlong Xia wrote:
>>>> From: Longlong Xia <xialonglong@...inos.cn>
>>>>
>>>> When a hardware memory error occurs on a KSM page, the current
>>>> behavior is to kill all processes mapping that page. This can
>>>> be overly aggressive when KSM has multiple duplicate pages in
>>>> a chain where other duplicates are still healthy.
>>>>
>>>> This patch introduces a recovery mechanism that attempts to
>>>> migrate mappings from the failing KSM page to a newly
>>>> allocated KSM page or another healthy duplicate already
>>>> present in the same chain, before falling back to the
>>>> process-killing procedure.
>>>>
>>>> The recovery process works as follows:
>>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>>> 3. For each process mapping the failing page:
>>>>      a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>>         KSM page. If successful, migrate the mapping to this new KSM page.
>>>>      b. If allocation fails, migrate the mapping to the existing healthy
>>>>         duplicate KSM page.
>>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>>      error) does the kernel fall back to killing the affected processes.
>>>>
>>>> Signed-off-by: Longlong Xia <xialonglong@...inos.cn>
>>> Thanks for your patch. Some comments below.
>>>
>>>> ---
>>>>    mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 246 insertions(+)
>>>>
>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>> index 160787bb121c..9099bad1ab35 100644
>>>> --- a/mm/ksm.c
>>>> +++ b/mm/ksm.c
>>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>>    }
>>>>      #ifdef CONFIG_MEMORY_FAILURE
>>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>>> +{
>>>> +    struct ksm_stable_node *stable_node, *dup;
>>>> +    struct rb_node *node;
>>>> +    int nid;
>>>> +
>>>> +    if (!is_stable_node_dup(dup_node))
>>>> +        return NULL;
>>>> +
>>>> +    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>>> +        node = rb_first(root_stable_tree + nid);
>>>> +        for (; node; node = rb_next(node)) {
>>>> +            stable_node = rb_entry(node,
>>>> +                    struct ksm_stable_node,
>>>> +                    node);
>>>> +
>>>> +            if (!is_stable_node_chain(stable_node))
>>>> +                continue;
>>>> +
>>>> +            hlist_for_each_entry(dup, &stable_node->hlist,
>>>> +                    hlist_dup) {
>>>> +                if (dup == dup_node)
>>>> +                    return stable_node;
>>>> +            }
May I add cond_resched(); here?
>>>> +        }
>>>> +    }
>>> Would above multiple loops take a long time in some corner cases?
>> Thanks for the concern.
>>
>> I ran some simple tests.
>>
>> Test 1: 10 Virtual Machines (Real-world Scenario)
>> Environment: 10 VMs (256MB each) with KSM enabled
>>
>> KSM State:
>> pages_sharing: 262,802 (≈1GB)
>> pages_shared: 17,374 (≈68MB)
>> pages_unshared: 124,057 (≈485MB)
>> total ≈1.5GB
>> chain_count: 9, not_chain_count: 17,152
>> Red-black tree nodes to traverse:
>> 17,161 (9 chains + 17,152 non-chains)
>>
>> Performance:
>> find_chain: 898 μs (0.9 ms)
>> collect_procs_ksm: 4,409 μs (4.4 ms)
>> Total memory failure handling: 6,135 μs (6.1 ms)
>>
>>
>> Test 2: 10GB Single Process (Extreme Case)
>> Environment: Single process with 10GB memory,
>> 1,310,720 page pairs (each pair identical, different from others)
>>
>> KSM State:
>> pages_sharing: 1,311,740 (≈5GB)
>> pages_shared: 1,310,724 (≈5GB)
>> pages_unshared: 0
>> total ≈10GB
>> Red-black tree nodes to traverse:
>> 1,310,721 (1 chain + 1,310,720 non-chains)
>>
>> Performance:
>> find_chain: 28,822 μs (28.8 ms)
>> collect_procs_ksm: 45,944 μs (45.9 ms)
>> Total memory failure handling: 46,594 μs (46.6 ms)
> Thanks for your test.
>
>> Summary:
>> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
>> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs).
>> In Test 2, find_chain accounts for 62% of the total memory failure handling time (46.6 ms).
>> However, since memory failures are rare events, this latency may be acceptable:
>> it does not impact normal system performance and only affects error recovery paths.
>>
> IMHO, a kernel function must not run for too long without any scheduling points.
> Otherwise it may affect the normal scheduling of the system and lead to something like performance
> fluctuation. Or am I missing something?
>
> Thanks.
> .

I will add cond_resched() inside the red-black tree loop to allow
scheduling in find_chain(). Would that be enough?
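
To make the placement concrete, here is a minimal userspace sketch of the
idea, not the actual patch. The types struct tree_node and struct dup_node,
the counter resched_points, and the helper cond_resched_stub() are all
hypothetical stand-ins (the stub calls sched_yield() where the kernel code
would call cond_resched()). One assumption worth noting: the scheduling point
sits at the end of the outer loop body so that non-chain nodes reach it too,
since in the tests above almost every tree node is a non-chain and a
"continue" before the call would skip it.

```c
#include <sched.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's real types. */
struct dup_node {
    struct dup_node *next;   /* next duplicate in the same chain */
};

struct tree_node {
    struct tree_node *next;  /* models the rb_next() traversal order */
    int is_chain;            /* models is_stable_node_chain() */
    struct dup_node *dups;   /* models the chain's hlist of duplicates */
};

/* Counts scheduling points so the behaviour can be observed. */
static unsigned long resched_points;

/* Hypothetical stand-in for cond_resched(): count the call and yield. */
static void cond_resched_stub(void)
{
    resched_points++;
    sched_yield();
}

/* Walk every tree node; return the chain that contains target, or NULL. */
static struct tree_node *find_chain_head_sketch(struct tree_node *first,
                                                const struct dup_node *target)
{
    struct tree_node *node;
    struct dup_node *dup;

    for (node = first; node; node = node->next) {
        if (node->is_chain) {
            for (dup = node->dups; dup; dup = dup->next) {
                if (dup == target)
                    return node;
            }
        }
        /* One scheduling point per tree node visited without a match. */
        cond_resched_stub();
    }
    return NULL;
}
```

With this placement the number of scheduling points grows with the number of
tree nodes visited, so the longest stretch without one is bounded by the
length of a single chain's duplicate list.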

Best regards,
Longlong Xia