linux-kernel - Re: [PATCH] mm: Fix the race between collapse and PT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d73213ba-5ca0-451c-b82b-f590d10da6d7@linux.alibaba.com>
Date: Tue, 5 Aug 2025 15:53:48 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Qi Zheng <zhengqi.arch@...edance.com>, Barry Song <21cnbao@...il.com>,
 akpm@...ux-foundation.org, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
 "Lai, Yi" <yi1.lai@...ux.intel.com>, David Hildenbrand <david@...hat.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 Suren Baghdasaryan <surenb@...gle.com>, Lokesh Gidra
 <lokeshgidra@...gle.com>, Tangquan Zheng <zhengtangquan@...o.com>,
 Lance Yang <ioworker0@...il.com>, Zi Yan <ziy@...dia.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>, Nico Pache
 <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
 Dev Jain <dev.jain@....com>
Subject: Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under
 per-vma lock



On 2025/8/5 14:42, Qi Zheng wrote:
> Hi Barry,
> 
> On 8/5/25 11:54 AM, Barry Song wrote:
>> From: Barry Song <v-songbaohua@...o.com>
>>
>> The check_pmd_still_valid() call during collapse is currently only
>> protected by the mmap_lock in write mode, which was sufficient when
>> pt_reclaim always ran under mmap_lock in read mode. However, since
>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>> is no longer valid. As a result, a race condition can occur between
>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
> 
> There is indeed a race condition here. And after applying this patch, I
> can no longer reproduce the problem locally (I was able to reproduce it
> stably locally last night).
> 
> But I still can't figure out how this race condtion causes the
> following panic:
> 
> exit_mmap
> --> mmap_read_lock()
>      unmap_vmas()
>      --> pte_offset_map_lock
>          --> rcu_read_lock()
>              check if the pmd entry is a PTE page
>              ptl = pte_lockptr(mm, &pmdval)  <-- ptl is NULL
>              spin_lock(ptl)                  <-- PANIC!!
> 
> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can not 
> be NULL.
> 
> The collapse holds mmap write lock, so it is impossible to be concurrent
> with exit_mmap().
> 
> Confusing. :(

IIUC, the issue is not caused by the concurrency between exit_mmap and 
collapse, but rather by the concurrency between pt_reclaim and collapse.

Before this patch, khugepaged might incorrectly restore a PTE pagetable 
that had already been freed.

pt_reclaim has cleared the pmd entry and freed the PTE page table. 
However, due to the race condition, check_pmd_still_valid() still passes 
and continues to attempt the collapse:

_pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd 
entry (the original pmd entry has been cleared)

pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns 
pte == NULL

Then khugepaged will restore the old PTE pagetable with an invalid pmd 
entry:

pmd_populate(mm, pmd, pmd_pgtable(_pmd));

So when the process exits and trys to free the mapping of the process, 
traversing the invalid pmd table will lead to a crash.

Barry, please correct me if I have misunderstood something.

>>   [   38.151897] Oops: general protection fault, probably for non- 
>> canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
>>   [   38.153519] KASAN: null-ptr-deref in range 
>> [0x0000000000000018-0x000000000000001f]
>>   [   38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 
>> 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
>>   [   38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>> 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
>>   [   38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
>>   [   38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 
>> 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
>>   [   38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
>>   [   38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 
>> 1ffffffff0dde60c
>>   [   38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: 
>> dffffc0000000003
>>   [   38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 
>> 0000000000000000
>>   [   38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 
>> 0000000000000018
>>   [   38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 
>> 0000000000000000
>>   [   38.166100] FS:  0000000000000000(0000) GS:ffff8880e3b40000(0000) 
>> knlGS:0000000000000000
>>   [   38.167137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [   38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 
>> 0000000000770ef0
>>   [   38.168812] PKRU: 55555554
>>   [   38.169275] Call Trace:
>>   [   38.169647]  <TASK>
>>   [   38.169975]  ? __kasan_check_byte+0x19/0x50
>>   [   38.170581]  lock_acquire+0xea/0x310
>>   [   38.171083]  ? rcu_is_watching+0x19/0xc0
>>   [   38.171615]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>>   [   38.172343]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
>>   [   38.173130]  _raw_spin_lock+0x38/0x50
>>   [   38.173707]  ? __pte_offset_map_lock+0x1a2/0x3c0
>>   [   38.174390]  __pte_offset_map_lock+0x1a2/0x3c0
>>   [   38.174987]  ? __pfx___pte_offset_map_lock+0x10/0x10
>>   [   38.175724]  ? __pfx_pud_val+0x10/0x10
>>   [   38.176308]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
>>   [   38.177183]  unmap_page_range+0xb60/0x43e0
>>   [   38.177824]  ? __pfx_unmap_page_range+0x10/0x10
>>   [   38.178485]  ? mas_next_slot+0x133a/0x1a50
>>   [   38.179079]  unmap_single_vma.constprop.0+0x15b/0x250
>>   [   38.179830]  unmap_vmas+0x1fa/0x460
>>   [   38.180373]  ? __pfx_unmap_vmas+0x10/0x10
>>   [   38.180994]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
>>   [   38.181877]  exit_mmap+0x1a2/0xb40
>>   [   38.182396]  ? lock_release+0x14f/0x2c0
>>   [   38.182929]  ? __pfx_exit_mmap+0x10/0x10
>>   [   38.183474]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
>>   [   38.184188]  ? mutex_unlock+0x16/0x20
>>   [   38.184704]  mmput+0x132/0x370
>>   [   38.185208]  do_exit+0x7e7/0x28c0
>>   [   38.185682]  ? __this_cpu_preempt_check+0x21/0x30
>>   [   38.186328]  ? do_group_exit+0x1d8/0x2c0
>>   [   38.186873]  ? __pfx_do_exit+0x10/0x10
>>   [   38.187401]  ? __this_cpu_preempt_check+0x21/0x30
>>   [   38.188036]  ? _raw_spin_unlock_irq+0x2c/0x60
>>   [   38.188634]  ? lockdep_hardirqs_on+0x89/0x110
>>   [   38.189313]  do_group_exit+0xe4/0x2c0
>>   [   38.189831]  __x64_sys_exit_group+0x4d/0x60
>>   [   38.190413]  x64_sys_call+0x2174/0x2180
>>   [   38.190935]  do_syscall_64+0x6d/0x2e0
>>   [   38.191449]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> This patch moves the vma_start_write() call to precede
>> check_pmd_still_valid(), ensuring that the check is also properly
>> protected by the per-VMA lock.
>>
>> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
>> Tested-by: "Lai, Yi" <yi1.lai@...ux.intel.com>
>> Reported-by: "Lai, Yi" <yi1.lai@...ux.intel.com>
>> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
>> Cc: David Hildenbrand <david@...hat.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>> Cc: Qi Zheng <zhengqi.arch@...edance.com>
>> Cc: Vlastimil Babka <vbabka@...e.cz>
>> Cc: Jann Horn <jannh@...gle.com>
>> Cc: Suren Baghdasaryan <surenb@...gle.com>
>> Cc: Lokesh Gidra <lokeshgidra@...gle.com>
>> Cc: Tangquan Zheng <zhengtangquan@...o.com>
>> Cc: Lance Yang <ioworker0@...il.com>
>> Cc: Zi Yan <ziy@...dia.com>
>> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
>> Cc: Liam R. Howlett <Liam.Howlett@...cle.com>
>> Cc: Nico Pache <npache@...hat.com>
>> Cc: Ryan Roberts <ryan.roberts@....com>
>> Cc: Dev Jain <dev.jain@....com>
>> Signed-off-by: Barry Song <v-songbaohua@...o.com>
>> ---
>>   mm/khugepaged.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 374a6a5193a7..6b40bdfd224c 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct 
>> *mm, unsigned long address,
>>       if (result != SCAN_SUCCEED)
>>           goto out_up_write;
>>       /* check if the pmd is still valid */
>> +    vma_start_write(vma);
>>       result = check_pmd_still_valid(mm, address, pmd);
>>       if (result != SCAN_SUCCEED)
>>           goto out_up_write;
>> -    vma_start_write(vma);
>>       anon_vma_lock_write(vma->anon_vma);
>>       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,