Message-ID: <3cb53060-9769-43f4-996d-355189df107d@bytedance.com>
Date: Wed, 4 Jun 2025 14:02:12 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Jann Horn <jannh@...gle.com>, Barry Song <21cnbao@...il.com>,
 akpm@...ux-foundation.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 Barry Song <v-songbaohua@...o.com>, "Liam R. Howlett"
 <Liam.Howlett@...cle.com>, David Hildenbrand <david@...hat.com>,
 Vlastimil Babka <vbabka@...e.cz>, Suren Baghdasaryan <surenb@...gle.com>,
 Lokesh Gidra <lokeshgidra@...gle.com>,
 Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

Hi Lorenzo,

On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
> On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
>> Hi Jann,
>>
>> On 5/30/25 10:06 PM, Jann Horn wrote:
>>> On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
>>>> Certain madvise operations, especially MADV_DONTNEED, occur far more
>>>> frequently than other madvise options, particularly in native and Java
>>>> heaps for dynamic memory management.
>>>>
>>>> Currently, the mmap_lock is always held during these operations, even when
>>>> unnecessary. This causes lock contention and can lead to severe priority
>>>> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
>>>> hold the lock and block higher-priority threads.
>>>>
>>>> This patch enables the use of per-VMA locks when the advised range lies
>>>> entirely within a single VMA, avoiding the need for full VMA traversal. In
>>>> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>>>>
>>>> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
>>>> benefits from this per-VMA lock optimization. After extended runtime,
>>>> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
>>>> only 1,231 fell back to mmap_lock.
>>>>
>>>> To simplify handling, the implementation falls back to the standard
>>>> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
>>>> userfaultfd_remove().
>>>
>>> One important quirk of this is that it can, from what I can see, cause
>>> freeing of page tables (through pt_reclaim) without holding the mmap
>>> lock at all:
>>>
>>> do_madvise [behavior=MADV_DONTNEED]
>>>     madvise_lock
>>>       lock_vma_under_rcu
>>>     madvise_do_behavior
>>>       madvise_single_locked_vma
>>>         madvise_vma_behavior
>>>           madvise_dontneed_free
>>>             madvise_dontneed_single_vma
>>>               zap_page_range_single_batched [.reclaim_pt = true]
>>>                 unmap_single_vma
>>>                   unmap_page_range
>>>                     zap_p4d_range
>>>                       zap_pud_range
>>>                         zap_pmd_range
>>>                           zap_pte_range
>>>                             try_get_and_clear_pmd
>>>                             free_pte
>>>
>>> This clashes with the assumption in walk_page_range_novma() that
>>> holding the mmap lock in write mode is sufficient to prevent
>>> concurrent page table freeing, so it can probably lead to page table
>>> UAF through the ptdump interface (see ptdump_walk_pgd()).
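For context, the assumption being referred to looks roughly like this on the walker side; walk_page_range_novma() and the mmap_write_lock()/mmap_write_unlock() pair are real interfaces, but dump_ops and the wrapper are placeholders for this sketch:

```c
/*
 * Walker-side sketch (placeholder ops and wrapper): the novma walk has
 * historically relied on mmap_lock held for write being enough to keep
 * all page tables of @mm from being freed underneath it.
 */
static const struct mm_walk_ops dump_ops = {
	/* .pte_entry etc. omitted in this sketch */
};

static void dump_mm_range(struct mm_struct *mm, unsigned long start,
			  unsigned long end)
{
	mmap_write_lock(mm);
	walk_page_range_novma(mm, start, end, &dump_ops, NULL, NULL);
	mmap_write_unlock(mm);
}
```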
>>
>> Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
>> following case:
>>
>> cpu 0				cpu 1
>>
>> ptdump_walk_pgd
>> --> walk_pte_range
>>      --> pte_offset_map (hold RCU read lock)
>> 				zap_pte_range
>> 				--> free_pte (via RCU)
>>          walk_pte_range_inner
>>          --> ptdump_pte_entry (the PTE page is not freed at this time)
>>
>> IIUC, there is no UAF issue here?
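To make the reader side of that diagram concrete: pte_offset_map() enters an RCU read-side critical section, and a PTE table freed through the RCU path is only handed back to the allocator after a grace period. A minimal reader sketch (the function name is made up):

```c
/*
 * Reader-side sketch: entries read here may already be stale or empty if
 * we race with a concurrent zap, but the PTE page itself cannot be freed
 * and reused before pte_unmap() ends the RCU read-side critical section.
 */
static void report_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
{
	pte_t *start_pte, *pte;

	start_pte = pte_offset_map(pmd, addr);
	if (!start_pte)
		return;			/* the PTE table is already gone */

	for (pte = start_pte; addr < end; addr += PAGE_SIZE, pte++) {
		pte_t entry = ptep_get(pte);

		if (pte_none(entry))
			continue;	/* possibly zapped concurrently */
		/* report the mapping here */
	}
	pte_unmap(start_pte);
}
```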
>>
>> If I missed anything please let me know.
>>
>> Thanks,
>> Qi
>>
>>
> 
> I forgot about that interesting placement of RCU lock acquisition :) I will
> obviously let Jann come back to you on this, but I wonder if I need to
> update the doc to reflect this actually.

I saw that there is already a relevant description in process_addrs.rst:


```
So accessing PTE-level page tables requires at least holding an RCU read lock;
   but that only suffices for readers that can tolerate racing with concurrent
   page table updates such that an empty PTE is observed (in a page table that
   has actually already been detached and marked for RCU freeing) while another
   new page table has been installed in the same location and filled with
   entries. Writers normally need to take the PTE lock and revalidate that the
   PMD entry still refers to the same PTE-level page table.
   If the writer does not care whether it is the same PTE-level page table, it
   can take the PMD lock and revalidate that the contents of pmd entry still meet
   the requirements. In particular, this also happens in :c:func:`!retract_page_tables`
   when handling :c:macro:`!MADV_COLLAPSE`.
```
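The writer-side rule quoted above is essentially the pte_offset_map_lock() pattern: map the table, take the PTE lock, and let the helper revalidate that the PMD entry still points at the same PTE table (it returns NULL if the table was detached). A minimal sketch, with a made-up caller:

```c
/*
 * Writer-side sketch of the rule quoted from process_addrs.rst.
 * pte_offset_map_lock() revalidates the PMD entry under the PTE lock and
 * returns NULL if the PTE table has been detached; the caller must then
 * retry or give up rather than touch freed memory.
 */
static bool touch_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return false;	/* table went away; caller decides what to do */

	/* Safe to modify entries in this PTE table under the PTE lock here. */

	pte_unmap_unlock(pte, ptl);
	return true;
}
```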

Thanks!


