Message-ID: <bfb56be6-d55e-4dcc-93a3-4c7e6faf790f@lucifer.local>
Date: Thu, 5 Jun 2025 15:04:02 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: Jann Horn <jannh@...gle.com>, Barry Song <21cnbao@...il.com>,
        akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Lokesh Gidra <lokeshgidra@...gle.com>,
        Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

On Thu, Jun 05, 2025 at 11:23:18AM +0800, Qi Zheng wrote:
>
>
> On 6/5/25 1:50 AM, Lorenzo Stoakes wrote:
> > On Wed, Jun 04, 2025 at 02:02:12PM +0800, Qi Zheng wrote:
> > > Hi Lorenzo,
> > >
> > > On 6/3/25 5:54 PM, Lorenzo Stoakes wrote:
> > > > On Tue, Jun 03, 2025 at 03:24:28PM +0800, Qi Zheng wrote:
> > > > > Hi Jann,
> > > > >
> > > > > On 5/30/25 10:06 PM, Jann Horn wrote:
> > > > > > On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
> > > > > > > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > > > > > > frequently than other madvise options, particularly in native and Java
> > > > > > > heaps for dynamic memory management.
> > > > > > >
> > > > > > > Currently, the mmap_lock is always held during these operations, even when
> > > > > > > unnecessary. This causes lock contention and can lead to severe priority
> > > > > > > inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> > > > > > > hold the lock and block higher-priority threads.
> > > > > > >
> > > > > > > This patch enables the use of per-VMA locks when the advised range lies
> > > > > > > entirely within a single VMA, avoiding the need for full VMA traversal. In
> > > > > > > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> > > > > > >
> > > > > > > Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> > > > > > > benefits from this per-VMA lock optimization. After extended runtime,
> > > > > > > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > > > > > > only 1,231 fell back to mmap_lock.
> > > > > > >
> > > > > > > To simplify handling, the implementation falls back to the standard
> > > > > > > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > > > > > > userfaultfd_remove().
> > > > > >
> > > > > > One important quirk of this is that it can, from what I can see, cause
> > > > > > freeing of page tables (through pt_reclaim) without holding the mmap
> > > > > > lock at all:
> > > > > >
> > > > > > do_madvise [behavior=MADV_DONTNEED]
> > > > > >      madvise_lock
> > > > > >        lock_vma_under_rcu
> > > > > >      madvise_do_behavior
> > > > > >        madvise_single_locked_vma
> > > > > >          madvise_vma_behavior
> > > > > >            madvise_dontneed_free
> > > > > >              madvise_dontneed_single_vma
> > > > > >                zap_page_range_single_batched [.reclaim_pt = true]
> > > > > >                  unmap_single_vma
> > > > > >                    unmap_page_range
> > > > > >                      zap_p4d_range
> > > > > >                        zap_pud_range
> > > > > >                          zap_pmd_range
> > > > > >                            zap_pte_range
> > > > > >                              try_get_and_clear_pmd
> > > > > >                              free_pte
> > > > > >
> > > > > > This clashes with the assumption in walk_page_range_novma() that
> > > > > > holding the mmap lock in write mode is sufficient to prevent
> > > > > > concurrent page table freeing, so it can probably lead to page table
> > > > > > UAF through the ptdump interface (see ptdump_walk_pgd()).
> > > > >
> > > > > Maybe not? The PTE page is freed via RCU in zap_pte_range(), so in the
> > > > > following case:
> > > > >
> > > > > cpu 0				cpu 1
> > > > >
> > > > > ptdump_walk_pgd
> > > > > --> walk_pte_range
> > > > >       --> pte_offset_map (hold RCU read lock)
> > > > > 				zap_pte_range
> > > > > 				--> free_pte (via RCU)
> > > > >           walk_pte_range_inner
> > > > >           --> ptdump_pte_entry (the PTE page is not freed at this time)
> > > > >
> > > > > IIUC, there is no UAF issue here?
> > > > >
> > > > > If I missed anything please let me know.
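
(To spell out the pairing being relied upon here - an illustrative
sketch only, pte_free_rcu_cb() is an invented name:)

```c
/*
 * Sketch of why the cpu 0 / cpu 1 interleaving above is safe. The
 * reader side pins the PTE page via the RCU read lock taken inside
 * pte_offset_map(); the freeing side defers the actual free with
 * call_rcu(), so it cannot run until the reader calls pte_unmap().
 */
static void pte_free_rcu_cb(struct rcu_head *head);	/* invented name */

static void reader_side(pmd_t *pmd, unsigned long addr)
{
	pte_t *pte = pte_offset_map(pmd, addr);	/* rcu_read_lock() inside */

	if (pte) {
		/* The PTE page cannot be freed while mapped here. */
		pte_unmap(pte);			/* rcu_read_unlock() inside */
	}
}

static void freeing_side(struct ptdesc *ptdesc)
{
	/* Queued, not executed, while any RCU reader is in flight. */
	call_rcu(&ptdesc->pt_rcu_head, pte_free_rcu_cb);
}
```

So the free on cpu 1 is only queued while cpu 0 holds the read lock,
which is why ptdump_pte_entry() can't see a freed page there.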
> >
> > Seems to me that we don't need the VMA locks then unless I'm missing
> > something? :) Jann?
> >
> > Would this RCU lock acquired by pte_offset_map also save us from the
> > munmap() downgraded read lock scenario? Or is the problem there the
> > intermediate page table teardown, I guess?
> >
>
> Right. Currently, page table pages other than PTE pages are not
> protected by RCU, so the mmap write lock is still needed in the munmap
> path to wait for all readers of the page table pages to exit the
> critical section.
>
> In other words, once all page table pages are protected by RCU, we can
> completely remove them from the protection of the mmap lock.

Interesting - so on reclaim/migrate we are just clearing PTE entries with
the rmap lock, right? Would this lead to a future where we could also
tear down page tables there?

Another point to remember is that when we are tearing down higher-level
page tables in the general case, the logic assumes nothing else can touch
anything... we hold both the rmap lock AND mmap/vma locks at this point.

But I guess if we're RCU-safe, we're safe even from rmap, right?

>
> Here are some of my previous thoughts:
>
> ```
> Another plan
> ============
>
> Currently, page table modifications are protected by page table locks
> (page_table_lock or split pmd/pte locks), but the life cycle of page
> table pages is protected by mmap_lock (and the vma lock). For more
> details, please refer to the recently added
> Documentation/mm/process_addrs.rst file.
>
> Currently we try to free the PTE pages through RCU when
> CONFIG_PT_RECLAIM is turned on. In this case, we will no longer need
> to hold mmap_lock for read/write ops on the PTE pages.
>
> So maybe we can remove the page table from the protection of the mmap
> lock (which is too big), like this:
>
> 1. free all levels of page table pages by RCU, not just PTE pages, but
>    also pmd, pud, etc.
> 2. similar to pte_offset_map/pte_unmap, add
>    [pmd|pud]_offset_map/[pmd|pud]_unmap, and make them all contain
>    rcu_read_lock/rcu_read_unlock, and make them accept failure.
>
> In this way, we no longer need the mmap lock. For readers, such as
> page table walkers, we are already inside an RCU critical section. For
> writers, we only need to hold the page table lock.
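
For point 2, I guess the helpers would look broadly like the below - a
completely hypothetical sketch, neither pmd_offset_map() nor pmd_unmap()
exists today:

```c
/*
 * Hypothetical sketch of the proposed helpers; neither exists in the
 * kernel today. Mirrors the pte_offset_map()/pte_unmap() pattern:
 * enter an RCU read-side critical section, revalidate the upper-level
 * entry, and let the caller handle failure.
 */
static inline pmd_t *pmd_offset_map(pud_t *pud, unsigned long addr)
{
	pud_t pudval;

	rcu_read_lock();
	pudval = READ_ONCE(*pud);
	if (pud_none(pudval) || unlikely(pud_bad(pudval))) {
		rcu_read_unlock();
		return NULL;		/* caller must accept failure */
	}
	/* The PMD page now cannot be freed until pmd_unmap(). */
	return pmd_offset(&pudval, addr);
}

static inline void pmd_unmap(pmd_t *pmd)
{
	rcu_read_unlock();
}
```

The key difference from today being that callers have to tolerate NULL,
just as pte_offset_map() callers already do.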
>
> But there is a difficulty here: the RCU critical section is not allowed
> to sleep, yet a .pmd_entry callback may need to sleep, for example in
> mmu_notifier_invalidate_range_start().
>
> Use SRCU instead? Or use RCU + refcount method? Not sure. But I think
> it's an interesting thing to try.

Thanks for the information, RCU freeing of page tables is something of a
long-term TODO that's been discussed back and forth :) might take a look
myself if somebody else hasn't grabbed it by the time I have a second...

Is it _only_ the mmu notifier sleeping in this scenario? Or are there other
examples?

We could in theory always add another callback .pmd_entry_sleep or
something for this one case and document the requirement...
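
Roughly like the below - entirely hypothetical, no such field exists in
mm_walk_ops today:

```c
/*
 * Hypothetical sketch only - there is no .pmd_entry_sleep in
 * mm_walk_ops today. The idea: walkers whose pmd-level callback might
 * sleep (e.g. via mmu_notifier_invalidate_range_start()) declare it
 * separately, and the core walker would exit the RCU critical section
 * around that one callback, then revalidate the page tables after it
 * returns.
 */
struct mm_walk_ops_sketch {
	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
			 unsigned long next, struct mm_walk *walk);

	/* Like .pmd_entry, but may sleep; called outside RCU. */
	int (*pmd_entry_sleep)(pmd_t *pmd, unsigned long addr,
			       unsigned long next, struct mm_walk *walk);
};
```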

> ```
>
> Thanks!
>
>

Cheers, Lorenzo
