Message-ID: <CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com>
Date: Fri, 30 May 2025 16:06:30 +0200
From: Jann Horn <jannh@...gle.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>, 
	"Liam R. Howlett" <Liam.Howlett@...cle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Suren Baghdasaryan <surenb@...gle.com>, Lokesh Gidra <lokeshgidra@...gle.com>, 
	Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC v2] mm: use per_vma lock for MADV_DONTNEED

On Fri, May 30, 2025 at 12:44 PM Barry Song <21cnbao@...il.com> wrote:
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.
>
> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.
>
> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>
> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.
>
> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().

One important quirk of this is that it can, from what I can see, cause
freeing of page tables (through pt_reclaim) without holding the mmap
lock at all:

do_madvise [behavior=MADV_DONTNEED]
  madvise_lock
    lock_vma_under_rcu
  madvise_do_behavior
    madvise_single_locked_vma
      madvise_vma_behavior
        madvise_dontneed_free
          madvise_dontneed_single_vma
            zap_page_range_single_batched [.reclaim_pt = true]
              unmap_single_vma
                unmap_page_range
                  zap_p4d_range
                    zap_pud_range
                      zap_pmd_range
                        zap_pte_range
                          try_get_and_clear_pmd
                          free_pte

This clashes with the assumption in walk_page_range_novma() that
holding the mmap lock in write mode is sufficient to prevent
concurrent page table freeing, so it can probably lead to page table
UAF through the ptdump interface (see ptdump_walk_pgd()).
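
(For reference, the check that encodes this assumption in
walk_page_range_novma() looks roughly like the following; paraphrased,
the exact code differs between kernel versions:)

  /* For user virtual address space walks, the walker only asserts that
   * the mmap lock is held for writing and assumes that this alone keeps
   * VMAs and page tables from being freed underneath it. */
  if (mm == &init_mm)
    mmap_assert_locked(walk.mm);
  else
    mmap_assert_write_locked(walk.mm);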

I think before this patch can land, you'll have to introduce some new
helper like:

void mmap_write_lock_with_all_vmas(struct mm_struct *mm)
{
  struct vm_area_struct *vma;
  VMA_ITERATOR(vmi, mm, 0);

  mmap_write_lock(mm);
  for_each_vma(vmi, vma)
    vma_start_write(vma);
}

and use that in walk_page_range_novma() for user virtual address space
walks, and update the comment in there.
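
For illustration only, the ptdump call site could then look something
like this (abridged from memory, not a real patch; whether the VMA
write-locking ends up in the caller or inside walk_page_range_novma()
itself is up to you):

void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
{
  const struct ptdump_range *range = st->range;

  /* write-locks the mm and every VMA in it, so per-VMA-locked paths
   * like the MADV_DONTNEED fast path can't free page tables while we
   * walk them */
  mmap_write_lock_with_all_vmas(mm);
  while (range->start != range->end) {
    walk_page_range_novma(mm, range->start, range->end,
                          &ptdump_ops, pgd, st);
    range++;
  }
  mmap_write_unlock(mm);
}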

> Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Jann Horn <jannh@...gle.com>
> Cc: Suren Baghdasaryan <surenb@...gle.com>
> Cc: Lokesh Gidra <lokeshgidra@...gle.com>
> Cc: Tangquan Zheng <zhengtangquan@...o.com>
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
[...]
> +static void madvise_unlock(struct mm_struct *mm,
> +               struct madvise_behavior *madv_behavior)
> +{
> +       if (madv_behavior->vma)
> +               vma_end_read(madv_behavior->vma);

Please set madv_behavior->vma to NULL here, so that if madvise_lock()
was called on madv_behavior again and decided to take the mmap lock
that time, the next madvise_unlock() wouldn't take the wrong branch
here.

> +       else
> +               __madvise_unlock(mm, madv_behavior->behavior);
> +}
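
I.e. something like this (sketch only, untested, on top of the quoted
hunk above):

static void madvise_unlock(struct mm_struct *mm,
                struct madvise_behavior *madv_behavior)
{
        if (madv_behavior->vma) {
                vma_end_read(madv_behavior->vma);
                /* forget the VMA so that a later mmap-locked cycle on
                 * the same madv_behavior doesn't take this branch again */
                madv_behavior->vma = NULL;
        } else {
                __madvise_unlock(mm, madv_behavior->behavior);
        }
}
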
> +
>  static bool madvise_batch_tlb_flush(int behavior)
>  {
>         switch (behavior) {
> @@ -1714,19 +1770,24 @@ static int madvise_do_behavior(struct mm_struct *mm,
>                 unsigned long start, size_t len_in,
>                 struct madvise_behavior *madv_behavior)
>  {
> +       struct vm_area_struct *vma = madv_behavior->vma;
>         int behavior = madv_behavior->behavior;
> +
>         struct blk_plug plug;
>         unsigned long end;
>         int error;
>
>         if (is_memory_failure(behavior))
>                 return madvise_inject_error(behavior, start, start + len_in);
> -       start = untagged_addr_remote(mm, start);
> +       start = untagged_addr(start);

Why is this okay? I see that X86's untagged_addr_remote() asserts that
the mmap lock is held, which is no longer the case here with your
patch, but untagged_addr() seems wrong here, since we can be operating
on another process. I think especially on X86 with 5-level paging and
LAM, there can probably be cases where address bits are used for part
of the virtual address in one task while they need to be masked off in
another task?

I wonder if you'll have to refactor X86 and RISC-V first to make this
work... ideally by making sure that their address tagging state
updates are atomic and untagged_addr_remote() works locklessly.

(Or you could try to use something like the
mmap_write_lock_with_all_vmas() I proposed above for synchronizing
against untagged_addr(), first write-lock the MM and then write-lock
all VMAs in it...)
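
(The lockless direction could be as simple as something like the
following, entirely hypothetical and only valid if the arch publishes
the untag mask atomically and never changes it while other threads can
be running; field names borrowed from x86's LAM code:)

static inline unsigned long untagged_addr_remote_lockless(struct mm_struct *mm,
                                                          unsigned long addr)
{
  /* hypothetical: mask published with WRITE_ONCE() by the arch code */
  return addr & READ_ONCE(mm->context.untag_mask);
}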

>         end = start + PAGE_ALIGN(len_in);
>
>         blk_start_plug(&plug);
>         if (is_madvise_populate(behavior))
>                 error = madvise_populate(mm, start, end, behavior);
> +       else if (vma)
> +               error = madvise_single_locked_vma(vma, start, end,
> +                               madv_behavior, madvise_vma_behavior);
>         else
>                 error = madvise_walk_vmas(mm, start, end, madv_behavior,
>                                           madvise_vma_behavior);
> @@ -1847,7 +1908,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
>         total_len = iov_iter_count(iter);
>
> -       ret = madvise_lock(mm, behavior);
> +       ret = __madvise_lock(mm, behavior);
>         if (ret)
>                 return ret;
>         madvise_init_tlb(&madv_behavior, mm);

(I think Lorenzo was saying that he would like madvise_lock() to also
be used in vector_madvise()? But I'll let him comment on that.)
