linux-kernel - Re: [PATCH RFC] mm: use per_vma lock for MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <bbf35e8a-d90a-4e06-a941-3690be6577ba@lucifer.local>
Date: Wed, 28 May 2025 10:43:55 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
        Jann Horn <jannh@...gle.com>, Suren Baghdasaryan <surenb@...gle.com>,
        Lokesh Gidra <lokeshgidra@...gle.com>,
        Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED

On Wed, May 28, 2025 at 05:36:40PM +0800, Barry Song wrote:
> On Tue, May 27, 2025 at 5:20 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
[s[nip]
> >
> > Thanks, this sounds really promising!
> >
> > I take it then you have as a result, heavily tested this change?
>
> This was extensively tested on an older Android kernel with real devices.
> As you know, running the latest mm-unstable on phones is challenging due
> to hardware driver constraints. However, I believe the reported percentage
> is accurate, since it seems pointless for userspace heaps to free memory
> across two or more VMAs. How could it possibly manage a slab-like system
> spanning multiple VMAs?

Right, yeah. mremap()'ing anon memory around might get you unexpected splits,
but generally speaking you'd expect them to be single VMAs.

[snip]

> > >  mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 8433ac9b27e0..da016a1d0434 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > >
> > >       if (madvise_should_skip(start, len_in, behavior, &error))
> > >               return error;
> > > +
> > > +     /*
> > > +      * MADV_DONTNEED is commonly used with userspace heaps and most often
> > > +      * affects a single VMA. In these cases, we can use per-VMA locks to
> > > +      * reduce contention on the mmap_lock.
> > > +      */
> > > +     if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {
> >
> > So firstly doing this here means process_madvise() doesn't get this benefit, and
> > we're inconsistent between the two which we really want to avoid.
> >
> > But secondly - we definitely need to find a better way to do this :) this
> > basically follows the 'ignore the existing approach and throw in an if
> > (special case) { ... }' pattern that I feel we really need to do all we can
> > to avoid in the kernel.
> >
> > This lies the way of uffd, hugetlb, and thus horrors beyond imagining.
> >
> > I can see why you did this as this is kind of special-cased a bit, and we
> > already do this kind of thing all over the place but let's try to avoid
> > this here.
> >
> > So I suggest:
> >
> > - Remove any code for this from do_madvise() and thus make it available to
> >   process_madvise() also.
> >
> > - Try to avoid the special casing here as much as humanly possible :)
> >
> > - Update madvise_lock()/unlock() to get passed a pointer to struct
> >   madvise_behavior to which we can add a boolean or even better I think -
> >   an enum indicating which lock type was taken (this can simplify
> >   madvise_unlock() also).
> >
> > - Update madvise_lock() to do all of the checks below, we already
> >   effectively do a switch (behavior) so it's not so crazy to do this. And
> >   you can also do the fallthrough logic there.
> >
> > - Obviously madvise_unlock() can be updated to do vma_end_read().
>
> I’ve definitely considered refactoring madvise_lock, madvise_do_behavior,
> and madvise_unlock to encapsulate the details of the per-VMA locking and
> mmap_lock handling within those functions:
> madvise_lock(mm, behavior);
> madvise_do_behavior(mm, start, len_in, behavior);
> madvise_unlock(mm, behavior);
>
> However, I’m a bit concerned that this approach might make the code messier
> by introducing extra arguments to handle different code paths. For instance,
> madvise_do_behavior might need an additional parameter to determine whether
> VMA traversal via madvise_walk_vmas is necessary.
>
> That said, I’ll give it a try and see if it actually turns out to be as ugly
> as I fear.

Your reticence is understandable, this code can be pretty hairy, however
thanks to SJ's refactoring efforts you can now use the helper struct he
introduced - madvise_behavior - to store this state, so you don't have to
pass parameters really _much_ more than now.

Unfortunately the current version of the patch isn't mergeable as it
stands, so it really does have to be sensibly generalised, however it ends
up looking.

To be clear - I appreciate all your efforts, and sorry to add some extra
work here, but in the long run it'll be worthwhile I feel, and will lay the
foundations for future use of VMA locks :)

Cheers, Lorenzo

[snip]