lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJuCfpE=QvAZB9ETMLEi7=tt-+Z9cBA-PxD9YdgvFYC12kydrw@mail.gmail.com>
Date: Mon, 17 Nov 2025 15:35:37 -0800
From: Suren Baghdasaryan <surenb@...gle.com>
To: Kefeng Wang <wangkefeng.wang@...wei.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Barry Song <21cnbao@...il.com>, 
	akpm@...ux-foundation.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>, 
	Lokesh Gidra <lokeshgidra@...gle.com>, Tangquan Zheng <zhengtangquan@...o.com>, 
	Qi Zheng <zhengqi.arch@...edance.com>
Subject: Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED

On Tue, Nov 4, 2025 at 5:04 PM Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
>
>
>
> On 2025/11/4 23:21, Lorenzo Stoakes wrote:
> > On Tue, Nov 04, 2025 at 08:09:58PM +0800, Kefeng Wang wrote:
> >>
> >>
> >> On 2025/11/4 17:01, Lorenzo Stoakes wrote:
> >>> On Tue, Nov 04, 2025 at 04:34:35PM +0800, Kefeng Wang wrote:
> >>>>> +static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
> >>>>>     {
> >>>>> + int behavior = madv_behavior->behavior;
> >>>>> +
> >>>>>           if (is_memory_failure(behavior))
> >>>>> -         return 0;
> >>>>> +         return MADVISE_NO_LOCK;
> >>>>> - if (madvise_need_mmap_write(behavior)) {
> >>>>> + switch (behavior) {
> >>>>> + case MADV_REMOVE:
> >>>>> + case MADV_WILLNEED:
> >>>>> + case MADV_COLD:
> >>>>> + case MADV_PAGEOUT:
> >>>>> + case MADV_FREE:
> >>>>> + case MADV_POPULATE_READ:
> >>>>> + case MADV_POPULATE_WRITE:
> >>>>> + case MADV_COLLAPSE:
> >>>>> + case MADV_GUARD_INSTALL:
> >>>>> + case MADV_GUARD_REMOVE:
> >>>>> +         return MADVISE_MMAP_READ_LOCK;
> >>>>> + case MADV_DONTNEED:
> >>>>> + case MADV_DONTNEED_LOCKED:
> >>>>> +         return MADVISE_VMA_READ_LOCK;
> >>>>
> >>>> I have a question, we will try per-vma lock for dontneed,
> >>>> but there is a mmap_assert_locked() during madvise_dontneed_free(),
> >>>
> >>> Hmm, this is only in the THP PUD huge case, and MADV_FREE is only valid for
> >>> anonymous memory, and I think only DAX can have some weird THP PUD case.
> >>>
> >>> So I don't think we can hit this.
> >>
> >> Yes, we don't support pud THP for anonymous pages.
> >
> > Right, so we can't hit this.
> >
> >>
> >>>
> >>> In any event, I think this mmap_assert_locked() is mistaken, as we should
> >>> only need a VMA lock here.
> >>>
> >>> So we could replace with a:
> >>>
> >>>     if (!rwsem_is_locked(&tlb->mm->mmap_lock))
> >>>             vma_assert_locked(vma);
> >>>
> >>> ?
> >>>
> >>
> >> The pmd dax/anon split don't have assert, for PUD dax, we maybe remove this
> >> assert?
> >
> > Well, we probably do want to assert that we hold a lock.
>
> OK, let's convert to vma_assert_locked.
> >
> >>
> >>
> >>
> >>
> >>>>
> >>>> madvise_dontneed_free
> >>>>     madvise_dontneed_single_vma
> >>>>       zap_page_range_single_batched
> >>>>         unmap_single_vma
> >>>>            unmap_page_range
> >>>>              zap_pud_range
> >>>>                mmap_assert_locked
> >>>>
> >>>> We could fix it by passing the lock_mode into zap_detial and then check
> >>>> the right lock here, but I'm not sure whether it is safe to zap page
> >>>> only with vma lock?
> >>>
> >>> It's fine to zap with the VMA lock. You need only hold the VMA stable which
> >>> a VMA lock achieves.
> >>>
> >>> See https://docs.kernel.org/mm/process_addrs.html
> >>
> >> Thanks, I will learn it.
> >
> > Hopefully useful, I made it to remind myself of these things as they're very
> > fiddly + otherwise I find myself constantly forgetting these details :)
>
> That should be definitely useful :)
>
> >
> >>
> >>>
> >>>>
> >>>> And another about 4f8ba33bbdfc ("mm: madvise: use per_vma lock
> >>>> for MADV_FREE"), it called walk_page_range_vma() in
> >>>> madvise_free_single_vma(),  but from link[1] and 5631da56c9a8
> >>>> ("fs/proc/task_mmu: read proc/pid/maps under per-vma lock"), it saids
> >>>>
> >>>>     "Note that similar approach would not work for /proc/pid/smaps
> >>>>     reading as it also walks the page table and that's not RCU-safe"
> >>>>
> >>>> We could use walk_page_range_vma() instead of walk_page_range() in
> >>>> smap_gather_stats(), and same question, why 4f8ba33bbdfc(for MADV_FREEE)
> >>>> is safe but not for show_numa_map()/show_smap()?
> >>>
> >>> We only use walk_page_range() there in case 4 listed in show_smaps_rollup()
> >>> where the mmap lock is dropped on contention.
> >>
> >> Sorry, I mean the walk_page_range() in smap_gather_stats() called by
> >> show_smap()  from /proc/pid/smaps, not the walk_page_range() in
> >> show_smaps_rollup() from /proc/pid/smaps_rollup.
> >
> > show_smaps()
> > -> smap_gather_stats(..., start = 0)
> > -> walk_page_vma()
> >
> > Because:
> >
> >       if (!start)
> >               walk_page_vma(vma, ops, mss);
> >
> > The only case where start is non-zero is show_smaps_rollup() case 4. So we are
> > already using walk_page_vma() here right?
> >
> > I may be missing something here :)
>
> You are right, I don't check start value :(
>
> >
> >>
> >>
> >>>
> >>>>
> >>>> Thanks.
> >>>>
> >>>> [1] https://lkml.kernel.org/r/20250719182854.3166724-1-surenb@google.com
> >>>
> >>> AFAICT That's referring to a previous approach that tried to walk
> >>> /proc/$pid/swaps under RCU _alone_ without VMA locks. This is not safe as
> >>> page tables can be yanked from under you not under RCU.
> >>
> >> But for now it tries per-vma lock or fallback to mmap lock, not lockless, so
> >> do you mean we could try per-vma lock for /proc/pid/numa_maps or
> >> /proc/pid/smaps ?
> >
> > Probably we could, but I'm not sure if it'd be really worth it, since traversing
> > page tables is a very heavy operation and so optimising it against contention
> > like this seems probably not all that worth it?
> >
> > Suren maybe could comment on this.
>
> They only operate a single vma(walk_page_vma), I think it is always
> better if we could only hold the vma lock, but wait for Suren comment on it.

Sorry for the delay. I had a huge email backlog and just got to this one.
Lorenzo is correct, the comment was warning agains accessing page
tables under RCU only. If you look up VMA under RCU and then lock it,
then you should be safe.

>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ