linux-kernel - Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a reference count

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ulbspoec633hfm54f3jzvoqs6ilskxou3qykk2u727pbaltvfl@lb53vjcaxnuf>
Date: Wed, 18 Dec 2024 15:38:29 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, akpm@...ux-foundation.org,
        willy@...radead.org, lorenzo.stoakes@...cle.com, mhocko@...e.com,
        vbabka@...e.cz, hannes@...xchg.org, mjguzik@...il.com,
        oliver.sang@...el.com, mgorman@...hsingularity.net, david@...hat.com,
        peterx@...hat.com, oleg@...hat.com, dave@...olabs.net,
        paulmck@...nel.org, brauner@...nel.org, dhowells@...hat.com,
        hdanton@...a.com, hughd@...gle.com, lokeshgidra@...gle.com,
        minchan@...gle.com, jannh@...gle.com, shakeel.butt@...ux.dev,
        souravpanda@...gle.com, pasha.tatashin@...een.com,
        klarasmodin@...il.com, corbet@....net, linux-doc@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        kernel-team@...roid.com
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a
 reference count

* Suren Baghdasaryan <surenb@...gle.com> [241218 15:01]:
> On Wed, Dec 18, 2024 at 11:38 AM 'Liam R. Howlett' via kernel-team
> <kernel-team@...roid.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@...gle.com> [241218 14:29]:
> > > On Wed, Dec 18, 2024 at 11:07 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > >
> > > > On Wed, Dec 18, 2024 at 11:00 AM 'Liam R. Howlett' via kernel-team
> > > > <kernel-team@...roid.com> wrote:
> > > > >
> > > > > * Suren Baghdasaryan <surenb@...gle.com> [241218 12:58]:
> > > > > > On Wed, Dec 18, 2024 at 9:44 AM Peter Zijlstra <peterz@...radead.org> wrote:
> > > > > > >
> > > > > > > On Wed, Dec 18, 2024 at 09:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > > > >
> > > > > > > > > You will not. vms_complete_munmap_vmas() will call remove_vma() to
> > > > > > > > > remove PTEs IIRC, and if you do start_write() and detach() before
> > > > > > > > > dropping mmap_lock_write, you should be good.
> > > > > > > >
> > > > > > > > Ok, I think we will have to move mmap_write_downgrade() inside
> > > > > > > > vms_complete_munmap_vmas() to be called after remove_vma().
> > > > > > > > vms_clear_ptes() is using vmas, so we can't move remove_vma() before
> > > > > > > > mmap_write_downgrade().
> > > > > > >
> > > > > > > Why ?!
> > > > > > >
> > > > > > > vms_clear_ptes() and remove_vma() are fine where they are -- there is no
> > > > > > > concurrency left at this point.
> > > > > > >
> > > > > > > Note that by doing vma_start_write() inside vms_complete_munmap_vmas(),
> > > > > > > which is *after* the vmas have been unhooked from the mm, you wait for
> > > > > > > any concurrent user to go away.
> > > > > > >
> > > > > > > And since they're unhooked, there can't be any new users.
> > > > > > >
> > > > > > > So you're the one and only user left, and code is fine the way it is.
> > > > > >
> > > > > > Ok, let me make sure I understand this part of your proposal. From
> > > > > > your earlier email:
> > > > > >
> > > > > > @@ -1173,6 +1173,11 @@ static void vms_complete_munmap_vmas(struct
> > > > > > vma_munmap_struct *vms,
> > > > > >         struct vm_area_struct *vma;
> > > > > >         struct mm_struct *mm;
> > > > > >
> > > > > > +       mas_for_each(mas_detach, vma, ULONG_MAX) {
> > > > > > +               vma_start_write(next);
> > > > > > +               vma_mark_detached(next, true);
> > > > > > +       }
> > > > > > +
> > > > > >         mm = current->mm;
> > > > > >         mm->map_count -= vms->vma_count;
> > > > > >         mm->locked_vm -= vms->locked_vm;
> > > > > >
> > > > > > This would mean:
> > > > > >
> > > > > > vms_complete_munmap_vmas
> > > > > >            vma_start_write
> > > > > >            vma_mark_detached
> > > > > >            mmap_write_downgrade
> > > > > >            vms_clear_ptes
> > > > > >            remove_vma
> > > > > >
> > > > > > And remove_vma will be just freeing the vmas. Is that correct?
> > > > > > I'm a bit confused because the original thinking was that
> > > > > > vma_mark_detached() would drop the last refcnt and if it's 0 we would
> > > > > > free the vma right there. If that's still what we want to do then I
> > > > > > think the above sequence should look like this:
> > > > > >
> > > > > > vms_complete_munmap_vmas
> > > > > >            vms_clear_ptes
> > > > > >            remove_vma
> > > > > >                vma_start_write
> > > > > >                vma_mark_detached
> > > > > >            mmap_write_downgrade
> > > > > >
> > > > > > because vma_start_write+vma_mark_detached should be done under  mmap_write_lock.
> > > > > > Please let me know which way you want to move forward.
> > > > > >
> > > > >
> > > > > Are we sure we're not causing issues with the MAP_FIXED path here?
> > > > >
> > > > > With the above change, we'd be freeing the PTEs before marking the vmas
> > > > > as detached or vma_start_write().
> > > >
> > > > IIUC when we call vms_complete_munmap_vmas() all vmas inside
> > > > mas_detach have been already write-locked, no?
> >
> > That's the way it is today - but I thought you were moving the lock to
> > the complete stage, not adding a new one? (why add a new one otherwise?)
> 
> Is my understanding correct that mas_detach is populated by
> vms_gather_munmap_vmas() only with vmas that went through
> __split_vma() (and were write-locked there)? I don't see any path that
> would add any other vma into mas_detach but maybe I'm missing
> something?

No, that is not correct.

vms_gather_munmap_vmas() calls split on the first vma, then adds all
vmas that are within the range of the munmap() call.  Potentially
splitting the last vma and adding that in the
"if (next->vm_end > vms->end)" block.

Sometimes this is a single vma that gets split twice, sometimes no
splits happen and entire vmas are unmapped, sometimes it's just one vma
that isn't split.

My observation is the common case is a single vma, but besides that we
see 3, and sometimes 7 at a time, but it could be any number of vmas and
not all of them are split.

There is a loop for_each_vma_range() that does:

vma_start_write(next);
mas_set(mas_detach, vms->mas_count++);
mas_store_gfp(mas_detach, next, GFP_KERNEL);


> 
> >
> > >
> > > Yeah, I think we can simply do this:
> > >
> > > vms_complete_munmap_vmas
> > >            vms_clear_ptes
> > >            remove_vma
> > >                vma_mark_detached
> > >            mmap_write_downgrade
> > >
> > > If my assumption is incorrect, assertion inside vma_mark_detached()
> > > should trigger. I tried a quick test and so far nothing exploded.
> > >
> >
> > If they are write locked, then the page faults are not a concern.  There
> > is also the rmap race that Jann found in mmap_region() [1].  This is
> > probably also fine since you are keeping the write lock in place earlier
> > on in the gather stage.  Note the ptes will already be cleared by the
> > time vms_complete_munmap_vmas() is called in this case.
> >
> > [1] https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/
> >
> > Thanks,
> > Liam
> >
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@...roid.com.
> >