Date:   Tue, 15 Aug 2023 03:29:07 -0400
From:   "Liam R. Howlett" <Liam.Howlett@...cle.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 15/15] mm/mmap: Change vma iteration order in
 do_vmi_align_munmap()

* Jann Horn <jannh@...gle.com> [230814 17:22]:
> On Mon, Aug 14, 2023 at 10:32 PM Liam R. Howlett
> <Liam.Howlett@...cle.com> wrote:
> > * Jann Horn <jannh@...gle.com> [230814 11:44]:
> > > @akpm
> > >
> > > On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> > > > Since prev will be set later in the function, it is better to reverse
> > > > the splitting direction of the start VMA (modify the new_below argument
> > > > to __split_vma).
> > >
> > > It might be a good idea to reorder "mm: always lock new vma before
> > > inserting into vma tree" before this patch.
> > >
> > > If you apply this patch without "mm: always lock new vma before
> > > inserting into vma tree", I think move_vma(), when called with a start
> > > address in the middle of a VMA, will behave like this:
> > >
> > >  - vma_start_write() [lock the VMA to be moved]
> > >  - move_page_tables() [moves page table entries]
> > >  - do_vmi_munmap()
> > >    - do_vmi_align_munmap()
> > >      - __split_vma()
> > >        - creates a new VMA **covering the moved range** that is **not locked**
> > >        - stores the new VMA in the VMA tree **without locking it** [1]
> > >      - new VMA is locked and removed again [2]
> > > [...]
> > >
> > > So after the page tables in the region have already been moved, I
> > > believe there will be a brief window (between [1] and [2]) where page
> > > faults in the region can happen again, which could probably cause new
> > > page tables and PTEs to be created in the region again in that window.
> > > (This can't happen in Linus' current tree because the new VMA created
> > > by __split_vma() only covers the range that is not being moved.)
> >
> > Ah, so my reversing of which VMA to keep to the first split call opens a
> > window where the VMA being removed is not locked.  Good catch.

Looking at this again, I think this window exists in Linus' tree and my
change actually removes it:

-               error = __split_vma(vmi, vma, start, 0);
+               error = __split_vma(vmi, vma, start, 1);
                if (error)
                        goto start_split_failed;

The last argument is "new_below"; when set, the new VMA will be at the
lower addresses.  I don't love the int argument or the name, and the
documentation for the split function is lacking.
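
Roughly, the effect is this (a simplified sketch; the real
__split_vma() also fixes up anon_vma chains, vm_pgoff, and the maple
tree state):

	new = vm_area_dup(vma);
	if (new_below)
		new->vm_end = addr;	/* new VMA takes [vm_start, addr) */
	else
		new->vm_start = addr;	/* new VMA takes [addr, vm_end) */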

So, once we split at "start", the new VMA has vm_end == "start", while
"start" itself lands in the old VMA.  I then lock the old VMA to be
removed (again) and add it to the detached maple tree.

Before my patch, we split the VMA and took the new, unlocked VMA for
removal, and only then locked the new VMA and added it to the detached
maple tree.  So there was a window where the new split VMA was written
into the tree before being locked, although it was locked before
removal.
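
To put the two orderings side by side (my reading, simplified):

	Before (new_below == 0):
	  __split_vma()        new VMA = [start, vm_end), unlocked
	  store into VMA tree  unlocked VMA visible to faults   <-- window
	  vma_start_write()    locked only just before removal

	After (new_below == 1):
	  __split_vma()        new VMA = [vm_start, start), kept, not removed
	  vma_start_write()    old VMA at [start, ...) locked (again)
	  store to detached tree, then removal

Only the kept VMA is ever in the tree unlocked, and faults on it are
fine since it is not going away.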

This change actually aligns the splitting with the other callers that
use the split_vma() wrapper.
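
For reference, the wrapper is just (paraphrased from mm/mmap.c; the
exact limit check may differ by version):

	int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
		      unsigned long addr, int new_below)
	{
		if (vma->vm_mm->map_count >= sysctl_max_map_count)
			return -ENOMEM;

		return __split_vma(vmi, vma, addr, new_below);
	}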

> >
> > >
> > > Though I guess that's not going to lead to anything bad, since
> > > do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> > > So maybe it's not that important.
> >
> > do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
> > the start of the next
> 
> Alright, I guess no action is needed here then.

I don't see a difference between this and the existing race where a
page fault completes and another task unmaps the area before the
faulting task touches it.

> 
> > I don't have any objections in the ordering or see an issue resulting
> > from having it this way... Except for maybe lockdep, so maybe we should
> > change the ordering of the patch sets just to be safe?
> >
> > In fact, should we add another check somewhere to ensure we do generate
> > the warning?  Perhaps to remove_mt() to avoid the exit path hitting it?
> 
> I'm not sure which lockdep check you mean. do_vmi_align_munmap() is
> going to lock the VMAs again before it operates on them; I guess the
> only checks that would catch this would be the page table validation
> logic or the RSS counter checks on exit?
> 

I'm trying to add a lockdep check to detect this potential window in
the future, but as you pointed out, it won't work since the VMA will be
locked before removal.  I'm not sure it's worth it since Suren added
more lockdep checks in his series.
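
For concreteness, what I had in mind was roughly this (hypothetical;
vma_assert_write_locked() is the assertion helper from the per-VMA lock
work), though since the VMA is always locked by the time remove_mt()
runs, it would pass even while the window existed:

	/* hypothetical check in remove_mt(): every VMA on the detached
	 * tree must already be write-locked */
	mas_for_each(mas, vma, ULONG_MAX) {
		vma_assert_write_locked(vma);
		/* existing removal work */
	}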

I appreciate you really looking at these changes and thinking them
through.

Regards,
Liam
