Message-ID: <20230816191851.wo2xhthmfq7uzoc3@revolver>
Date:   Wed, 16 Aug 2023 15:18:51 -0400
From:   "Liam R. Howlett" <Liam.Howlett@...cle.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: maple tree change made it possible for VMA iteration to see same
 VMA twice due to late vma_merge() failure

* Jann Horn <jannh@...gle.com> [230816 13:13]:
> On Wed, Aug 16, 2023 at 6:18 PM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
> > * Jann Horn <jannh@...gle.com> [230815 15:37]:
> > > commit 18b098af2890 ("vma_merge: set vma iterator to correct
> > > position.") added a vma_prev(vmi) call to vma_merge() at a point where
> > > it's still possible to bail out. My understanding is that this moves
> > > the VMA iterator back by one VMA.
> > >
> > > If you patch some extra logging into the kernel and inject a fake
> > > out-of-memory error at the vma_iter_prealloc() call in vma_split() (a
> > > real out-of-memory error there is very unlikely to happen in practice,
> > > I think - my understanding is that the kernel will basically kill
> > > every process on the system except for init before it starts failing
> > > GFP_KERNEL allocations that fit within a single slab, unless the
> > > allocation uses GFP_ACCOUNT or stuff like that, which the maple tree
> > > doesn't):
> [...]
> > > then you'll get this fun log output, showing that the same VMA
> > > (ffff88810c0b5e00) was visited by two iterations of the VMA iteration
> > > loop, and on the second iteration, prev==vma:
> > >
> > > [  326.765586] userfaultfd_register: begin vma iteration
> > > [  326.766985] userfaultfd_register: prev=ffff88810c0b5ef0,
> > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> > > [  326.768786] userfaultfd_register: vma_merge returned 0000000000000000
> > > [  326.769898] userfaultfd_register: prev=ffff88810c0b5e00,
> > > vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> > >
> > > I don't know if this can lead to anything bad but it seems pretty
> > > clearly unintended?
> >
> > Yes, unintended.
> >
> > So we are running out of memory, but since vma_merge() doesn't
> > differentiate between failure and 'nothing to merge', we end up in a
> > situation that we will revisit the same VMA.
> >
> > I've been thinking about a way to work this into the interface and I
> > don't see a clean way because we (could) do different things before the
> > call depending on the situation.
> >
> > I think we need to undo any vma iterator changes in the failure
> > scenarios if there is a chance of the iterator continuing to be used,
> > which is probably not limited to just this case.
> 
> I don't fully understand the maple tree interface - in the specific
> case of vma_merge(), could you move the vma_prev() call down below the
> point of no return, after vma_iter_prealloc()? Or does
> vma_iter_prealloc() require that the iterator is already in the insert
> position?

Yes, but maybe it shouldn't.  I detect a write that extends beyond the
end of a node and take corrective action, but not one that extends
before the front of a node.

If I change the internal code to figure out the preallocations without
being pointed at the insert location, I still cannot take corrective
action on failure since I don't know where I should have been within the
tree structure, that is, I have lost the original range.

I'm still looking at this, but I'm wondering if I should change my
interface for preallocations so I can handle this internally.  That
would be a bigger change.
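
To make the failure mode discussed above concrete, here is a minimal
userspace sketch in C.  It is a toy model, not kernel code: toy_vma,
toy_iter, toy_next(), toy_prev(), toy_prealloc(), toy_merge() and
toy_walk() are all hypothetical names invented for this illustration,
and a flat array stands in for the maple tree.  It assumes only what the
thread describes: the iterator is stepped back before a preallocation
that can fail, and the caller cannot tell failure from "nothing to
merge".  Saving the iterator position before the move and restoring it
on failure is one possible form of "undoing iterator changes", shown
here only as an assumption about what such a fix could look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of these are kernel APIs. */
struct toy_vma {
	unsigned long start, end;
};

struct toy_iter {
	const struct toy_vma *vmas;
	size_t count;
	size_t next;	/* index that toy_next() will hand out */
};

static const struct toy_vma *toy_next(struct toy_iter *it)
{
	return it->next < it->count ? &it->vmas[it->next++] : NULL;
}

/* Step the iterator back by one entry, loosely modelling vma_prev(). */
static void toy_prev(struct toy_iter *it)
{
	if (it->next > 0)
		it->next--;
}

/* Fallible preallocation; the failure is injected by the caller. */
static bool toy_prealloc(bool inject_failure)
{
	return !inject_failure;
}

/*
 * Returns false both for "nothing to merge" and for allocation failure,
 * mirroring the ambiguity described in the thread.
 */
static bool toy_merge(struct toy_iter *it, bool inject_failure,
		      bool restore_on_failure)
{
	struct toy_iter saved = *it;	/* remember the original position */

	toy_prev(it);			/* reposition before the point of no return */
	if (!toy_prealloc(inject_failure)) {
		if (restore_on_failure)
			*it = saved;	/* undo the iterator change */
		return false;
	}
	*it = saved;			/* toy model: a successful merge changes nothing */
	return true;
}

/* Walk three VMAs and return the highest visit count of any single VMA. */
int toy_walk(bool restore_on_failure)
{
	static const struct toy_vma vmas[] = {
		{ 0x100000, 0x101000 },
		{ 0x101000, 0x102000 },
		{ 0x102000, 0x103000 },
	};
	const size_t count = sizeof(vmas) / sizeof(vmas[0]);
	int visits[3] = { 0 };
	struct toy_iter it = { vmas, count, 0 };
	const struct toy_vma *vma;
	bool fail_once = true;
	int max = 0;

	while ((vma = toy_next(&it)) != NULL) {
		size_t idx = (size_t)(vma - vmas);

		assert(idx < count);
		if (++visits[idx] > max)
			max = visits[idx];
		/* Inject a single failure on the first visit to the second VMA. */
		if (idx == 1 && fail_once) {
			fail_once = false;
			toy_merge(&it, true, restore_on_failure);
		}
	}
	return max;
}
```

Calling toy_walk(false) models the behaviour reported in the thread,
where the loop sees the same entry twice after the failed merge, while
toy_walk(true) restores the saved position so each entry is visited
once.  The sketch sidesteps the harder problem raised above: it can save
the whole iterator cheaply, whereas the real iterator has lost the
original range by the time the failure is detected.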

> 
> > I will audit these areas and CC you on the result.
> 
> Thanks!