[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230516164907.5bf2qe7nn5bo7iaa@revolver>
Date: Tue, 16 May 2023 12:49:07 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Peter Xu <peterx@...hat.com>
Cc: Lorenzo Stoakes <lstoakes@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Rapoport <rppt@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v2] mm: userfaultfd: avoid passing an invalid range to
vma_merge()
* Peter Xu <peterx@...hat.com> [230516 11:06]:
...
> > > This seems like you're relying on:-
> > >
> > > ***
> > > CCCCCNNNNN -> CCNNNNNNNN
>
> I had a closer look today, I still think this patch is not really the right
> one. The split/merge order is something we use everywhere and I am not
> convinced it must change as drastic. At least so far it still seems to me
> if we do with what current patch proposed we can have vma fragmentations.
>
> I think I see what you meant, but here I think it's a legal case where we
> should have PPPP rather than CCCC (PPPPPNNNN --> PPNNNNNNNN).
>
> To be explicit, for register I think it _should_ be the case 0 where we
> cannot merge (note: the current code is indeed wrong though, see below):
>
> ****
> PPPPPPNNNNNN
> cannot merge
>
> While for the unregister case here it's case 4:
>
> ****
> PPPPPPNNNNNN
> might become
> PPNNNNNNNNNN
> case 4 below
>
> Here the problem is not that we need to do split then merge (I think it'll
> have the problem of vma defragmentation here), the problem is we simply
> passed in the wrong "prev" vma pointer, IMHO. I've patches attached
> showing what I meant.
>
> I checked the original commit from Andrea and I found that it _was_ correct:
>
> commit 86039bd3b4e6a1129318cbfed4e0a6e001656635
> Author: Andrea Arcangeli <aarcange@...hat.com>
> Date: Fri Sep 4 15:46:31 2015 -0700
>
> userfaultfd: add new syscall to provide memory externalization
>
> I had a feeling that it's broken during the recent rework on vma (or maybe
> even not that close), but I'm not yet sure and need to further check.
I believe this was 69dbe6daf1041e32e003f966d71f70f20c63af53, which is my
work on this area. Any patches will need to be backported to before
the vma iterator for 6.1
I think keeping the asserts and ensuring we are calling vma_merge() with
arguments that make sense is safer.
>
> > >
> > > By specifying parameters that are compatible with N even though you're only
> > > partially spanning C?
> > >
> > > This is crazy, and isn't how this should be used. vma_merge() is not
> > > supposed to do partial merges. If it works (presumably it does) this is not
> > > by design unless I've lost my mind and I (and others) have somehow not
> > > noticed this??
> > >
> > > I think you're right that now we'll end up with more fragmentation, but
> > > what you're suggesting is not how vma_merge() is supposed to work.
> > >
> > > As I said above, giving vma_merge() invalid parameters is very dangerous as
> > > you could end up merging over empty ranges in theory (and could otherwise
> > > have corruption).
> > >
> > > I guess we should probably be passing 0 to the last parameter in
> > > split_vma() here then to ensure we do a merge pass too. Will experiment
> > > with this.
> > >
> > > I'm confused as to how the remove from case 8 is not proceeding. I'll look
> > > into this some more...
> > >
> > > Happy to be corrected if I'm misconstruing this!
> > >
> >
> > OK, so I wrote a small program to do perform exactly this case [0] and it seems
> > that the outcome is the same before and after this patch - vma_merge() is
> > clearly rejecting the case 8 merge (phew!) and in both instances you end up with
> > 3 VMAs.
> >
> > So this patch doesn't change this behaviour and everything is as it was
> > before. Ideally we'd let it go for another pass, so maybe we should change the
> > split to add a new VMA _afterwards_. Will experiment with that, separately.
> >
> > But looks like the patch is good as it is.
> >
> > (if you notice something wrong with the repro, etc. do let me know!)
> >
> > [0]: https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e
>
> I think your test case is fine, but... no, this is still not expected. The
> vma should just merge.
>
> So I have another closer look on this specific issue, it seems we have a
> long standing bug on pgoff calculation here when passing that to
> vma_merge(). I've got another patch attached to show what I meant.
>
> To summarize.. now I've got two patches attached:
>
> Patch 1 is something I'd like to propose to replace this patch that fixes
> incorrect use of vma_merge() so it should also eliminate the assertion
> being triggered (I still think this is a regression but I need to check
> which I will do later; I'm not super familiar with maple tree work, maybe
> you or Liam can quickly spot).
>
> Patch 2 fixes the long standing issue of vma not being able to merge in
> above case, and with patch 2 applied it should start merging all right.
>
> Please have a look, thanks.
>
> ---8<---
> From 6bc39028bba246394bb0bafdaeaab7b8dfd1694f Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@...hat.com>
> Date: Tue, 16 May 2023 09:03:22 -0400
> Subject: [PATCH 1/2] mm/uffd: Fix vma operation where start addr cuts part of
> vma
>
> It seems vma merging with uffd paths is broken with either
> register/unregister, where right now we can feed wrong parameters to
> vma_merge() and it's found by recent patch which moved asserts upwards in
> vma_merge():
>
> https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
>
> The problem is in the current code base we didn't fixup "prev" for the case
> where "start" address can be within the "prev" vma section. In that case
> we should have "prev" points to the current vma rather than the previous
> one when feeding to vma_merge().
>
> This will eliminate the report and make sure vma_merge() calls will become
> legal again.
>
> Signed-off-by: Peter Xu <peterx@...hat.com>
> ---
> fs/userfaultfd.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0fd96d6e39ce..7eb88bc74d00 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> BUG_ON(!found);
>
> vma_iter_set(&vmi, start);
> - prev = vma_prev(&vmi);
> + vma = vma_find(&vmi, end);
> + if (!vma)
> + goto out_unlock;
> +
> + if (vma->vm_start < start)
> + prev = vma;
> + else
> + prev = vma_prev(&vmi);
>
> ret = 0;
> - for_each_vma_range(vmi, vma, end) {
> + do {
The iterator may be off by one, depending on if vma_prev() is called or
not.
Perhaps:
prev = vma_prev(&vmi); /* could be wrong, or null */
vma = vma_find(&vmi, end);
if (!vma)
goto out_unlock;
if (vma->vm_start < start)
prev = vma;
now we know we are at vma with the iterator..
ret = 0
do{
...
> cond_resched();
>
> BUG_ON(!vma_can_userfault(vma, vm_flags));
> @@ -1517,7 +1524,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> skip:
> prev = vma;
> start = vma->vm_end;
> - }
> + } for_each_vma_range(vmi, vma, end);
>
> out_unlock:
> mmap_write_unlock(mm);
> @@ -1624,9 +1631,17 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> BUG_ON(!found);
>
> vma_iter_set(&vmi, start);
> - prev = vma_prev(&vmi);
> + vma = vma_find(&vmi, end);
> + if (!vma)
> + goto out_unlock;
> +
> + if (vma->vm_start < start)
> + prev = vma;
> + else
> + prev = vma_prev(&vmi);
> +
> ret = 0;
> - for_each_vma_range(vmi, vma, end) {
> + do {
> cond_resched();
>
> BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> @@ -1692,7 +1707,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> skip:
> prev = vma;
> start = vma->vm_end;
> - }
> + } for_each_vma_range(vmi, vma, end);
>
> out_unlock:
> mmap_write_unlock(mm);
> --
> 2.39.1
>
> From bf61f3c937e9e2ab96ab2bed0005cb7dc74db231 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@...hat.com>
> Date: Tue, 16 May 2023 09:39:38 -0400
> Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible
>
> We used to not pass in the pgoff correctly when register/unregister uffd
> regions, it caused incorrect behavior on vma merging.
>
> For example, when we have:
>
> vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
>
> Then someone unregisters uffd on range (5-9), it should become:
>
> vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
>
> But with current code it's:
>
> vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
>
> This patch allows such merge to happen correctly.
>
> This behavior seems to have existed since the 1st day of uffd, keep it just
> as a performance optmization and not copy stable.
>
> Cc: Andrea Arcangeli <aarcange@...hat.com>
> Cc: Mike Rapoport (IBM) <rppt@...nel.org>
> Signed-off-by: Peter Xu <peterx@...hat.com>
> ---
> fs/userfaultfd.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 7eb88bc74d00..891048b4799f 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> bool basic_ioctls;
> unsigned long start, end, vma_end;
> struct vma_iterator vmi;
> + pgoff_t pgoff;
>
> user_uffdio_register = (struct uffdio_register __user *) arg;
>
> @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> vma_end = min(end, vma->vm_end);
>
> new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
I don't think this is safe. You are telling vma_merge() something that
is not true and will result in can_vma_merge_before() passing. I mean,
sure it will become true after you split (unless you can't?), but I
don't know if you can just merge a VMA that doesn't pass
can_vma_merge_before(), even for a short period?
> prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> - vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> + vma->anon_vma, vma->vm_file, pgoff,
> vma_policy(vma),
> ((struct vm_userfaultfd_ctx){ ctx }),
> anon_vma_name(vma));
> @@ -1570,6 +1572,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> unsigned long start, end, vma_end;
> const void __user *buf = (void __user *)arg;
> struct vma_iterator vmi;
> + pgoff_t pgoff;
>
> ret = -EFAULT;
> if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> @@ -1677,8 +1680,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> uffd_wp_range(vma, start, vma_end - start, false);
>
> new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> - vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> + vma->anon_vma, vma->vm_file, pgoff,
> vma_policy(vma),
> NULL_VM_UFFD_CTX, anon_vma_name(vma));
> if (prev) {
> --
> 2.39.1
> ---8<---
>
> --
> Peter Xu
>
Powered by blists - more mailing lists