lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230816161758.avedpxvqpwngzmut@revolver>
Date:   Wed, 16 Aug 2023 12:17:58 -0400
From:   "Liam R. Howlett" <Liam.Howlett@...cle.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: maple tree change made it possible for VMA iteration to see same
 VMA twice due to late vma_merge() failure

* Jann Horn <jannh@...gle.com> [230815 15:37]:
> commit 18b098af2890 ("vma_merge: set vma iterator to correct
> position.") added a vma_prev(vmi) call to vma_merge() at a point where
> it's still possible to bail out. My understanding is that this moves
> the VMA iterator back by one VMA.
> 
> If you patch some extra logging into the kernel and inject a fake
> out-of-memory error at the vma_iter_prealloc() call in vma_split() (a
> real out-of-memory error there is very unlikely to happen in practice,
> I think - my understanding is that the kernel will basically kill
> every process on the system except for init before it starts failing
> GFP_KERNEL allocations that fit within a single slab, unless the
> allocation uses GFP_ACCOUNT or stuff like that, which the maple tree
> doesn't):
> 
> ```
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 7cecd49e078b..a7be4d6a5db6 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1454,9 +1454,16 @@ static int userfaultfd_register(struct
> userfaultfd_ctx *ctx,
>                 prev = vma;
> 
>         ret = 0;
> +       if (strcmp(current->comm, "FAILME") == 0)
> +               pr_warn("%s: begin vma iteration\n", __func__);
>         for_each_vma_range(vmi, vma, end) {
>                 cond_resched();
> 
> +               if (strcmp(current->comm, "FAILME") == 0) {
> +                       pr_warn("%s: prev=%px, vma=%px (%016lx-%016lx)\n",
> +                               __func__, prev, vma, vma->vm_start,
> vma->vm_end);
> +               }
> +
>                 BUG_ON(!vma_can_userfault(vma, vm_flags));
>                 BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
>                        vma->vm_userfaultfd_ctx.ctx != ctx);
> @@ -1481,6 +1488,8 @@ static int userfaultfd_register(struct
> userfaultfd_ctx *ctx,
>                                  vma_policy(vma),
>                                  ((struct vm_userfaultfd_ctx){ ctx }),
>                                  anon_vma_name(vma));
> +               if (strcmp(current->comm, "FAILME") == 0)
> +                       pr_warn("%s: vma_merge returned %px\n", __func__, prev);
>                 if (prev) {
>                         /* vma_merge() invalidated the mas */
>                         vma = prev;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3937479d0e07..fd435c40c43d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -990,7 +990,7 @@ struct vm_area_struct *vma_merge(struct
> vma_iterator *vmi, struct mm_struct *mm,
>         if (err)
>                 return NULL;
> 
> -       if (vma_iter_prealloc(vmi))
> +       if (strcmp(current->comm, "FAILME")==0 || vma_iter_prealloc(vmi))
>                 return NULL;
> 
>         init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
> ```
> 
> and then you run this reproducer:
> ```
> #define _GNU_SOURCE
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <sys/prctl.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <linux/userfaultfd.h>
> 
> #ifndef UFFD_USER_MODE_ONLY
> #define UFFD_USER_MODE_ONLY 1
> #endif
> 
> #define SYSCHK(x) ({          \
>   typeof(x) __res = (x);      \
>   if (__res == (typeof(x))-1) \
>     err(1, "SYSCHK(" #x ")"); \
>   __res;                      \
> })
> 
> int main(void) {
>   int uffd = SYSCHK(syscall(__NR_userfaultfd, UFFD_USER_MODE_ONLY));
> 
>   struct uffdio_api api = { .api = UFFD_API, .features = 0 };
>   SYSCHK(ioctl(uffd, UFFDIO_API, &api));
> 
>   /* create vma1 */
>   SYSCHK(mmap((void*)0x100000UL, 0x1000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0));
> 
>   /* set uffd on vma1 */
>   struct uffdio_register reg1 = {
>     .range = { .start = 0x100000, .len = 0x1000 },
>     .mode = UFFDIO_REGISTER_MODE_MISSING
>   };
>   SYSCHK(ioctl(uffd, UFFDIO_REGISTER, &reg1));
> 
>   /* create vma2 */
>   SYSCHK(mmap((void*)0x101000UL, 0x1000, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0));
> 
>   /* tries to merge vma2 into vma1, with injected allocation failure
> causing merge failure */
>   SYSCHK(prctl(PR_SET_NAME, "FAILME"));
>   struct uffdio_register reg2 = {
>     .range = { .start = 0x101000, .len = 0x1000 },
>     .mode = UFFDIO_REGISTER_MODE_MISSING
>   };
>   SYSCHK(ioctl(uffd, UFFDIO_REGISTER, &reg2));
>   SYSCHK(prctl(PR_SET_NAME, "normal"));
> }
> ```
> 
> then you'll get this fun log output, showing that the same VMA
> (ffff88810c0b5e00) was visited by two iterations of the VMA iteration
> loop, and on the second iteration, prev==vma:
> 
> [  326.765586] userfaultfd_register: begin vma iteration
> [  326.766985] userfaultfd_register: prev=ffff88810c0b5ef0,
> vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> [  326.768786] userfaultfd_register: vma_merge returned 0000000000000000
> [  326.769898] userfaultfd_register: prev=ffff88810c0b5e00,
> vma=ffff88810c0b5e00 (0000000000101000-0000000000102000)
> 
> I don't know if this can lead to anything bad but it seems pretty
> clearly unintended?

Yes, unintended.

So we are running out of memory, but since vma_merge() doesn't
differentiate between failure and 'nothing to merge', we end up in a
situation that we will revisit the same VMA.

I've been thinking about a way to work this into the interface and I
don't see a clean way because we (could) do different things before the
call depending on the situation.

I think we need to undo any vma iterator changes in the failure
scenarios if there is a chance of the iterator continuing to be used,
which is probably not limited to just this case.

I will audit these areas and CC you on the result.

Thanks,
Liam

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ