linux-kernel - Re: kmemleak splat on copy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 17 May 2017 12:09:23 +0100
From:   Catalin Marinas <catalin.marinas@....com>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Michal Hocko <mhocko@...nel.org>,
        "Luis R. Rodriguez" <mcgrof@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Mateusz Guzik <mguzik@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: kmemleak splat on copy_process()

On Tue, May 16, 2017 at 04:55:28PM -0700, Andy Lutomirski wrote:
> On Tue, May 16, 2017 at 6:39 AM, Catalin Marinas
> <catalin.marinas@....com> wrote:
> > Thanks for cc'ing me. The vmalloc allocations have always been tricky
> > for kmemleak since there are 2-3 other memory locations with the same
> > value as the vmalloc'ed object: vm_struct.addr and vmap_area.va_start;
> > occasionally we have vmap_area.va_end pointing to the next
> > vmap_area.va_start.
> >
> > To have a more reliable object reference counting, kmemleak avoids
> > scanning most of the vmap_area structure (apart from vmap_area.vm) and
> > requires a minimum count of 2 references for a vmalloc'ed object to not
> > be considered a leak (that is vm_struct.addr and the caller's location,
> > in this case task->stack).
> >
> >> On Mon 15-05-17 23:53:18, Luis R. Rodriguez wrote:
> >> > root@...gy:~# cat /sys/kernel/debug/kmemleak
> >> > unreferenced object 0xffffa07500d4c000 (size 16384):
> >> >   comm "bash", pid 1349, jiffies 4294895999 (age 263.204s)
> >> >   hex dump (first 32 bytes):
> >> >     9d 6e ac 57 00 00 00 00 00 00 00 00 00 00 00 00  .n.W............
> >> >     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> >> >   backtrace:
> >> >     [<ffffffffa5464cca>] kmemleak_alloc+0x4a/0xa0
> >> >     [<ffffffffa4fdfe6c>] __vmalloc_node_range+0x20c/0x2b0
> >> >     [<ffffffffa4e7d1a2>] copy_process.part.37+0x5c2/0x1af0
> >> >     [<ffffffffa4e7e89f>] _do_fork+0xcf/0x390
> >> >     [<ffffffffa4e7ec09>] SyS_clone+0x19/0x20
> >> >     [<ffffffffa4e03b0b>] do_syscall_64+0x5b/0xc0
> >> >     [<ffffffffa547072b>] return_from_SYSCALL_64+0x0/0x6a
> >> >     [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > This stack here was allocated by bash for a child process via
> > alloc_thread_stack_node() when CONFIG_VMAP_STACK is enabled. When the
> > child died, free_stack_thread() did not call vfree_atomic() but rather
> > stored the corresponding vm_struct pointer in cached_stacks[i]. However,
> > the original task->stack pointer was lost (child task_struct freed), so
> > when scanning the memory kmemleak can only find a single reference to
> > the original stack (via vm_struct cached_stacks[i]) rather than 2 as
> > required for vmalloc'ed objects.
> >
> > BTW, you can get more info about a specific object, including the number
> > of references found, with:
> >
> >   echo dump=0xffffa07500d4c000 > /sys/kernel/debug/kmemleak
> >
> > So basically kernel/fork.c has its own thread stack allocator on top of
> > vmalloc and kmemleak gets confused. A quick workaround:
> >
> > ------------8<------------------------------
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 06d759ab4c62..785907fccf67 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -214,6 +214,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
> >                 this_cpu_write(cached_stacks[i], NULL);
> >
> >                 tsk->stack_vm_area = s;
> > +               kmemleak_alloc(s->addr, THREAD_SIZE, 2, GFP_ATOMIC);
> >                 local_irq_enable();
> >                 return s->addr;
> >         }
> > @@ -254,6 +255,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
> >                                 continue;
> >
> >                         this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
> > +                       kmemleak_free(tsk->stack);
> >                         local_irq_restore(flags);
> >                         return;
> >                 }
> > ------------8<------------------------------
> 
> I would do this differently.  The problem only affects cached stacks,
> right? How about kmemleal_ignoring when caching a stack and
> unignoring when uncaching a stack?

Indeed ;), that was my paragraph just below the patch:

> > A better solution I think would be something like kmemleak_ignore() on
> > the free path paired with a new kmemleak_unignore() in order to avoid
> > the freeing/re-allocation of the kmemleak metadata.

> Also, this needs a serious comment.  How about "kmemleak does not
> presently understand that a reference to a vm_area_struct implies a
> reference to the vmalloced memory"?

Yes, as most of the other kmemleak annotations.

> > A more complex solution (which I'm not yet convinced it works) may be to
> > pass the number of references of an objects down to the objects it
> > refers. In the vmalloc case we normally have a single reference to
> > vm_struct and two to the vmalloc'ed object. If a reference to the
> > vmalloc'ed object disappeared we could balance it with an increased
> > number of references to the vm_struct object. But this option requires
> > more research.
> 
> The idea being that, with this change, kmemleak would start to
> understand that surplus references to vm_area_struct cause it to think
> the memory itself is reachable?

That's the idea, possibly with a new kmemleak API specific to vmalloc
that would link the vm_struct object with the vmalloc'ed one. The
advantage is that we don't need to annotate the stack caching (that's
the only place that I'm aware of), though the actual implementation is a
bit more complex than kmemleak_ignore/unignore (and assuming the idea is
sane).

-- 
Catalin