linux-kernel - Re: kmemleak splat on copy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170516133925.GA9453@e104818-lin.cambridge.arm.com>
Date:   Tue, 16 May 2017 14:39:27 +0100
From:   Catalin Marinas <catalin.marinas@....com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     "Luis R. Rodriguez" <mcgrof@...nel.org>,
        Andy Lutomirski <luto@...capital.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...nel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Mateusz Guzik <mguzik@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: kmemleak splat on copy_process()

Thanks for cc'ing me. The vmalloc allocations have always been tricky
for kmemleak since there are 2-3 other memory locations with the same
value as the vmalloc'ed object: vm_struct.addr and vmap_area.va_start;
occasionally we have vmap_area.va_end pointing to the next
vmap_area.va_start.

To have a more reliable object reference counting, kmemleak avoids
scanning most of the vmap_area structure (apart from vmap_area.vm) and
requires a minimum count of 2 references for a vmalloc'ed object to not
be considered a leak (that is vm_struct.addr and the caller's location,
in this case task->stack).

> On Mon 15-05-17 23:53:18, Luis R. Rodriguez wrote:
> > root@...gy:~# cat /sys/kernel/debug/kmemleak
> > unreferenced object 0xffffa07500d4c000 (size 16384):
> >   comm "bash", pid 1349, jiffies 4294895999 (age 263.204s)
> >   hex dump (first 32 bytes):
> >     9d 6e ac 57 00 00 00 00 00 00 00 00 00 00 00 00  .n.W............
> >     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> >   backtrace:
> >     [<ffffffffa5464cca>] kmemleak_alloc+0x4a/0xa0
> >     [<ffffffffa4fdfe6c>] __vmalloc_node_range+0x20c/0x2b0
> >     [<ffffffffa4e7d1a2>] copy_process.part.37+0x5c2/0x1af0
> >     [<ffffffffa4e7e89f>] _do_fork+0xcf/0x390
> >     [<ffffffffa4e7ec09>] SyS_clone+0x19/0x20
> >     [<ffffffffa4e03b0b>] do_syscall_64+0x5b/0xc0
> >     [<ffffffffa547072b>] return_from_SYSCALL_64+0x0/0x6a
> >     [<ffffffffffffffff>] 0xffffffffffffffff

This stack here was allocated by bash for a child process via
alloc_thread_stack_node() when CONFIG_VMAP_STACK is enabled. When the
child died, free_stack_thread() did not call vfree_atomic() but rather
stored the corresponding vm_struct pointer in cached_stacks[i]. However,
the original task->stack pointer was lost (child task_struct freed), so
when scanning the memory kmemleak can only find a single reference to
the original stack (via vm_struct cached_stacks[i]) rather than 2 as
required for vmalloc'ed objects.

BTW, you can get more info about a specific object, including the number
of references found, with:

  echo dump=0xffffa07500d4c000 > /sys/kernel/debug/kmemleak

So basically kernel/fork.c has its own thread stack allocator on top of
vmalloc and kmemleak gets confused. A quick workaround:

------------8<------------------------------
diff --git a/kernel/fork.c b/kernel/fork.c
index 06d759ab4c62..785907fccf67 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -214,6 +214,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 		this_cpu_write(cached_stacks[i], NULL);

 		tsk->stack_vm_area = s;
+		kmemleak_alloc(s->addr, THREAD_SIZE, 2, GFP_ATOMIC);
 		local_irq_enable();
 		return s->addr;
 	}
@@ -254,6 +255,7 @@ static inline void free_thread_stack(struct task_struct *tsk)
 				continue;

 			this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
+			kmemleak_free(tsk->stack);
 			local_irq_restore(flags);
 			return;
 		}
------------8<------------------------------

A better solution I think would be something like kmemleak_ignore() on
the free path paired with a new kmemleak_unignore() in order to avoid
the freeing/re-allocation of the kmemleak metadata.

A more complex solution (which I'm not yet convinced it works) may be to
pass the number of references of an objects down to the objects it
refers. In the vmalloc case we normally have a single reference to
vm_struct and two to the vmalloc'ed object. If a reference to the
vmalloc'ed object disappeared we could balance it with an increased
number of references to the vm_struct object. But this option requires
more research.

Unless there are better ideas, I'll post a patch adding
kmemleak_ignore/unignore annotations to kerne/fork.c (and the
corresponding mm/kmemleak.c changes).

Thanks.

-- 
Catalin