netdev - kmemleak reports related to napi_get_frags

Message-ID: <f4539cfe06be027d210d2af121c37d1c67a6a53a.camel@redhat.com>
Date: Fri, 08 Nov 2024 14:40:24 -0500
From: Radu Rendec <rrendec@...hat.com>
To: linux-mm@...ck.org, netdev@...r.kernel.org
Subject: kmemleak reports related to napi_get_frags_check

Hi everyone,

I'm investigating some kmemleak reports that are related to the
napi_get_frags_check() function. Similar issues have been reported
before in [1] and [2], and the upper part of the stack trace, starting
at gro_cells_init(), is identical in my case.

I am pretty sure this is a kmemleak false-positive, which is not
surprising, and I am approaching this from a different perspective -
trying to understand how a false-positive in this particular case is
even possible. So far, I have been unsuccessful. As Eric Dumazet
pointed out in his reply to [1], napi_get_frags_check() is very
self-contained. It allocates an skb and then immediately frees it.

I would appreciate it if anyone could offer any insights or new ideas
to try to explain this behavior. Again, this is not about fixing the
networking code (because I believe there's nothing to fix there) but
rather finding a solid explanation for how the kmemleak report is
possible. That might lead to either direct (code) or indirect (usage)
improvements to kmemleak.

My understanding is that kmemleak immediately removes an object from
its internal list of tracked objects upon deallocation of the object.
It also has a built-in object age threshold of 5 seconds before it
reports a leak, specifically to avoid false-positives when pointers to
the allocated objects are in flight and/or temporarily stored in CPU
registers. Since in this case the deallocation is done immediately
after the allocation and it's unconditional, I can't even imagine how
it can escape the object age guard check.

For the record, this is the kmemleak report that I'm seeing:

unreferenced object 0xffff4fc0425ede40 (size 240):
  comm "(ostnamed)", pid 25664, jiffies 4296402173
  hex dump (first 32 bytes):
    e0 99 5f 27 c1 4f ff ff 40 c3 5e 42 c0 4f ff ff  .._'.O..@....O..
    00 c0 24 15 c0 4f ff ff 00 00 00 00 00 00 00 00  ..$..O..........
  backtrace (crc 1f19ed80):
    [<ffffbc229bc23c04>] kmemleak_alloc+0xb4/0xc4
    [<ffffbc229a16cfcc>] slab_post_alloc_hook+0xac/0x120
    [<ffffbc229a172608>] kmem_cache_alloc_bulk+0x158/0x1a0
    [<ffffbc229b645e18>] napi_skb_cache_get+0xe8/0x160
    [<ffffbc229b64af64>] __napi_build_skb+0x24/0x60
    [<ffffbc229b650240>] napi_alloc_skb+0x17c/0x2dc
    [<ffffbc229b76c65c>] napi_get_frags+0x5c/0xb0
    [<ffffbc229b65b3e8>] napi_get_frags_check+0x38/0xb0
    [<ffffbc229b697794>] netif_napi_add_weight+0x4f0/0x84c
    [<ffffbc229b7d2704>] gro_cells_init+0x1a4/0x2d0
    [<ffffbc2250d8553c>] ip_tunnel_init+0x19c/0x660 [ip_tunnel]
    [<ffffbc2250e020c0>] ipip_tunnel_init+0xe0/0x110 [ipip]
    [<ffffbc229b6c5480>] register_netdevice+0x440/0xea4
    [<ffffbc2250d846b0>] __ip_tunnel_create+0x280/0x444 [ip_tunnel]
    [<ffffbc2250d88978>] ip_tunnel_init_net+0x264/0x42c [ip_tunnel]
    [<ffffbc2250e02150>] ipip_init_net+0x30/0x40 [ipip] 

The obvious test, which I already did, is to create/delete ip tunnel
interfaces in a loop. I let this test run for more than 24 hours, and
kmemleak did *not* detect anything. I also attached a kprobe inside
napi_skb_cache_get() right after the call to kmem_cache_alloc_bulk(),
and successfully verified that the allocation path is indeed exercised
by the test, i.e., the skb is *not* always returned from the per-cpu
napi cache pool. In other words, I was unable to find a way to
reproduce these kmemleak reports.

It is worth noting that in the case of a "manually" created tunnel
using `ip tunnel add ... mode ipip ...`, the lower part of the stack is
different from the kmemleak report (see below). But I don't think this
can affect the skb allocation or pointer handling behavior, and the
upper part of the stack, starting at register_netdevice(), is identical
anyway.

comm: [ip], pid: 101422
        ip_tunnel_init+0
        register_netdevice+1088
        __ip_tunnel_create+640
        ip_tunnel_ctl+956
        ipip_tunnel_ctl+380
        ip_tunnel_siocdevprivate+212
        dev_ifsioc+1096
        dev_ioctl+348
        sock_ioctl+1760
        __arm64_sys_ioctl+288
        invoke_syscall.constprop.0+216
        do_el0_svc+344
        el0_svc+84
        el0t_64_sync_handler+308
        el0t_64_sync+380

Another thing I did consider is whether kmemleak is likely to be
confused by per-cpu allocations, since gcells->cells is per-cpu
allocated in gro_cells_init(). I created a simple test kernel module
that did a similar per-cpu allocation, and I did *not* notice any
problem with kmemleak being able to track dynamically allocated blocks
that are referenced through per-cpu pointers.

One final note is that the reports in [1] seem to have been observed on
x86_64 (judging by the presence of entry_SYSCALL_64_after_hwframe in
the stack trace), while mine were observed on aarch64. So, whatever the
root cause behind these kmemleak reports is, it seems to be
architecture-independent.

Thanks in advance,
Radu Rendec

[1] https://lore.kernel.org/all/YwkH9zTmLRvDHHbP@krava/
[2] https://lore.kernel.org/all/1667213123-18922-1-git-send-email-wangyufen@huawei.com/