linux-kernel - Re: [PATCH 00/40] Memory allocation profiling

Message-ID: <CAJuCfpEFV7ZB4pvnf6n0bVpTCDWCVQup9PtrHuAayrf3GrQskg@mail.gmail.com>
Date:   Wed, 3 May 2023 10:42:11 -0700
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Kent Overstreet <kent.overstreet@...ux.dev>,
        Michal Hocko <mhocko@...e.com>, akpm@...ux-foundation.org,
        vbabka@...e.cz, hannes@...xchg.org, roman.gushchin@...ux.dev,
        mgorman@...e.de, dave@...olabs.net, willy@...radead.org,
        liam.howlett@...cle.com, corbet@....net, void@...ifault.com,
        peterz@...radead.org, juri.lelli@...hat.com, ldufour@...ux.ibm.com,
        catalin.marinas@....com, will@...nel.org, arnd@...db.de,
        tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
        x86@...nel.org, peterx@...hat.com, david@...hat.com,
        axboe@...nel.dk, mcgrof@...nel.org, masahiroy@...nel.org,
        nathan@...nel.org, dennis@...nel.org, muchun.song@...ux.dev,
        rppt@...nel.org, paulmck@...nel.org, pasha.tatashin@...een.com,
        yosryahmed@...gle.com, yuzhao@...gle.com, dhowells@...hat.com,
        hughd@...gle.com, andreyknvl@...il.com, keescook@...omium.org,
        ndesaulniers@...gle.com, gregkh@...uxfoundation.org,
        ebiggers@...gle.com, ytcoode@...il.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        bristot@...hat.com, vschneid@...hat.com, cl@...ux.com,
        penberg@...nel.org, iamjoonsoo.kim@....com, 42.hyeyoo@...il.com,
        glider@...gle.com, elver@...gle.com, dvyukov@...gle.com,
        shakeelb@...gle.com, songmuchun@...edance.com, jbaron@...mai.com,
        rientjes@...gle.com, minchan@...gle.com, kaleshsingh@...gle.com,
        kernel-team@...roid.com, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
        linux-arch@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-modules@...r.kernel.org,
        kasan-dev@...glegroups.com, cgroups@...r.kernel.org
Subject: Re: [PATCH 00/40] Memory allocation profiling

On Wed, May 3, 2023 at 9:35 AM Tejun Heo <tj@...nel.org> wrote:
>
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (Meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.

Thanks for sharing, Tejun!

>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
>   deployed in the entire fleet and debug wherever the problem is.

Do you mean it has no runtime overhead when disabled?
If so, do you know what the overhead is when enabled? I want to
understand whether that's truly a viable solution for tracking all
allocations (including slab) all the time.
Thanks,
Suren.

>
> * It's fully flexible and programmable which enables non-trivial filtering
>   and summarizing to be done inside kernel w/ BPF as necessary, which is
>   pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
>   course but BPF's jit compiled code & its data structures are fast enough.
>   I don't remember any time this was a problem.
>
> Cons:
>
> * BPF has a learning curve. Also the fact that what it provides is a wide
>   open field rather than something scoped out for a specific problem can
>   make it seem a bit daunting at the beginning.
>
> * Because tracking starts when the script starts running, it doesn't know
>   anything which has happened up to that point, so you gotta pay attention
>   to e.g. handling frees which don't match allocs. It's kinda annoying
>   but not a huge problem usually. There are ways to build BPF progs into
>   the kernel and load them early but I haven't experimented with it yet
>   personally.
>
> I'm not necessarily against adding dedicated memory debugging mechanism but
> do wonder whether the extra benefits would be enough to justify the code and
> maintenance overhead.
>
> Oh, a bit of delta but for anyone who's more interested in debugging
> problems like this: I tend to go for bcc
> (https://github.com/iovisor/bcc) for this sort of problem. Others prefer to
> write against libbpf directly or use bpftrace
> (https://github.com/iovisor/bpftrace).
>
> Thanks.
>
> #!/usr/bin/env bcc-py
>
> import bcc
> import time
> import datetime
> import argparse
> import os
> import sys
> import errno
>
> description = """
> Record vmalloc/vfrees and trigger on unmatched vfree
> """
>
> bpf_source = """
> #include <uapi/linux/ptrace.h>
> #include <linux/vmalloc.h>
>
> struct vmalloc_rec {
>         unsigned long           ptr;
>         int                     last_alloc_stkid;
>         int                     last_free_stkid;
>         int                     this_stkid;
>         bool                    allocated;
> };
>
> BPF_STACK_TRACE(stacks, 8192);
> BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072);
> BPF_ARRAY(dup_free, struct vmalloc_rec, 1);
>
> int kpret_vmalloc_node_range(struct pt_regs *ctx)
> {
>         unsigned long ptr = PT_REGS_RC(ctx);
>         struct vmalloc_rec rec_init = { };
>         struct vmalloc_rec *rec;
>         int stkid;
>
>         if (!ptr)
>                 return 0;
>
>         stkid = stacks.get_stackid(ctx, 0);
>
>         rec_init.ptr = ptr;
>         rec_init.last_alloc_stkid = -1;
>         rec_init.last_free_stkid = -1;
>         rec_init.this_stkid = -1;
>
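>         /* Upstream bcc documents this helper as lookup_or_try_init();
>          * it can return NULL, so check before dereferencing. */
>         rec = vmallocs.lookup_or_try_init(&ptr, &rec_init);
>         if (!rec)
>                 return 0;
>         rec->allocated = true;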
>         rec->last_alloc_stkid = stkid;
>         return 0;
> }
>
> int kp_vfree(struct pt_regs *ctx, const void *addr)
> {
>         unsigned long ptr = (unsigned long)addr;
>         uint32_t zkey = 0;
>         struct vmalloc_rec rec_init = { };
>         struct vmalloc_rec *rec;
>         int stkid;
>
>         stkid = stacks.get_stackid(ctx, 0);
>
>         rec_init.ptr = ptr;
>         rec_init.last_alloc_stkid = -1;
>         rec_init.last_free_stkid = -1;
>         rec_init.this_stkid = -1;
>
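>         /* Upstream bcc documents this helper as lookup_or_try_init();
>          * it can return NULL, so check before dereferencing. */
>         rec = vmallocs.lookup_or_try_init(&ptr, &rec_init);
>         if (!rec)
>                 return 0;
>         /* Flag only pointers whose allocation we saw and that are already free. */
>         if (!rec->allocated && rec->last_alloc_stkid >= 0) {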
>                 rec->this_stkid = stkid;
>                 dup_free.update(&zkey, rec);
>         }
>
>         rec->allocated = false;
>         rec->last_free_stkid = stkid;
>         return 0;
> }
> """
>
> bpf = bcc.BPF(text=bpf_source)
> bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range")
> bpf.attach_kprobe(event="vfree", fn_name="kp_vfree")
> bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree")
>
> stacks = bpf["stacks"]
> vmallocs = bpf["vmallocs"]
> dup_free = bpf["dup_free"]
> last_dup_free_ptr = dup_free[0].ptr
>
> def print_stack(stkid):
>     for addr in stacks.walk(stkid):
>         sym = bpf.ksym(addr)
>         print('  {}'.format(sym))
>
> def print_dup(dup):
>     print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr)))
>     if (dup.last_alloc_stkid >= 0):
>         print('last_alloc_stack: ')
>         print_stack(dup.last_alloc_stkid)
>     if (dup.last_free_stkid >= 0):
>         print('last_free_stack: ')
>         print_stack(dup.last_free_stkid)
>     if (dup.this_stkid >= 0):
>         print('this_stack: ')
>         print_stack(dup.this_stkid)
>
> while True:
>     time.sleep(1)
>
>     if dup_free[0].ptr != last_dup_free_ptr:
>         print('\nDUP_FREE:')
>         print_dup(dup_free[0])
>         last_dup_free_ptr = dup_free[0].ptr
>
>
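For readers following along without a BPF setup, the bookkeeping that the
quoted script does in its two probes, including the "frees which don't match
allocs" case listed under Cons, can be modelled in plain Python. This is only
a sketch under stated assumptions: Rec, FreeTracker, on_alloc and on_free are
hypothetical names that do not appear in the script, the stack IDs are made-up
integers, and nothing here talks to the kernel.

```python
from dataclasses import dataclass

NO_STACK = -1  # same sentinel the quoted script uses for "no stack recorded"


@dataclass
class Rec:
    """Hypothetical user-space stand-in for struct vmalloc_rec."""
    ptr: int
    last_alloc_stkid: int = NO_STACK
    last_free_stkid: int = NO_STACK
    allocated: bool = False


class FreeTracker:
    """Models the alloc/free bookkeeping only; this is not BPF code."""

    def __init__(self):
        self.recs = {}       # ptr -> Rec, stands in for the BPF hash map
        self.dup_frees = []  # (ptr, alloc stack, previous free stack, this stack)

    def _rec(self, ptr):
        # Create the record on first sight, like lookup-or-init on the map.
        return self.recs.setdefault(ptr, Rec(ptr))

    def on_alloc(self, ptr, stkid):
        rec = self._rec(ptr)
        rec.allocated = True
        rec.last_alloc_stkid = stkid

    def on_free(self, ptr, stkid):
        rec = self._rec(ptr)
        # Report only if the allocation was observed and the pointer is
        # already marked free. A pointer allocated before tracking started
        # still has last_alloc_stkid == NO_STACK, so its free is not reported.
        if not rec.allocated and rec.last_alloc_stkid >= 0:
            self.dup_frees.append(
                (ptr, rec.last_alloc_stkid, rec.last_free_stkid, stkid)
            )
        rec.allocated = False
        rec.last_free_stkid = stkid


tracker = FreeTracker()
tracker.on_free(0x1000, stkid=7)   # allocated before tracking began: ignored
tracker.on_alloc(0x2000, stkid=1)
tracker.on_free(0x2000, stkid=2)   # matching free
tracker.on_free(0x2000, stkid=3)   # second free of the same pointer: reported
print(tracker.dup_frees)           # [(8192, 1, 2, 3)]
```

The check on last_alloc_stkid is what keeps pre-existing allocations from
producing false reports; the cost is that a genuine double free of such a
pointer goes unnoticed until it has been allocated again under tracking.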