lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZFKlrP7nLn93iIRf@slm.duckdns.org>
Date:   Wed, 3 May 2023 08:19:24 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Kent Overstreet <kent.overstreet@...ux.dev>,
        Michal Hocko <mhocko@...e.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        akpm@...ux-foundation.org, vbabka@...e.cz,
        roman.gushchin@...ux.dev, mgorman@...e.de, dave@...olabs.net,
        willy@...radead.org, liam.howlett@...cle.com, corbet@....net,
        void@...ifault.com, peterz@...radead.org, juri.lelli@...hat.com,
        ldufour@...ux.ibm.com, catalin.marinas@....com, will@...nel.org,
        arnd@...db.de, tglx@...utronix.de, mingo@...hat.com,
        dave.hansen@...ux.intel.com, x86@...nel.org, peterx@...hat.com,
        david@...hat.com, axboe@...nel.dk, mcgrof@...nel.org,
        masahiroy@...nel.org, nathan@...nel.org, dennis@...nel.org,
        muchun.song@...ux.dev, rppt@...nel.org, paulmck@...nel.org,
        pasha.tatashin@...een.com, yosryahmed@...gle.com,
        yuzhao@...gle.com, dhowells@...hat.com, hughd@...gle.com,
        andreyknvl@...il.com, keescook@...omium.org,
        ndesaulniers@...gle.com, gregkh@...uxfoundation.org,
        ebiggers@...gle.com, ytcoode@...il.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        bristot@...hat.com, vschneid@...hat.com, cl@...ux.com,
        penberg@...nel.org, iamjoonsoo.kim@....com, 42.hyeyoo@...il.com,
        glider@...gle.com, elver@...gle.com, dvyukov@...gle.com,
        shakeelb@...gle.com, songmuchun@...edance.com, jbaron@...mai.com,
        rientjes@...gle.com, minchan@...gle.com, kaleshsingh@...gle.com,
        kernel-team@...roid.com, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
        linux-arch@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-modules@...r.kernel.org,
        kasan-dev@...glegroups.com, cgroups@...r.kernel.org,
        Alexei Starovoitov <ast@...nel.org>,
        Andrii Nakryiko <andrii@...nel.org>
Subject: Re: [PATCH 00/40] Memory allocation profiling

Hello,

On Wed, May 03, 2023 at 02:07:26PM -0400, Johannes Weiner wrote:
...
> > * Because tracking starts when the script starts running, it doesn't know
> >   anything which has happened upto that point, so you gotta pay attention to
> >   handling e.g. handling frees which don't match allocs. It's kinda annoying
> >   but not a huge problem usually. There are ways to build in BPF progs into
> >   the kernel and load it early but I haven't experiemnted with it yet
> >   personally.
> 
> Yeah, early loading is definitely important, especially before module
> loading etc.
> 
> One common usecase is that we see a machine in the wild with a high
> amount of kernel memory disappearing somewhere that isn't voluntarily
> reported in vmstat/meminfo. Reproducing it isn't always
> practical. Something that records early and always (with acceptable
> runtime overhead) would be the holy grail.
> 
> Matching allocs to frees is doable using the pfn as the key for pages,
> and virtual addresses for slab objects.
> 
> The biggest issue I had when I tried with bpf was losing updates to
> the map. IIRC there is some trylocking going on to avoid deadlocks
> from nested contexts (alloc interrupted, interrupt frees). It doesn't
> sound like an unsolvable problem, though.

(cc'ing Alexei and Andrii)

This is the same thing that I hit with sched_ext. BPF plugged it for
struct_ops but I wonder whether it can be done for specific maps / progs -
ie. just declare that a given map or prog is not to be accessed from NMI and
bypass the trylock deadlock avoidance mechanism. But, yeah, this should be
addressed from BPF side.

> Another minor thing was the stack trace map exploding on a basically
> infinite number of unique interrupt stacks. This could probably also
> be solved by extending the trace extraction API to cut the frames off
> at the context switch boundary.
> 
> Taking a step back though, given the multitude of allocation sites in
> the kernel, it's a bit odd that the only accounting we do is the tiny
> fraction of voluntary vmstat/meminfo reporting. We try to cover the
> biggest consumers with this of course, but it's always going to be
> incomplete and is maintenance overhead too. There are on average
> several gigabytes in unknown memory (total - known vmstats) on our
> machines. It's difficult to detect regressions easily. And it's per
> definition the unexpected cornercases that are the trickiest to track
> down. So it might be doable with BPF, but it does feel like the kernel
> should do a better job of tracking out of the box and without
> requiring too much plumbing and somewhat fragile kernel allocation API
> tracking and probing from userspace.

Yeah, easy / default visibility argument does make sense to me.

Thanks.

-- 
tejun

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ