Message-ID: <20240214062020.GA989328@cmpxchg.org>
Date: Wed, 14 Feb 2024 01:20:20 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: akpm@...ux-foundation.org, kent.overstreet@...ux.dev, mhocko@...e.com,
	vbabka@...e.cz, roman.gushchin@...ux.dev, mgorman@...e.de,
	dave@...olabs.net, willy@...radead.org, liam.howlett@...cle.com,
	corbet@....net, void@...ifault.com, peterz@...radead.org,
	juri.lelli@...hat.com, catalin.marinas@....com, will@...nel.org,
	arnd@...db.de, tglx@...utronix.de, mingo@...hat.com,
	dave.hansen@...ux.intel.com, x86@...nel.org, peterx@...hat.com,
	david@...hat.com, axboe@...nel.dk, mcgrof@...nel.org,
	masahiroy@...nel.org, nathan@...nel.org, dennis@...nel.org,
	tj@...nel.org, muchun.song@...ux.dev, rppt@...nel.org,
	paulmck@...nel.org, pasha.tatashin@...een.com,
	yosryahmed@...gle.com, yuzhao@...gle.com, dhowells@...hat.com,
	hughd@...gle.com, andreyknvl@...il.com, keescook@...omium.org,
	ndesaulniers@...gle.com, vvvvvv@...gle.com,
	gregkh@...uxfoundation.org, ebiggers@...gle.com, ytcoode@...il.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, bristot@...hat.com,
	vschneid@...hat.com, cl@...ux.com, penberg@...nel.org,
	iamjoonsoo.kim@....com, 42.hyeyoo@...il.com, glider@...gle.com,
	elver@...gle.com, dvyukov@...gle.com, shakeelb@...gle.com,
	songmuchun@...edance.com, jbaron@...mai.com, rientjes@...gle.com,
	minchan@...gle.com, kaleshsingh@...gle.com, kernel-team@...roid.com,
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
	iommu@...ts.linux.dev, linux-arch@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
	linux-modules@...r.kernel.org, kasan-dev@...glegroups.com,
	cgroups@...r.kernel.org
Subject: Re: [PATCH v3 00/35] Memory allocation profiling

I'll do a more thorough code review, but before the discussion gets
too sidetracked, I wanted to add my POV on the overall merit of the
direction being proposed here.

I have backported and used this code for debugging production issues
before. Logging into a random host with an unfamiliar workload and
being able to get a reliable, comprehensive list of kernel memory
consumers is one of the coolest things I have seen in a long
time. This is a huge improvement to sysadmin quality of life.
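
To make that concrete: in the kernel I ran, the per-callsite totals
show up in /proc/allocinfo, one line per allocation site. A small
userspace helper along these lines (my own, not part of the series;
the parsing assumes data lines start with "<bytes> <calls>", which
may differ between versions of the interface) is all it takes to get
a sorted list of consumers:

/* top_allocinfo.c: print the 20 biggest entries in /proc/allocinfo */
#include <stdio.h>
#include <stdlib.h>

struct entry {
	unsigned long long bytes, calls;
	char tag[256];
};

static struct entry entries[1 << 16];

static int cmp_desc(const void *a, const void *b)
{
	const struct entry *ea = a, *eb = b;

	return (ea->bytes < eb->bytes) - (ea->bytes > eb->bytes);
}

int main(void)
{
	FILE *f = fopen("/proc/allocinfo", "r");
	char line[512];
	size_t n = 0;

	if (!f) {
		perror("/proc/allocinfo");
		return 1;
	}
	/* Data lines start with a number; headers fail the sscanf. */
	while (n < (1 << 16) && fgets(line, sizeof(line), f)) {
		struct entry *e = &entries[n];

		if (sscanf(line, "%llu %llu %255[^\n]",
			   &e->bytes, &e->calls, e->tag) == 3)
			n++;
	}
	fclose(f);

	qsort(entries, n, sizeof(*entries), cmp_desc);
	for (size_t i = 0; i < n && i < 20; i++)
		printf("%12llu %8llu %s\n", entries[i].bytes,
		       entries[i].calls, entries[i].tag);
	return 0;
}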

It's also a huge improvement for MM developers. We're the first points
of contact for memory regressions that can be caused by pretty much
any driver or subsystem in the kernel.

I encourage anybody who is undecided on whether this is worth doing to
build a kernel with these patches applied and run it on their own
machine. I think you'll be surprised by what you find - and by how
myopic and uninformative /proc/meminfo feels in comparison. Did you
know there is a lot more to modern filesystems than the VFS objects we
are currently tracking? :)

Then imagine what this looks like on a production host running a
complex mix of filesystems, enterprise networking, BPF programs, GPUs,
accelerators, etc.

Backporting the code to a slightly older production kernel wasn't too
difficult. The instrumentation layering is explicit, clean, and fairly
centralized, so resolving minor conflicts around the _noprof renames
and the wrappers was straightforward.
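
For those who haven't read the series: the shape of that layering, as
I understand it, is roughly the following. This is a compilable toy
model, not the real code - the actual macros live in
include/linux/alloc_tag.h, use per-cpu counters, and stash the tag in
the task struct; current_alloc_tag and the counter fields here are my
own simplifications.

#include <stdio.h>
#include <stdlib.h>

struct alloc_tag {
	const char *file;
	int line;
	unsigned long long bytes, calls; /* per-cpu in the real code */
};

/* The "current tag" the allocator core charges. */
static struct alloc_tag *current_alloc_tag;

/* Each callsite gets a static tag; the wrapper makes it current
 * around the call, so individual callsites need no editing. */
#define alloc_hooks(fn_call)						\
({									\
	static struct alloc_tag _tag = { __FILE__, __LINE__, 0, 0 };	\
	struct alloc_tag *_old = current_alloc_tag;			\
	void *_ret;							\
									\
	current_alloc_tag = &_tag;					\
	_ret = (fn_call);						\
	current_alloc_tag = _old;					\
	_ret;								\
})

/* The _noprof rename: the old implementation is untouched except
 * that it charges whatever tag is current. */
static void *kmalloc_noprof(size_t size)
{
	if (current_alloc_tag) {
		current_alloc_tag->bytes += size;
		current_alloc_tag->calls++;
	}
	return malloc(size);
}

/* The old entry point becomes a one-line wrapper - these one-liners
 * are where most of the backport conflicts showed up. */
#define kmalloc(size) alloc_hooks(kmalloc_noprof(size))

int main(void)
{
	void *p = kmalloc(128);	/* charged to this file:line */

	free(p);
	return 0;
}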

When we talk about maintenance cost, a fair shake would be to weigh it
against the cost and reliability of our current method: evaluating
consumers in the kernel on a case-by-case basis and annotating the
alloc/free sites by hand; then quibbling with the MM community about
whether that consumer is indeed significant enough to warrant an entry
in /proc/meminfo, and what the catchiest name for the stat would be.
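
To illustrate the status quo (all names invented for the example):
every subsystem that wants its memory visible has to carry something
like the snippet below, remember to touch the counter at every
alloc/free pair, and then separately negotiate a /proc/meminfo hunk
for it.

#include <linux/atomic.h>
#include <linux/slab.h>

static atomic_long_t foo_cache_bytes = ATOMIC_LONG_INIT(0);

static void *foo_cache_alloc(size_t size, gfp_t gfp)
{
	void *p = kmalloc(size, gfp);

	if (p)
		atomic_long_add(size, &foo_cache_bytes);
	return p;
}

static void foo_cache_free(void *p, size_t size)
{
	atomic_long_sub(size, &foo_cache_bytes);
	kfree(p);
}

Multiply that by every consumer worth tracking, and compare it with
annotating the handful of allocator entry points once.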

I think we can agree that this is vastly less scalable and more
burdensome than central annotations around a handful of mostly static
allocator entry points - especially considering the rate of change in
the kernel as a whole, and that not everybody writing a random driver
will think of the comprehensive MM picture. And I think that's
generous - we don't even have the network stack in meminfo.

So I think what we do now isn't working. In the Meta fleet, at any
given time the p50 for unaccounted kernel memory is several gigabytes
per host. The p99 is between 15% and 30% of total memory. That's a
looot of opaque resource usage we have to accept on faith.

For hunting down regressions, all it takes is one untracked consumer
in the kernel to really throw a wrench into things. It's difficult to
find in the noise with tracing, and if it's not growing after an
initial allocation spike, you're pretty much out of luck finding it at
all. Raise your hand if you've written a drgn script to walk pfns and
try to guess consumers from the state of struct page :)

I agree we should discuss how the annotations are implemented on a
technical basis, but my take is that we need something like this.

In a codebase of our size, I don't think the allocator should be
handing out memory without some basic implied tracking of where it's
going. It's a liability for production environments, and it can hide
bad memory management decisions in drivers and other subsystems for a
very long time.
