linux-kernel - Re: [RFC PATCH] binfmt_elf: Dump smaller VMAs first in ELF cores

Message-ID: <230E81B0-A0BD-44B5-B354-3902DB50D3D0@juniper.net>
Date: Mon, 5 Aug 2024 18:44:44 +0000
From: Brian Mak <makb@...iper.net>
To: Kees Cook <kees@...nel.org>
CC: "Eric W. Biederman" <ebiederm@...ssion.com>,
        Oleg Nesterov <oleg@...hat.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>,
        Jan Kara <jack@...e.cz>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] binfmt_elf: Dump smaller VMAs first in ELF cores

On Aug 5, 2024, at 10:25 AM, Kees Cook <kees@...nel.org> wrote:

> On Thu, Aug 01, 2024 at 05:58:06PM +0000, Brian Mak wrote:
>> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@...ssion.com> wrote:
>>> One practical concern with this approach is that I think the ELF
>>> specification says that program headers should be written in memory
>>> order.  So a comment on your testing to see if gdb or rr or any of
>>> the other debuggers that read core dumps cares would be appreciated.
>> 
>> I've already tested readelf and gdb on core dumps (truncated and whole)
>> generated with this patch, and both tools are able to read and use these
>> core dumps and produce a proper backtrace.
> 
> Can you compare the "rr" selftest before/after the patch? They have been
> the most sensitive to changes to ELF, ptrace, seccomp, etc, so I've
> tried to double-check "user visible" changes with their tree. :)

Hi Kees,

Thanks for your reply!

Can you please give me some more information on these selftests? What
are they, and where can I find them? I'm not too familiar with rr.

>>> Since your concern is about stacks, and the kernel has information about
>>> stacks it might be worth using that information explicitly when sorting
>>> vmas, instead of just assuming stacks will be small.
>> 
>> This was originally the approach that we explored, but ultimately moved
>> away from. We need more than just stacks to form a proper backtrace. I
>> didn't narrow down exactly what it was that we needed because the sorting
>> solution seemed to be cleaner than trying to narrow down each of these
>> pieces that we'd need. At the very least, we need information about shared
>> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
>> third piece sitting in an anonymous R/W VMA, which is the point at which I
>> stopped exploring this path. I was having a difficult time narrowing down
>> what this last piece was.
> 
> And those VMAs weren't thread stacks?

Admittedly, I did do all of this exploration months ago, and only have
my notes to go off of here, but no, they should not have been thread
stacks since I had pulled all of them in during a "first pass".

>> Please let me know your thoughts!
> 
> I echo all of Eric's comments, especially the "let's make this the
> default if we can". My only bit of discomfort with making this change
> is that it falls into the "it happens to work" case, and we don't really
> understand _why_ it works for you. :)

Yep, the "let's make this the default" change is already in v2. v3 will
be out shortly with the change to sort in place rather than in a second
copy of the VMA list.
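
To illustrate what "sort in place" means here, below is a minimal
userspace sketch, not the patch itself. It assumes a hypothetical
struct vma_sketch with a dump_size field standing in for the per-VMA
metadata, and uses the C library's qsort() on the existing array rather
than building a second, sorted copy. All names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-VMA dump metadata; not the kernel's struct. */
struct vma_sketch {
	unsigned long start;     /* first address of the mapping */
	unsigned long end;       /* one past the last address */
	unsigned long dump_size; /* bytes that would be written to the core */
};

/* Order by dump_size, ascending. Compare explicitly to avoid overflow. */
static int cmp_dump_size(const void *a, const void *b)
{
	const struct vma_sketch *x = a;
	const struct vma_sketch *y = b;

	if (x->dump_size < y->dump_size)
		return -1;
	if (x->dump_size > y->dump_size)
		return 1;
	return 0;
}

/* Reorder the caller's array directly; no second copy of the list is made. */
static void sort_vmas_smallest_first(struct vma_sketch *v, size_t n)
{
	assert(v != NULL || n == 0);
	if (n > 1)
		qsort(v, n, sizeof(*v), cmp_dump_size);
}
```

With the smallest entries first, a dump that is cut short loses the
largest regions rather than the small ones. Note that qsort() is not
guaranteed to be stable, so entries of equal size may end up in any
relative order.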

> It does also feel like part of the overall problem is that systemd
> doesn't have a way to know the process is crashing, and then creates the
> truncation problem. (i.e. we're trying to use the kernel to work around
> a visibility issue in userspace.)

Even if systemd had visibility into the fact that a crash is happening,
there's not much systemd can do in some circumstances. In applications
with strict time to recovery limits, the process needs to restart within
a certain time limit. We run into an issue similar to the one I raised
in my last reply on this thread: to keep the core dump intact and
recover, we either need to start up a new process while the old one is
core dumping, or wait until core dumping is complete to restart.

If we start up a new process while the old one is core dumping, we risk
system stability in applications with a large memory footprint since we
could run out of memory from the duplication of memory consumption. If
we wait until core dumping is complete to restart, we're in the same
scenario as before with the core being truncated or we miss recovery
time objectives by waiting too long.

For this reason, I wouldn't say we're using the kernel to work around a
visibility issue or that systemd is creating the truncation problem, but
rather that the issue exists due to limitations in how we're truncating
cores. That being said, there might be some use in this type of
visibility for others with less strict recovery time objectives or
applications with a lower memory footprint.

Best,
Brian Mak

> All this said, if it doesn't create problems for gdb and rr, I would be
> fine to give it a shot.
> 
> -Kees
> 
> --
> Kees Cook