Message-ID: <5870D095-D47F-447F-A079-B32D9C415124@juniper.net>
Date: Thu, 20 Feb 2025 22:59:06 +0000
From: Brian Mak <makb@...iper.net>
To: Kees Cook <kees@...nel.org>
CC: Linus Torvalds <torvalds@...ux-foundation.org>, Jan Kara <jack@...e.cz>,
Michael Stapelberg <michael@...pelberg.ch>,
Christian Brauner <brauner@...nel.org>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Oleg Nesterov <oleg@...hat.com>,
Alexander Viro <viro@...iv.linux.org.uk>
Subject: Re: [PATCH v3] binfmt_elf: Dump smaller VMAs first in ELF cores
On Feb 19, 2025, at 5:36 PM, Kees Cook <kees@...nel.org> wrote:
>> We already have the code to count how big the core dump is, it's that
>>
>> cprm->vma_data_size += m->dump_size;
>>
>> in dump_vma_snapshot() thing, so I think this could all basically be a
>> one-liner that does the sort() call only if that vma_data_size is
>> larger than the core-dump limit, or something like that?
...
> Oh! That's a good idea. In theory, a truncated dump is going to be
> traditionally "unusable", so a sort shouldn't hurt tools that are
> expecting a complete dump.
>
> Brian, are you able to test this for your case?
Hi Kees and Linus,
I like the idea, but the suggested patch seems to have issues in
practice.
First, the vma_data_size does not include the ELF header, program header
table, notes, etc. That's not a terribly big issue, though; we can live
with it, since it only makes the estimated core dump size *smaller*. An
even bigger problem is that the vma_data_size doesn't take the
sparseness of the core dump into account, while the core dump size
limit does. If a generated core dump is very sparse, the vma_data_size
will be much larger than the core dump's actual on-disk size,
triggering the sorting logic earlier than expected.
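As an aside, the mismatch is easy to see from userspace. Here's a
minimal sketch (for illustration only, not part of the patch) that
compares a core file's logical size against its on-disk footprint; for
a sparse core the two can differ by orders of magnitude, which is
exactly the gap between vma_data_size and what the limit is checked
against:

/*
 * sparse_size.c: compare a file's logical size (st_size) with its
 * on-disk footprint (st_blocks, in 512-byte units). For a sparse
 * core dump, the logical size can be much larger.
 *
 * Build: cc -o sparse_size sparse_size.c
 * Usage: ./sparse_size core
 */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st)) {
		perror("stat");
		return 1;
	}
	printf("logical size: %lld bytes\n", (long long)st.st_size);
	printf("on disk:      %lld bytes\n", (long long)st.st_blocks * 512LL);
	return 0;
}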
One thing we can do, though, is iterate through the pages of each VMA
and check whether get_dump_page() returns NULL, i.e. whether the page
would be skipped and written as a hole. We can then use that
information to calculate a more accurate predicted core dump size.
Patch is below. Thoughts?
Best,
Brian
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index a43b78b4b646..dd49a89a62d3 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -212,6 +212,17 @@ pid>/``).
This value defaults to 0.
+core_sort_vma
+=============
+
+The default coredump writes VMAs in address order. By setting
+``core_sort_vma`` to 1, VMAs will be written from smallest size
+to largest size. This is known to break at least elfutils, but
+can be handy when dealing with very large (and truncated)
+coredumps where the more useful debugging details are included
+in the smaller VMAs.
+
+
core_uses_pid
=============
diff --git a/fs/coredump.c b/fs/coredump.c
index 591700e1b2ce..496cc7234aa7 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -63,6 +63,7 @@ static void free_vma_snapshot(struct coredump_params *cprm);
static int core_uses_pid;
static unsigned int core_pipe_limit;
+static unsigned int core_sort_vma;
static char core_pattern[CORENAME_MAX_SIZE] = "core";
static int core_name_size = CORENAME_MAX_SIZE;
unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
@@ -1026,6 +1027,15 @@ static const struct ctl_table coredump_sysctls[] = {
.extra1 = (unsigned int *)&core_file_note_size_min,
.extra2 = (unsigned int *)&core_file_note_size_max,
},
+ {
+ .procname = "core_sort_vma",
+ .data = &core_sort_vma,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
};
static int __init init_fs_coredump_sysctls(void)
@@ -1204,6 +1214,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
struct mm_struct *mm = current->mm;
VMA_ITERATOR(vmi, mm, 0);
int i = 0;
+ size_t sparse_vma_dump_size = 0;
/*
* Once the stack expansion code is fixed to not change VMA bounds
@@ -1241,6 +1252,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
for (i = 0; i < cprm->vma_count; i++) {
struct core_vma_metadata *m = cprm->vma_meta + i;
+ unsigned long addr;
if (m->dump_size == DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER) {
char elfmag[SELFMAG];
@@ -1254,10 +1266,27 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
}
cprm->vma_data_size += m->dump_size;
+ sparse_vma_dump_size += m->dump_size;
+
+ /* Subtract pages that will be dumped as holes from sparse_vma_dump_size. */
+ for (addr = m->start; addr < m->start + m->dump_size; addr += PAGE_SIZE) {
+ struct page *page = get_dump_page(addr);
+
+ if (!page)
+ sparse_vma_dump_size -= PAGE_SIZE;
+ else
+ put_page(page);
+ }
}
- sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
- cmp_vma_size, NULL);
+ /*
+ * Only sort the vmas by size if:
+ * a) the sysctl is set to do so, or
+ * b) the vmas don't all fit within the core dump size limit.
+ */
+ if (core_sort_vma || sparse_vma_dump_size >= cprm->limit)
+ sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
+ cmp_vma_size, NULL);
return true;
}
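For completeness, here's roughly how the new sysctl could be exercised
by hand (a sketch, assuming the patch above is applied and the process
can write to /proc/sys/kernel/core_sort_vma): enable sorting, shrink
RLIMIT_CORE so the dump gets truncated, then abort() to generate a
core.

/*
 * core_sort_test.c: enable core_sort_vma, constrain RLIMIT_CORE so
 * the dump is truncated, then abort() to produce a core.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	struct rlimit rl = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };
	int fd = open("/proc/sys/kernel/core_sort_vma", O_WRONLY);

	if (fd < 0 || write(fd, "1", 1) != 1) {
		perror("core_sort_vma");
		return 1;
	}
	close(fd);

	/* Limit the core to 1 MiB so larger VMAs get truncated. */
	if (setrlimit(RLIMIT_CORE, &rl)) {
		perror("setrlimit");
		return 1;
	}
	abort(); /* dumps core; smaller VMAs should be written first */
}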