Message-ID: <SJ0PR21MB13117BB925ABFD8857CAA5B5C45B9@SJ0PR21MB1311.namprd21.prod.outlook.com>
Date: Fri, 21 Jan 2022 01:31:44 +0000
From: Bill Messmer <wmessmer@...rosoft.com>
To: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Issue With Kernel Changes To Core Dump Collection (Kernel Bug...?)
Hello,
It has been my understanding for some time that the kernel config option CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS (and the corresponding bit 4 of the coredump filter) was originally added to ensure that the GNU build-id of ELF objects is included in core dumps. The option's description in Kconfig.binfmt even alludes to this.
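As an illustration of the filter bit mentioned above, here is a minimal Python sketch that tests bit 4 of a coredump filter value. It assumes the value is read from /proc/<pid>/coredump_filter as hexadecimal text; the function names are hypothetical and chosen only for this example.

```python
ELF_HEADERS_BIT = 4  # bit 4 of the coredump filter, as referenced above


def elf_headers_enabled(filter_text: str) -> bool:
    """Return True if bit 4 is set in a coredump filter value given as hex text."""
    value = int(filter_text.strip(), 16)
    return bool(value & (1 << ELF_HEADERS_BIT))


def read_filter(pid: str = "self") -> str:
    """Read the raw filter text for a process (assumes a Linux /proc filesystem)."""
    with open(f"/proc/{pid}/coredump_filter") as f:
        return f.read()
```

On a Linux system, elf_headers_enabled(read_filter()) reports whether the current process has the bit set.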
I am trying to understand why, in the 5.10+ kernels, the kernel changed how it decides whether to include the first page of a memory mapping in a core dump: instead of checking whether the mapping begins with an ELF header, it now checks whether the inode is executable. The change in question:
github.com/torvalds/linux/commit/429a22e776a2b9f85a2b9c53d8e647598b553dd1
In many distributions (e.g.: Ubuntu), the shared objects in /usr/lib and elsewhere are not marked as executable. One of the net effects here is that the first page of each shared object on these distributions is no longer captured in core dumps.
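To make the difference between the two checks concrete, here is a rough Python sketch of each heuristic. It is not the kernel's code: it assumes the old check amounts to comparing the start of the mapping against the ELF magic bytes and the new check amounts to testing whether any execute permission bit is set on the file mode. The real kernel conditions involve further terms that are not modeled here, and both function names are hypothetical.

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def starts_with_elf_header(first_page: bytes) -> bool:
    """Old-style heuristic (assumed): does the mapping begin with the ELF magic?"""
    return first_page[:4] == ELF_MAGIC


def inode_is_executable(mode: int) -> bool:
    """New-style heuristic (assumed): is any of the user/group/other x bits set?"""
    return (mode & 0o111) != 0
```

A shared object installed with mode 0644 passes the first check but fails the second, which matches the behavior described above. The mode for a real file can be obtained with os.stat(path).st_mode.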
A core dump taken on Ubuntu 21.10 (with the 5.13 kernel) will, by default, not include these pages:
LOAD 0x0000000000007000 0x00007f375855f000 0x0000000000000000
0x0000000000000000 0x000000000002c000 R 0x1000
0x00007f375855f000 0x00007f375858b000 0x0000000000000000
/usr/lib/x86_64-linux-gnu/libc.so.6
Doing a quick "sudo chmod +x /usr/lib/x86_64-linux-gnu/libc.so.6" and repeating shows that the first page is then captured:
LOAD 0x0000000000007000 0x00007fefd5282000 0x0000000000000000
0x0000000000001000 0x000000000002c000 R 0x1000
0x00007fefd5282000 0x00007fefd52ae000 0x0000000000000000
/usr/lib/x86_64-linux-gnu/libc.so.6
Prior to running with 5.10+ kernels, I was always seeing the first page of shared objects (and the contained build-id) within core dumps (assuming the proper kernel config and core dump filter bits). Not any longer.
The reason I ask is that, as more teams here at Microsoft have products running on Linux (or in Linux containers), we have been pushing the crash reports for those through the same post-mortem crash analysis infrastructure that we use for Windows. That means that what has traditionally been the Windows debugger (e.g.: WinDbg) has, for some time, been able to open, debug, and analyze various Linux post-mortem crash formats. Doing this on a post-mortem basis requires finding the original images and debug information for the executables and shared objects referenced in those core dumps. Whether we do that via our own symbol servers or via a debuginfod service, the post-mortem debugger needs access to the build-ids of those objects.
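For readers unfamiliar with how a build-id is recovered, here is a minimal Python sketch that scans a buffer of ELF notes for the GNU build-id note (type NT_GNU_BUILD_ID = 3, owner name "GNU"). It assumes the caller has already located the note bytes, for example by following the PT_NOTE program header found in the captured first page; that lookup is not shown. The function name and its little_endian parameter are hypothetical.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type used for the GNU build-id


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (ELF note fields are 4-byte padded)."""
    return (n + 3) & ~3


def find_build_id(notes: bytes, little_endian: bool = True) -> Optional[str]:
    """Return the build-id as a hex string, or None if no such note is present."""
    fmt = "<III" if little_endian else ">III"
    off = 0
    while off + 12 <= len(notes):
        # Each note starts with three 32-bit words: name size, desc size, type.
        namesz, descsz, ntype = struct.unpack_from(fmt, notes, off)
        off += 12
        name = notes[off:off + namesz]
        off += _align4(namesz)
        desc = notes[off:off + descsz]
        off += _align4(descsz)
        # Skip truncated notes, where the descriptor runs past the buffer.
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00" and len(desc) == descsz:
            return desc.hex()
    return None
```

If the first page of a shared object is missing from the core dump, there are no note bytes to hand to a routine like this, which is why the debugger can no longer identify the image.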
Until recently, finding these from a core dump has been stable and working quite well. Of late, however, we have been seeing a number of crash reports (e.g.: from Debian or Ubuntu containers) where we can no longer find images & symbols based on the core dumps because this kernel change has caused the first page of shared object files to not be captured in core dumps. I don't know how many post-mortem Linux crash analysis solutions this is affecting...
Was this change really the intent, or is this a kernel bug?
Sincerely,
Bill Messmer
wmessmer@...rosoft.com