Message-ID: <CA+CK2bC7Qe-EDX8mZ_OvfH+9rfiYHCGK++znivu+SKvi8HGpkg@mail.gmail.com>
Date: Wed, 10 Feb 2021 16:48:51 -0500
From: Pavel Tatashin <pasha.tatashin@...een.com>
To: sourabhjain@...ux.ibm.com, hbathini@...ux.ibm.com,
Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Sasha Levin <sashal@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
James Morse <james.morse@....com>,
Will Deacon <will@...nel.org>, vgoyal@...hat.com,
Michael Ellerman <mpe@...erman.id.au>,
AKASHI Takahiro <takahiro.akashi@...aro.org>,
Dan Williams <dan.j.williams@...el.com>,
linux-mm <linux-mm@...ck.org>,
Tyler Hicks <tyhicks@...ux.microsoft.com>
Subject: improving crash dump discussion
I would like to start a discussion about how we can improve the Linux
crash dump facility, and use warm reboot / firmware assistance in
order to collect crash dumps more reliably while using less
memory and performing better.
Currently, the main way to collect crash dumps on Linux is to use
kdump. Kdump relies on kexec, which is mature and portable (it does
not depend on firmware), but using kexec is not ideal.
I will list some problems with kexec/kdump, and then discuss how some
of them (hopefully most) can be addressed.
1. Expecting a crashing kernel to do the right thing: properly quiesce
devices, CPUs and prepare the machine for the new kernel.
The amount of code that is executed to perform crash kexec reboot is
not trivial. Unfortunately, since we are panicking we already lost
control at some point and the goal would be to reduce the amount of
code executed by the panic handler in order to be able to reliably
collect dumps. There are some ways to improve the reliability of crash
kexec reboot. For example, passing the maxcpus=1 kernel parameter is now
required on almost all platforms, which, unfortunately, has the
downside of forcing the crash kernel to use only a single thread to save
the core, and thus "makedumpfile --num-threads" is useless if used from
the crash kernel.
2. Unlike booting from firmware, the PCI, CPUs, interrupt controllers,
DMA mappings, and I/O devices are not reinitialized and might not be
in a consistent state.
The reset_devices, irqpoll, and other kernel parameters are intended to
mitigate these shortfalls by requiring drivers to do the resetting
themselves. Also, the kernel is usually smart enough to ignore
spurious interrupts, but this is fragile.
3. There is a blackout window during boot where collecting a crash dump
is not possible.
With current kdump it is only possible to collect crashes that occur
after early kernel boot is finished. During early boot we do a lot:
determine the platform, initialize mm, the clock, and the scheduler, and
start the other CPUs. Only after entering user mode are we able to kexec
load the crash kernel into memory, and only after that can a crash be
collected.
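To make the blackout window concrete: user space can tell whether the
crash kernel has been loaded yet by reading the kexec_crash_loaded sysfs
file. A minimal sketch follows; the function name and the None-for-unknown
convention are illustrative assumptions, only the sysfs path is real.

```python
from pathlib import Path


def crash_kernel_loaded(sysfs_path="/sys/kernel/kexec_crash_loaded"):
    """Return True if a crash kernel is loaded, False if it is not,
    and None if the state cannot be read (e.g. kexec is not built in).

    The path is a parameter so the function can be tried on a plain file.
    """
    try:
        text = Path(sysfs_path).read_text().strip()
    except OSError:
        return None
    # The kernel reports "1" once a crash kernel has been kexec-loaded.
    return text == "1"


if __name__ == "__main__":
    state = crash_kernel_loaded()
    if state is None:
        print("crash kernel state unknown")
    elif state:
        print("crash kernel loaded: a panic from now on can be captured")
    else:
        print("crash kernel not loaded: still inside the blackout window")
```

Any panic that happens before this reports "loaded" falls inside the
window described above.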
4. Kdump is not compatible with hardware watchdog resets
When a hardware watchdog causes a reset, software is not involved, and
therefore we lose the entire machine state.
5. Crash kernel requires memory reservation
The crash kernel can't use the memory that was used by the crashing
kernel, therefore memory must always be reserved for it. That memory is
wasted during normal operation, as it only holds the image of the crash
kernel.
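For a sense of scale, the reservation is normally requested with the
crashkernel= kernel parameter. The sketch below estimates how much RAM a
given value sets aside. It is an illustrative model, not the kernel's
parser: it handles only the plain "size[KMG][@offset]" form and the
range form "start-[end]:size,...", and ignores hex sizes and the
",high" / ",low" variants. The helper names are made up for this example.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a decimal size with an optional K/M/G suffix into bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].upper()]
    return int(text)


def crashkernel_reservation(value, system_ram):
    """Return the bytes reserved by a crashkernel= value on a machine
    with system_ram bytes of memory (0 if no range matches)."""
    spec = value.split("@", 1)[0]  # the offset does not change the size
    if ":" not in spec:
        return parse_size(spec)  # simple form: crashkernel=256M
    for entry in spec.split(","):
        ram_range, size = entry.split(":", 1)
        start, _, end = ram_range.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None  # open-ended range
        # The range start is inclusive, the end is exclusive.
        if system_ram >= low and (high is None or system_ram < high):
            return parse_size(size)
    return 0


if __name__ == "__main__":
    ram = 4 << 30
    reserved = crashkernel_reservation("512M-2G:64M,2G-:128M", ram)
    print(f"{reserved >> 20} MiB reserved out of {ram >> 20} MiB")
```

Whatever the number is, it is unavailable to the running system for
the whole uptime.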
6. Crash kernel requires special image and two reboots
A special crash image is usually required to reduce the number of loaded
modules, and also to reduce the system to the bare minimum so that it
can be booted in the small reserved space. Also, after the crash
kernel collects the core dump, we reboot back to the normal kernel,
thus two reboots are needed in order to recover after the crash.
==========================================================================
On the other hand, powerpc can optionally use firmware-assisted
dump (fadump). The benefits of fadump:
1. The reboot goes through firmware, and thus all devices are reset to
their initial state.
2. Memory for the crash kernel does not need to be reserved if CMA is
used and user pages do not need to be preserved (commonly there is no
need to preserve user pages to debug kernel panics).
3. The fadump crash format is identical to kdump (ELF /proc/vmcore),
therefore the tools are the same, e.g. crash(8), makedumpfile, and
others can all be used.
4. No need to have a special crash kernel image and no need to do a
second reboot from the crash kernel.
The following services are expected from firmware in order for fadump to work:
1. Ability to do warm reboot
Preserve memory content across reboot. Firmware must not zero
(initialize) memory content. From my experience, this is actually
common nowadays: I see this happen on my AMD desktop with an X570
chipset + UEFI BIOS. We do this at Microsoft both on larger Xeon servers
with UEFI firmware, and on small arm64 devices which use device trees
instead of EFI for performance reasons, and also to preserve emulated
pmem devices across reboot. We also did it at Oracle on SPARC sun4v
machines, where the sun4v hypervisor would not reset memory content on
every reboot for performance reasons.
2. Ability to register preserved memory region with firmware
The first kernel uses firmware to reserve a region of memory that must
be preserved when rebooted. Firmware and bootloader must not allocate
from preserved regions.
3. Ability to copy boot memory source to destination.
On powerpc, boot must start from a lower address, as on x86.
Also, boot memory is a region of memory that can be used by the kernel
to boot, and the rest is added later once the kernel decides to
unreserve it: i.e. after vmcore is saved.
Copying boot memory is not strictly necessary: the panicking kernel
can do the copy on platforms where boot must start from a lower
address, and on other platforms where boot can be done from any
address the copy is not needed at all (e.g. arm64, x86_64).
What it comes down to is that there is little that firmware needs to
do in order to help Linux do a more reliable crash dump. It must
let the kernel reserve a region of memory from which the
firmware/bootloader won't allocate, and optionally, on
platforms where the kernel must always boot from a predefined physical
address, firmware should be able to copy boot memory content. The rest
can be done by the kernel alone.
Support for hardware watchdog resets is a little more complicated, as
it would require firmware to copy the CPU register contents to a
predefined place, but it should also be achievable.
We could agree on an interface that the kernel would support for both
EFI based firmware and device-tree based firmware. We could also add
this support to open source firmware projects such as LinuxBoot,
coreboot, and OVMF, and to boot loaders such as U-Boot and GRUB.
Pasha