Message-ID: <4EBB9D76.80601@redhat.com>
Date: Thu, 10 Nov 2011 17:46:30 +0800
From: Cong Wang <amwang@...hat.com>
To: Mahesh J Salgaonkar <mahesh@...ux.vnet.ibm.com>
CC: linuxppc-dev <linuxppc-dev@...abs.org>,
Linux Kernel <linux-kernel@...r.kernel.org>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Ananth Narayan <ananth@...ibm.com>,
Milton Miller <miltonm@....com>,
Haren Myneni <hbabu@...ibm.com>,
Anton Blanchard <anton@...ba.org>,
"Eric W. Biederman" <ebiederm@...ssion.com>
Subject: Re: [RFC PATCH v4 01/10] fadump: Add documentation for firmware-assisted
dump.
On 2011-11-07 17:55, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@...ux.vnet.ibm.com>
>
> Documentation for firmware-assisted dump. This document is based on the
> original documentation written for phyp assisted dump by Linas Vepstas
> and Manish Ahuja, with few changes to reflect the current implementation.
>
> Change in v3:
> - Modified the documentation to reflect introduction of fadump_registered
> sysfs file and few minor changes.
>
> Change in v2:
> - Modified the documentation to reflect the change of fadump_region
> file under debugfs filesystem.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@...ux.vnet.ibm.com>
Please Cc Randy Dunlap <rdunlap@...otime.net> on kernel documentation
patches.
I have some inline comments below.
> ---
> Documentation/powerpc/firmware-assisted-dump.txt | 262 ++++++++++++++++++++++
> 1 files changed, 262 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/powerpc/firmware-assisted-dump.txt
>
> diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
> new file mode 100644
> index 0000000..ba6724a
> --- /dev/null
> +++ b/Documentation/powerpc/firmware-assisted-dump.txt
> @@ -0,0 +1,262 @@
> +
> + Firmware-Assisted Dump
> + ------------------------
> + July 2011
> +
> +The goal of firmware-assisted dump is to enable the dump of
> +a crashed system, and to do so from a fully-reset system, and
> +to minimize the total elapsed time until the system is back
> +in production use.
> +
> +As compared to kdump or other strategies, firmware-assisted
> +dump offers several strong, practical advantages:
Compared with kdump or...
> +
> +-- Unlike kdump, the system has been reset, and loaded
> + with a fresh copy of the kernel. In particular,
> + PCI and I/O devices have been reinitialized and are
> + in a clean, consistent state.
> +-- Once the dump is copied out, the memory that held the dump
> + is immediately available to the running kernel. A further
> + reboot isn't required.
> +
> +The above can only be accomplished by coordination with,
> +and assistance from the Power firmware. The procedure is
> +as follows:
> +
> +-- The first kernel registers the sections of memory with the
> + Power firmware for dump preservation during OS initialization.
> + This registered sections of memory is reserved by the first
These registered sections of memory are...
> + kernel during early boot.
> +
> +-- When a system crashes, the Power firmware will save
> + the low memory (boot memory of size larger of 5% of system RAM
> + or 256MB) of RAM to a previously registered save region. It
...to the previous registered region...
> + will also save system registers, and hardware PTE's.
> +
> + NOTE: The term 'boot memory' means size of the low memory chunk
> + that is required for a kernel to boot successfully when
> + booted with restricted memory. By default, the boot memory
> + size will be calculated to larger of 5% of system RAM or
will be the larger of...
> + 256MB. Alternatively, user can also specify boot memory
> + size through boot parameter 'fadump_reserve_mem=' which
> + will override the default calculated size.
> +
> +-- After the low memory (boot memory) area has been saved, the
> + firmware will reset PCI and other hardware state. It will
> + *not* clear the RAM. It will then launch the bootloader, as
> + normal.
> +
> +-- The freshly booted kernel will notice that there is a new
> + node (ibm,dump-kernel) in the device tree, indicating that
> + there is crash data available from a previous boot. During
> + the early boot OS will reserve rest of the memory above
> + boot memory size effectively booting with restricted memory
> + size. This will make sure that the second kernel will not
> + touch any of the dump memory area.
> +
> +-- Userspace tools will read /proc/vmcore to obtain the contents
> + of memory, which holds the previous crashed kernel dump in ELF
> + format. The userspace tools may copy this info to disk, or
> + network, nas, san, iscsi, etc. as desired.
s/Userspace/User-space/
> +
> +-- Once the userspace tool is done saving dump, it will echo
> + '1' to /sys/kernel/fadump_release_mem to release the reserved
> + memory back to general use, except the memory required for
> + next firmware-assisted dump registration.
> +
> + e.g.
> + # echo 1 > /sys/kernel/fadump_release_mem
> +
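One shell detail matters when writing to these control files: `echo 1> file` (no space before `>`) is parsed as a redirection of file descriptor 1, so only an empty line is written, whereas `echo 1 > file` writes the character 1. Below is a minimal sketch of a release helper. The function name and the optional path argument are my own, added only so it can be tried against a scratch file; they are not part of the patch.

```shell
#!/bin/sh
# release_fadump_mem [FILE]
# Hypothetical helper: write '1' to the release file if it exists.
# FILE defaults to the sysfs path named in the quoted text.
release_fadump_mem() {
    release_file=${1:-/sys/kernel/fadump_release_mem}
    # The file is only expected to exist when a dump is waiting.
    [ -e "$release_file" ] || return 1
    # Note the space: "echo 1>" would redirect fd 1 and write an empty line.
    echo 1 > "$release_file"
}
```

Called with no argument it targets /sys/kernel/fadump_release_mem and returns non-zero when that file is absent.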
> +Please note that the firmware-assisted dump feature
> +is only available on Power6 and above systems with recent
> +firmware versions.
> +
> +Implementation details:
> +----------------------
> +
> +During boot, a check is made to see if firmware supports
> +this feature on that particular machine. If it does, then
> +we check to see if an active dump is waiting for us. If yes
> +then everything but boot memory size of RAM is reserved during
> +early boot (See Fig. 2). This area is released once we collect a
> +dump from user land scripts (kdump scripts) that are run. If
This area is released once we finish collecting the dump
from user land scripts (e.g. kdump scripts).
> +there is dump data, then the /sys/kernel/fadump_release_mem
> +file is created, and the reserved memory is held.
> +
> +If there is no waiting dump data, then only the memory required
> +to hold CPU state, HPTE region, boot memory dump and elfcore
> +header, is reserved at the top of memory (see Fig. 1). This area
> +is *not* released: this region will be kept permanently reserved,
> +so that it can act as a receptacle for a copy of the boot memory
> +content in addition to CPU state and HPTE region, in the case a
> +crash does occur.
> +
> + o Memory Reservation during first kernel
> +
> +  Low memory                                        Top of memory
> +  0      boot memory size                                       |
> +  |           |                       |<--Reserved dump area -->|
> +  V           V                       |   Permanent Reservation V
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +  |           |                       |CPU|HPTE|  DUMP     |ELF |
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +        |                                           ^
> +        |                                           |
> +        \                                           /
> +         -------------------------------------------
> +          Boot memory content gets transferred to
> +          reserved area by firmware at the time of
> +          crash
> +                   Fig. 1
> +
> + o Memory Reservation during second kernel after crash
> +
> +  Low memory                                        Top of memory
> +  0      boot memory size                                       |
> +  |           |<------------- Reserved dump area ----------- -->|
> +  V           V                                                 V
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +  |           |                       |CPU|HPTE|  DUMP     |ELF |
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +        |                                              |
> +        V                                              V
> +   Used by second                                 /proc/vmcore
> +   kernel to boot
> +                   Fig. 2
> +
> +Currently the dump will be copied from /proc/vmcore to
> +a new file upon user intervention. The dump data available through
> +/proc/vmcore will be in ELF format. Hence the existing kdump
> +infrastructure (kdump scripts) to save the dump works fine
> +with minor modifications. The kdump script requires following
> +modifications:
> +-- During service kdump start if /proc/vmcore entry is not present,
> + look for the existence of /sys/kernel/fadump_enabled and read
> + value exported by it. If value is set to '0' then fallback to
> + existing kexec based kdump. If value is set to '1' then check the
> + value exported by /sys/kernel/fadump_registered. If value is set
> + to '1' then print success otherwise register for fadump by
> + echo'ing 1 > /sys/kernel/fadump_registered file.
> +
> +-- During service kdump start if /proc/vmcore entry is present,
> + execute the existing routine to save the dump. Once the dump
> + is saved, echo 1 > /sys/kernel/fadump_release_mem (if the
> + file exists) to release the reserved memory for general use
> + and continue without rebooting. At this point the memory
> + reservation map will look like as shown in Fig. 1. If the file
> + /sys/kernel/fadump_release_mem is not present then follow
> + the existing routine to reboot into new kernel.
> +
> +-- During service kdump stop echo 0 > /sys/kernel/fadump_registered
> + to un-register the fadump.
> +
I don't think you need to document kdump script changes in a kernel
doc.
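For anyone following the thread, the start-time decision described in the quoted text can be sketched roughly as below. This is only an illustration: the function name, the action labels it prints, and the path arguments are hypothetical, and treating a missing fadump_enabled file as "fall back to kexec-based kdump" is my assumption, since the quoted text does not cover that case. The function only reports the decision and writes nothing.

```shell
#!/bin/sh
# fadump_start_action VMCORE ENABLED REGISTERED
# Prints the step the quoted description would take at "service kdump start".
# On a real system the arguments would be /proc/vmcore,
# /sys/kernel/fadump_enabled and /sys/kernel/fadump_registered.
fadump_start_action() {
    vmcore=$1
    enabled=$2
    registered=$3
    # A vmcore from a previous crash takes priority: save it first.
    if [ -e "$vmcore" ]; then
        echo save-dump
        return 0
    fi
    # Assumption: a missing or '0' fadump_enabled means plain kdump.
    if [ ! -r "$enabled" ] || [ "$(cat "$enabled")" != 1 ]; then
        echo use-kexec-kdump
        return 0
    fi
    # fadump is enabled; register only if not already registered.
    if [ -r "$registered" ] && [ "$(cat "$registered")" = 1 ]; then
        echo already-registered
    else
        echo register
    fi
}
```

A script would then act on the printed label, for example writing 1 to the registration file when the result is `register`.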
> +The tools to examine the dump will be same as the ones
> +used for kdump.
> +
> +How to enable firmware-assisted dump (fadump):
> +-------------------------------------
> +
> +1. Set config option CONFIG_FA_DUMP=y and build kernel.
> +2. Boot into linux kernel with 'fadump=1' kernel cmdline option.
> +3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
> + to specify size of the memory to reserve for boot memory dump
> + preservation.
> +
> +NOTE: If firmware-assisted dump fails to reserve memory then it will
> + fallback to existing kdump mechanism if 'crashkernel=' option
> + is set at kernel cmdline.
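As a worked example of the default that 'fadump_reserve_mem=' overrides, the "larger of 5% of system RAM or 256MB" rule can be computed as below. This is a sketch of the arithmetic only: the function name is made up, and it ignores any rounding or alignment the kernel may apply.

```shell
#!/bin/sh
# default_boot_mem_bytes TOTAL_RAM_BYTES
# Prints the larger of 5% of TOTAL_RAM_BYTES and 256MB, in bytes.
# Hypothetical helper; integer division truncates the 5% figure.
default_boot_mem_bytes() {
    five_percent=$(( $1 * 5 / 100 ))
    floor=$(( 256 * 1024 * 1024 ))
    if [ "$five_percent" -gt "$floor" ]; then
        echo "$five_percent"
    else
        echo "$floor"
    fi
}
```

With 4GB of RAM, 5% is below 256MB, so the result is 268435456 bytes; the 5% term only takes over above 5GB of RAM.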
> +
> +Sysfs/debugfs files:
> +------------
> +
> +Firmware-assisted dump feature uses sysfs file system to hold
> +the control files and debugfs file to display memory reserved region.
> +
> +Here is the list of files under kernel sysfs:
> +
> + /sys/kernel/fadump_enabled
> +
> + This is used to display the fadump status.
> + 0 = fadump is disabled
> + 1 = fadump is enabled
> +
> + /sys/kernel/fadump_registered
> +
> + This is used to display the fadump registration status as well
> + as to control (start/stop) the fadump registration.
> + 0 = fadump is not registered.
> + 1 = fadump is registered and ready to handle system crash.
> +
> + To register fadump echo 1 > /sys/kernel/fadump_registered and
> + echo 0 > /sys/kernel/fadump_registered for un-register and stop the
> + fadump. Once the fadump is un-registered, the system crash will not
> + be handled and vmcore will not be captured.
> +
> + /sys/kernel/fadump_release_mem
> +
> + This file is available only when fadump is active during
> + second kernel. This is used to release the reserved memory
> + region that are held for saving crash dump. To release the
> + reserved memory echo 1 to it:
> +
> + echo 1 > /sys/kernel/fadump_release_mem
> +
> + After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
> + file will change to reflect the new memory reservations.
> +
> +Here is the list of files under powerpc debugfs:
> +(Assuming debugfs is mounted on /sys/kernel/debug directory.)
> +
> + /sys/kernel/debug/powerpc/fadump_region
> +
> + This file shows the reserved memory regions if fadump is
> + enabled otherwise this file is empty. The output format
> + is:
> +<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
> +
> + e.g.
> + Contents when fadump is registered during first kernel
> +
> + # cat /sys/kernel/debug/powerpc/fadump_region
> + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
> + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
> + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
> +
> + Contents when fadump is active during second kernel
> +
> + # cat /sys/kernel/debug/powerpc/fadump_region
> + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
> + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
> + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
> + : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
> +
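Judging from the sample output, <end> appears to be inclusive, so <end> - <start> + 1 should equal <reserved-size>. A small sketch that checks one line of the file against that reading follows; the function name is hypothetical, and the inclusive-end interpretation is inferred from the numbers above rather than stated in the patch.

```shell
#!/bin/sh
# check_region_line LINE
# Prints "ok" if end - start + 1 equals the reserved size on one
# fadump_region line, "mismatch" if not, "unparsed" if the line
# does not match the documented format.
check_region_line() {
    # Extract "<start> <end> <size>" as three hex words.
    set -- $(printf '%s\n' "$1" | sed -n \
        's/.*\[\(0x[0-9a-fA-F]*\)-\(0x[0-9a-fA-F]*\)\] *\(0x[0-9a-fA-F]*\) bytes.*/\1 \2 \3/p')
    if [ $# -ne 3 ]; then
        echo unparsed
        return 1
    fi
    # Shell arithmetic accepts 0x-prefixed hex constants.
    if [ $(( $2 - $1 + 1 )) -eq $(( $3 )) ]; then
        echo ok
    else
        echo mismatch
    fi
}
```

All four sample lines in the second listing pass this check, including the unnamed region that covers the rest of memory above the boot memory size.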
> +NOTE: Please refer to debugfs documentation on how to mount the debugfs
> + filesystem.
> +
That is Documentation/filesystems/debugfs.txt.
> +
> +TODO:
> +-----
> + o Need to come up with the better approach to find out more
> + accurate boot memory size that is required for a kernel to
> + boot successfully when booted with restricted memory.
> + o The fadump implementation introduces a fadump crash info structure
> + in the scratch area before the ELF core header. The idea of introducing
> + this structure is to pass some important crash info data to the second
> + kernel which will help second kernel to populate ELF core header with
> + correct data before it gets exported through /proc/vmcore. The current
> + design implementation does not address a possibility of introducing
> + additional fields (in future) to this structure without affecting
> + compatibility. Need to come up with the better approach to address this.
> + The possible approaches are:
> + 1. Introduce version field for version tracking, bump up the version
> + whenever a new field is added to the structure in future. The version
> + field can be used to find out what fields are valid for the current
> + version of the structure.
> + 2. Reserve the area of predefined size (say PAGE_SIZE) for this
> + structure and have unused area as reserved (initialized to zero)
> + for future field additions.
> + The advantage of approach 1 over 2 is we don't need to reserve extra space.
> +---
Why do we keep TODO in this doc?
Thanks!