|
@@ -0,0 +1,270 @@
|
|
|
+
|
|
|
+ Firmware-Assisted Dump
|
|
|
+ ------------------------
|
|
|
+ July 2011
|
|
|
+
|
|
|
+The goal of firmware-assisted dump is to enable the dump of
|
|
|
+a crashed system, and to do so from a fully-reset system, and
|
|
|
+to minimize the total elapsed time until the system is back
|
|
|
+in production use.
|
|
|
+
|
|
|
+- Firmware assisted dump (fadump) infrastructure is intended to replace
|
|
|
+ the existing phyp assisted dump.
|
|
|
+- Fadump uses the same firmware interfaces and memory reservation model
|
|
|
+ as phyp assisted dump.
|
|
|
+- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
|
|
|
+ in the ELF format in the same way as kdump. This helps us reuse the
|
|
|
+ kdump infrastructure for dump capture and filtering.
|
|
|
+- Unlike phyp dump, userspace tool does not need to refer any sysfs
|
|
|
+ interface while reading /proc/vmcore.
|
|
|
+- Unlike phyp dump, fadump allows user to release all the memory reserved
|
|
|
+ for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
|
|
|
+- Once enabled through kernel boot parameter, fadump can be
|
|
|
+ started/stopped through /sys/kernel/fadump_registered interface (see
|
|
|
+ sysfs files section below) and can be easily integrated with kdump
|
|
|
+ service start/stop init scripts.
|
|
|
+
|
|
|
+Comparing with kdump or other strategies, firmware-assisted
|
|
|
+dump offers several strong, practical advantages:
|
|
|
+
|
|
|
+-- Unlike kdump, the system has been reset, and loaded
|
|
|
+ with a fresh copy of the kernel. In particular,
|
|
|
+ PCI and I/O devices have been reinitialized and are
|
|
|
+ in a clean, consistent state.
|
|
|
+-- Once the dump is copied out, the memory that held the dump
|
|
|
+ is immediately available to the running kernel. And therefore,
|
|
|
+ unlike kdump, fadump doesn't need a 2nd reboot to get back
|
|
|
+ the system to the production configuration.
|
|
|
+
|
|
|
+The above can only be accomplished by coordination with,
|
|
|
+and assistance from the Power firmware. The procedure is
|
|
|
+as follows:
|
|
|
+
|
|
|
+-- The first kernel registers the sections of memory with the
|
|
|
+ Power firmware for dump preservation during OS initialization.
|
|
|
+ These registered sections of memory are reserved by the first
|
|
|
+ kernel during early boot.
|
|
|
+
|
|
|
+-- When a system crashes, the Power firmware will save
|
|
|
+ the low memory (boot memory of size larger of 5% of system RAM
|
|
|
+ or 256MB) of RAM to the previous registered region. It will
|
|
|
+ also save system registers, and hardware PTE's.
|
|
|
+
|
|
|
+ NOTE: The term 'boot memory' means size of the low memory chunk
|
|
|
+ that is required for a kernel to boot successfully when
|
|
|
+ booted with restricted memory. By default, the boot memory
|
|
|
+ size will be the larger of 5% of system RAM or 256MB.
|
|
|
+ Alternatively, user can also specify boot memory size
|
|
|
+ through boot parameter 'fadump_reserve_mem=' which will
|
|
|
+ override the default calculated size. Use this option
|
|
|
+ if default boot memory size is not sufficient for second
|
|
|
+ kernel to boot successfully.
|
|
|
+
|
|
|
+-- After the low memory (boot memory) area has been saved, the
|
|
|
+ firmware will reset PCI and other hardware state. It will
|
|
|
+ *not* clear the RAM. It will then launch the bootloader, as
|
|
|
+ normal.
|
|
|
+
|
|
|
+-- The freshly booted kernel will notice that there is a new
|
|
|
+ node (ibm,dump-kernel) in the device tree, indicating that
|
|
|
+ there is crash data available from a previous boot. During
|
|
|
+ the early boot OS will reserve rest of the memory above
|
|
|
+ boot memory size effectively booting with restricted memory
|
|
|
+ size. This will make sure that the second kernel will not
|
|
|
+ touch any of the dump memory area.
|
|
|
+
|
|
|
+-- User-space tools will read /proc/vmcore to obtain the contents
|
|
|
+ of memory, which holds the previous crashed kernel dump in ELF
|
|
|
+ format. The userspace tools may copy this info to disk, or
|
|
|
+ network, nas, san, iscsi, etc. as desired.
|
|
|
+
|
|
|
+-- Once the userspace tool is done saving dump, it will echo
|
|
|
+ '1' to /sys/kernel/fadump_release_mem to release the reserved
|
|
|
+ memory back to general use, except the memory required for
|
|
|
+ next firmware-assisted dump registration.
|
|
|
+
|
|
|
+ e.g.
|
|
|
+ # echo 1 > /sys/kernel/fadump_release_mem
|
|
|
+
|
|
|
+Please note that the firmware-assisted dump feature
|
|
|
+is only available on Power6 and above systems with recent
|
|
|
+firmware versions.
|
|
|
+
|
|
|
+Implementation details:
|
|
|
+----------------------
|
|
|
+
|
|
|
+During boot, a check is made to see if firmware supports
|
|
|
+this feature on that particular machine. If it does, then
|
|
|
+we check to see if an active dump is waiting for us. If yes
|
|
|
+then everything but boot memory size of RAM is reserved during
|
|
|
+early boot (See Fig. 2). This area is released once we finish
|
|
|
+collecting the dump from user land scripts (e.g. kdump scripts)
|
|
|
+that are run. If there is dump data, then the
|
|
|
+/sys/kernel/fadump_release_mem file is created, and the reserved
|
|
|
+memory is held.
|
|
|
+
|
|
|
+If there is no waiting dump data, then only the memory required
|
|
|
+to hold CPU state, HPTE region, boot memory dump and elfcore
|
|
|
+header, is reserved at the top of memory (see Fig. 1). This area
|
|
|
+is *not* released: this region will be kept permanently reserved,
|
|
|
+so that it can act as a receptacle for a copy of the boot memory
|
|
|
+content in addition to CPU state and HPTE region, in the case a
|
|
|
+crash does occur.
|
|
|
+
|
|
|
+ o Memory Reservation during first kernel
|
|
|
+
|
|
|
+ Low memory Top of memory
|
|
|
+ 0 boot memory size |
|
|
|
+ | | |<--Reserved dump area -->|
|
|
|
+ V V | Permanent Reservation V
|
|
|
+ +-----------+----------/ /----------+---+----+-----------+----+
|
|
|
+ | | |CPU|HPTE| DUMP |ELF |
|
|
|
+ +-----------+----------/ /----------+---+----+-----------+----+
|
|
|
+ | ^
|
|
|
+ | |
|
|
|
+ \ /
|
|
|
+ -------------------------------------------
|
|
|
+ Boot memory content gets transferred to
|
|
|
+ reserved area by firmware at the time of
|
|
|
+ crash
|
|
|
+ Fig. 1
|
|
|
+
|
|
|
+ o Memory Reservation during second kernel after crash
|
|
|
+
|
|
|
+ Low memory Top of memory
|
|
|
+ 0 boot memory size |
|
|
|
+ | |<------------- Reserved dump area ----------- -->|
|
|
|
+ V V V
|
|
|
+ +-----------+----------/ /----------+---+----+-----------+----+
|
|
|
+ | | |CPU|HPTE| DUMP |ELF |
|
|
|
+ +-----------+----------/ /----------+---+----+-----------+----+
|
|
|
+ | |
|
|
|
+ V V
|
|
|
+ Used by second /proc/vmcore
|
|
|
+ kernel to boot
|
|
|
+ Fig. 2
|
|
|
+
|
|
|
+Currently the dump will be copied from /proc/vmcore to a
|
|
|
+a new file upon user intervention. The dump data available through
|
|
|
+/proc/vmcore will be in ELF format. Hence the existing kdump
|
|
|
+infrastructure (kdump scripts) to save the dump works fine with
|
|
|
+minor modifications.
|
|
|
+
|
|
|
+The tools to examine the dump will be same as the ones
|
|
|
+used for kdump.
|
|
|
+
|
|
|
+How to enable firmware-assisted dump (fadump):
|
|
|
+-------------------------------------
|
|
|
+
|
|
|
+1. Set config option CONFIG_FA_DUMP=y and build kernel.
|
|
|
+2. Boot into linux kernel with 'fadump=on' kernel cmdline option.
|
|
|
+3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
|
|
|
+ to specify size of the memory to reserve for boot memory dump
|
|
|
+ preservation.
|
|
|
+
|
|
|
+NOTE: If firmware-assisted dump fails to reserve memory then it will
|
|
|
+ fallback to existing kdump mechanism if 'crashkernel=' option
|
|
|
+ is set at kernel cmdline.
|
|
|
+
|
|
|
+Sysfs/debugfs files:
|
|
|
+------------
|
|
|
+
|
|
|
+Firmware-assisted dump feature uses sysfs file system to hold
|
|
|
+the control files and debugfs file to display memory reserved region.
|
|
|
+
|
|
|
+Here is the list of files under kernel sysfs:
|
|
|
+
|
|
|
+ /sys/kernel/fadump_enabled
|
|
|
+
|
|
|
+ This is used to display the fadump status.
|
|
|
+ 0 = fadump is disabled
|
|
|
+ 1 = fadump is enabled
|
|
|
+
|
|
|
+ This interface can be used by kdump init scripts to identify if
|
|
|
+ fadump is enabled in the kernel and act accordingly.
|
|
|
+
|
|
|
+ /sys/kernel/fadump_registered
|
|
|
+
|
|
|
+ This is used to display the fadump registration status as well
|
|
|
+ as to control (start/stop) the fadump registration.
|
|
|
+ 0 = fadump is not registered.
|
|
|
+ 1 = fadump is registered and ready to handle system crash.
|
|
|
+
|
|
|
+ To register fadump echo 1 > /sys/kernel/fadump_registered and
|
|
|
+ echo 0 > /sys/kernel/fadump_registered for un-register and stop the
|
|
|
+ fadump. Once the fadump is un-registered, the system crash will not
|
|
|
+ be handled and vmcore will not be captured. This interface can be
|
|
|
+ easily integrated with kdump service start/stop.
|
|
|
+
|
|
|
+ /sys/kernel/fadump_release_mem
|
|
|
+
|
|
|
+ This file is available only when fadump is active during
|
|
|
+ second kernel. This is used to release the reserved memory
|
|
|
+ region that are held for saving crash dump. To release the
|
|
|
+ reserved memory echo 1 to it:
|
|
|
+
|
|
|
+ echo 1 > /sys/kernel/fadump_release_mem
|
|
|
+
|
|
|
+ After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
|
|
|
+ file will change to reflect the new memory reservations.
|
|
|
+
|
|
|
+ The existing userspace tools (kdump infrastructure) can be easily
|
|
|
+ enhanced to use this interface to release the memory reserved for
|
|
|
+ dump and continue without 2nd reboot.
|
|
|
+
|
|
|
+Here is the list of files under powerpc debugfs:
|
|
|
+(Assuming debugfs is mounted on /sys/kernel/debug directory.)
|
|
|
+
|
|
|
+ /sys/kernel/debug/powerpc/fadump_region
|
|
|
+
|
|
|
+ This file shows the reserved memory regions if fadump is
|
|
|
+ enabled otherwise this file is empty. The output format
|
|
|
+ is:
|
|
|
+ <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
|
|
|
+
|
|
|
+ e.g.
|
|
|
+ Contents when fadump is registered during first kernel
|
|
|
+
|
|
|
+ # cat /sys/kernel/debug/powerpc/fadump_region
|
|
|
+ CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
|
|
|
+ HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
|
|
|
+ DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
|
|
|
+
|
|
|
+ Contents when fadump is active during second kernel
|
|
|
+
|
|
|
+ # cat /sys/kernel/debug/powerpc/fadump_region
|
|
|
+ CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
|
|
|
+ HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
|
|
|
+ DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
|
|
|
+ : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
|
|
|
+
|
|
|
+NOTE: Please refer to Documentation/filesystems/debugfs.txt on
|
|
|
+ how to mount the debugfs filesystem.
|
|
|
+
|
|
|
+
|
|
|
+TODO:
|
|
|
+-----
|
|
|
+ o Need to come up with the better approach to find out more
|
|
|
+ accurate boot memory size that is required for a kernel to
|
|
|
+ boot successfully when booted with restricted memory.
|
|
|
+ o The fadump implementation introduces a fadump crash info structure
|
|
|
+ in the scratch area before the ELF core header. The idea of introducing
|
|
|
+ this structure is to pass some important crash info data to the second
|
|
|
+ kernel which will help second kernel to populate ELF core header with
|
|
|
+ correct data before it gets exported through /proc/vmcore. The current
|
|
|
+ design implementation does not address a possibility of introducing
|
|
|
+ additional fields (in future) to this structure without affecting
|
|
|
+ compatibility. Need to come up with the better approach to address this.
|
|
|
+ The possible approaches are:
|
|
|
+ 1. Introduce version field for version tracking, bump up the version
|
|
|
+ whenever a new field is added to the structure in future. The version
|
|
|
+ field can be used to find out what fields are valid for the current
|
|
|
+ version of the structure.
|
|
|
+ 2. Reserve the area of predefined size (say PAGE_SIZE) for this
|
|
|
+ structure and have unused area as reserved (initialized to zero)
|
|
|
+ for future field additions.
|
|
|
+ The advantage of approach 1 over 2 is we don't need to reserve extra space.
|
|
|
+---
|
|
|
+Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
|
+This document is based on the original documentation written for phyp
|
|
|
+assisted dump by Linas Vepstas and Manish Ahuja.
|