|
@@ -0,0 +1,136 @@
|
|
|
+What is hwpoison?
|
|
|
+
|
|
|
+Upcoming Intel CPUs have support for recovering from some memory errors
|
|
|
+(``MCA recovery''). This requires the OS to declare a page "poisoned",
|
|
|
+kill the processes associated with it and avoid using it in the future.
|
|
|
+
|
|
|
+This patchkit implements the necessary infrastructure in the VM.
|
|
|
+
|
|
|
+To quote the overview comment:
|
|
|
+
|
|
|
+ * High level machine check handler. Handles pages reported by the
|
|
|
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
|
|
|
+ * failure.
|
|
|
+ *
|
|
|
+ * This focusses on pages detected as corrupted in the background.
|
|
|
+ * When the current CPU tries to consume corruption the currently
|
|
|
+ * running process can just be killed directly instead. This implies
|
|
|
+ * that if the error cannot be handled for some reason it's safe to
|
|
|
+ * just ignore it because no corruption has been consumed yet. Instead
|
|
|
+ * when that happens another machine check will happen.
|
|
|
+ *
|
|
|
+ * Handles page cache pages in various states. The tricky part
|
|
|
+ * here is that we can access any page asynchronous to other VM
|
|
|
+ * users, because memory failures could happen anytime and anywhere,
|
|
|
+ * possibly violating some of their assumptions. This is why this code
|
|
|
+ * has to be extremely careful. Generally it tries to use normal locking
|
|
|
+ * rules, as in get the standard locks, even if that means the
|
|
|
+ * error handling takes potentially a long time.
|
|
|
+ *
|
|
|
+ * Some of the operations here are somewhat inefficient and have non
|
|
|
+ * linear algorithmic complexity, because the data structures have not
|
|
|
+ * been optimized for this case. This is in particular the case
|
|
|
+ * for the mapping from a vma to a process. Since this case is expected
|
|
|
+ * to be rare we hope we can get away with this.
|
|
|
+
|
|
|
+The code consists of the high level handler in mm/memory-failure.c,
|
|
|
+a new page poison bit and various checks in the VM to handle poisoned
|
|
|
+pages.
|
|
|
+
|
|
|
+The main target right now is KVM guests, but it works for all kinds
|
|
|
+of applications. KVM support requires a recent qemu-kvm release.
|
|
|
+
|
|
|
+For KVM use, a new signal type was needed so that
|
|
|
+KVM can inject the machine check into the guest with the proper
|
|
|
+address. This in theory allows other applications to handle
|
|
|
+memory failures too. The expectation is that nearly all applications
|
|
|
+won't do that, but some very specialized ones might.
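A minimal sketch of how a specialized application might receive such a signal. The text above does not name the signal codes; this sketch assumes the SIGBUS si_code values BUS_MCEERR_AO ("action optional", reported early) and BUS_MCEERR_AR ("action required", reported when the corrupted data is consumed) from the Linux signal headers, with si_addr carrying the affected address. The function and variable names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Fallbacks for older libc headers (assumed values from the Linux UAPI). */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

enum mce_kind { MCE_NONE = 0, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

/* Map a SIGBUS si_code to the kind of memory failure report. */
enum mce_kind classify_sigbus(int si_code)
{
	if (si_code == BUS_MCEERR_AO)
		return MCE_ACTION_OPTIONAL;
	if (si_code == BUS_MCEERR_AR)
		return MCE_ACTION_REQUIRED;
	return MCE_NONE;
}

/* Written by the handler, read later by the main program. */
static void *volatile last_poisoned_addr;
static volatile sig_atomic_t poison_reported;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (classify_sigbus(info->si_code) == MCE_ACTION_OPTIONAL) {
		/* Nothing was consumed yet: record the address and let the
		 * main program drop the affected object later. */
		last_poisoned_addr = info->si_addr;
		poison_reported = 1;
		return;
	}
	/* Action required or an ordinary SIGBUS: returning would re-run
	 * the faulting access, so this sketch simply exits. */
	_exit(1);
}

/* Install the handler; returns 0 on success, -1 on error. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	sa.sa_sigaction = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_SIGINFO;
	return sigaction(SIGBUS, &sa, NULL);
}
```

The handler only stores values and calls _exit(), both of which are safe in signal context. A real application would do its cleanup outside the handler.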
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+There are two (actually three) modes memory failure recovery can be in:
|
|
|
+
|
|
|
+vm.memory_failure_recovery sysctl set to zero:
|
|
|
+ All memory failures cause a panic. Do not attempt recovery.
|
|
|
+ (on x86 this can also be affected by the tolerant level of the
|
|
|
+ MCE subsystem)
|
|
|
+
|
|
|
+early kill
|
|
|
+ (can be controlled globally and per process)
|
|
|
+ Send SIGBUS to the application as soon as the error is detected.
|
|
|
+ This allows applications that can process memory errors to handle them
|
|
|
+ gracefully (e.g. drop the affected object).
|
|
|
+ This is the mode used by KVM qemu.
|
|
|
+
|
|
|
+late kill
|
|
|
+ Send SIGBUS when the application runs into the corrupted page.
|
|
|
+ This is best for applications unaware of memory errors and is the default.
|
|
|
+ Note some pages are always handled as late kill.
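The three modes above can be summarized as a small decision function. This is an illustrative sketch only, not the kernel's code: all names are hypothetical, and the precedence shown (a per-process setting overrides the global default) is an assumption based on the description of the controls.

```c
/* Hypothetical per-process preference, as set through prctl. */
enum proc_pref { PREF_DEFAULT, PREF_EARLY, PREF_LATE };

/* What happens when a memory failure is reported. */
enum mf_outcome { OUTCOME_PANIC, OUTCOME_SIGBUS_EARLY, OUTCOME_SIGBUS_LATE };

/*
 * recovery:          value of vm.memory_failure_recovery
 * global_early_kill: value of vm.memory_failure_early_kill
 * pref:              per-process preference
 * page_forces_late:  nonzero for pages that are always handled as late kill
 */
enum mf_outcome choose_outcome(int recovery, int global_early_kill,
			       enum proc_pref pref, int page_forces_late)
{
	if (!recovery)
		return OUTCOME_PANIC;		/* no recovery attempted */
	if (page_forces_late)
		return OUTCOME_SIGBUS_LATE;
	if (pref == PREF_EARLY)
		return OUTCOME_SIGBUS_EARLY;
	if (pref == PREF_LATE)
		return OUTCOME_SIGBUS_LATE;
	/* PREF_DEFAULT: fall back to the global sysctl. */
	return global_early_kill ? OUTCOME_SIGBUS_EARLY : OUTCOME_SIGBUS_LATE;
}
```

For example, choose_outcome(1, 0, PREF_EARLY, 0) models a process such as qemu that opted into early kill on a system whose global default is late kill.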
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+User control:
|
|
|
+
|
|
|
+vm.memory_failure_recovery
|
|
|
+ See sysctl.txt
|
|
|
+
|
|
|
+vm.memory_failure_early_kill
|
|
|
+ Enable early kill mode globally
|
|
|
+
|
|
|
+PR_MCE_KILL
|
|
|
+ Set early/late kill mode/revert to system default
|
|
|
+ arg1: PR_MCE_KILL_CLEAR: Revert to system default
|
|
|
+ arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
|
|
|
+ PR_MCE_KILL_EARLY: Early kill
|
|
|
+ PR_MCE_KILL_LATE: Late kill
|
|
|
+ PR_MCE_KILL_DEFAULT: Use system global default
|
|
|
+PR_MCE_KILL_GET
|
|
|
+ return current mode
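A short sketch of the prctl interface described above, assuming the constants are available through <sys/prctl.h>. The wrapper names are hypothetical. In the prctl(2) call itself, the "arg1" and "arg2" listed above are the second and third arguments, and the remaining arguments are passed as zero.

```c
#include <sys/prctl.h>

/* Select early or late kill for the calling thread.
 * mode is PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or PR_MCE_KILL_DEFAULT.
 * Returns 0 on success, -1 with errno set on failure. */
int mce_kill_set(int mode)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
		     (unsigned long)mode, 0UL, 0UL);
}

/* Revert to the system default. */
int mce_kill_clear(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR,
		     0UL, 0UL, 0UL);
}

/* Return the current mode (one of the PR_MCE_KILL_* mode values),
 * or -1 with errno set on failure. */
int mce_kill_get(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/* Human readable name for a mode value, for logging. */
const char *mce_kill_mode_name(int mode)
{
	switch (mode) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		return "unknown";
	}
}
```

A process that wants to be told about errors as soon as they are detected would call mce_kill_set(PR_MCE_KILL_EARLY) during startup, before installing its SIGBUS handler.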
|
|
|
+
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+Testing:
|
|
|
+
|
|
|
+madvise(MADV_HWPOISON, ....)
|
|
|
+ (as root)
|
|
|
+ Poison a page in the process for testing
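A hedged sketch of such a test. It assumes the madvise request is exposed by the C library as MADV_HWPOISON (a fallback define with the assumed value 100 is included for older headers) and that the kernel was built with memory failure support. The function names are hypothetical. Only run this on a test machine: the poisoned page is taken out of use.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* assumed value from the Linux headers */
#endif

/* madvise() needs a page aligned start address; round addr down. */
uintptr_t page_base(uintptr_t addr, uintptr_t page_size)
{
	return addr - (addr % page_size);
}

/*
 * Map one anonymous page, fault it in and ask the kernel to poison it.
 * Returns 0 on success, -1 with errno set (e.g. EPERM when not root).
 * Depending on the kill mode the process may receive SIGBUS.
 */
int poison_test_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *page;
	int ret;

	if (page_size <= 0)
		return -1;
	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	page[0] = 1;	/* make sure a physical page is allocated */

	ret = madvise(page, (size_t)page_size, MADV_HWPOISON);

	/* Do not touch the page again; just drop the mapping. */
	munmap(page, (size_t)page_size);
	return ret;
}
```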
|
|
|
+
|
|
|
+
|
|
|
+hwpoison-inject module through debugfs
|
|
|
+ /sys/kernel/debug/hwpoison/corrupt-pfn
|
|
|
+
|
|
|
+Inject hwpoison fault at PFN echoed into this file
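A small sketch of driving the injector from a test program. The path is passed in by the caller because it depends on where debugfs is mounted (commonly /sys/kernel/debug). The helper names are hypothetical, and writing the PFN as a decimal number is an assumption about what the file accepts.

```c
#include <stdint.h>
#include <stdio.h>

/* Convert a physical address to a page frame number. */
uint64_t phys_to_pfn(uint64_t phys_addr, uint64_t page_size)
{
	return phys_addr / page_size;
}

/*
 * Write a PFN into the corrupt-pfn file at the given path.
 * Returns 0 on success, -1 if the file cannot be opened or written
 * (for example when the hwpoison-inject module is not loaded).
 */
int inject_pfn(const char *path, uint64_t pfn)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%llu\n", (unsigned long long)pfn) > 0;
	if (fclose(f) != 0)
		ok = 0;	/* the write may only fail when flushed */
	return ok ? 0 : -1;
}
```

A caller would compute the PFN with phys_to_pfn() and then pass the corrupt-pfn path and that PFN to inject_pfn(), running as root.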
|
|
|
+
|
|
|
+
|
|
|
+Architecture specific MCE injector
|
|
|
+
|
|
|
+x86 has mce-inject, mce-test
|
|
|
+
|
|
|
+Some portable hwpoison test programs in mce-test, see below.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+References:
|
|
|
+
|
|
|
+http://halobates.de/mce-lc09-2.pdf
|
|
|
+ Overview presentation from LinuxCon 09
|
|
|
+
|
|
|
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
|
|
|
+ Test suite (hwpoison specific portable tests in tsrc)
|
|
|
+
|
|
|
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
|
|
|
+ x86 specific injector
|
|
|
+
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+Limitations:
|
|
|
+
|
|
|
+- Not all page types are supported and never will be. Most kernel internal
|
|
|
+objects cannot be recovered, only LRU pages for now.
|
|
|
+- Right now hugepage support is missing.
|
|
|
+
|
|
|
+---
|
|
|
+Andi Kleen, Oct 2009
|
|
|
+
|