16 years ago · f58ee00f15
--- a/Documentation/vm/hwpoison.txt
+++ b/Documentation/vm/hwpoison.txt
@@ -0,0 +1,136 @@
 
+What is hwpoison?
+
+Upcoming Intel CPUs have support for recovering from some memory errors
+(``MCA recovery''). This requires the OS to declare a page "poisoned",
+kill the processes associated with it and avoid using it in the future.
+
+This patchkit implements the necessary infrastructure in the VM.
+
+To quote the overview comment:
+
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focusses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a vma to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+
+The code consists of the high level handler in mm/memory-failure.c,
+a new page poison bit and various checks in the VM to handle poisoned
+pages.
+
+The main target right now is KVM guests, but it works for all kinds
+of applications. KVM support requires a recent qemu-kvm release.
+
+For the KVM use there was a need for a new signal type so that
+KVM can inject the machine check into the guest with the proper
+address. This in theory allows other applications to handle
+memory failures too. The expectation is that nearly all applications
+won't do that, but some very specialized ones might.
+
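A minimal sketch of how an application might recognise such a signal.
It assumes the signal arrives as SIGBUS with si_code set to
BUS_MCEERR_AO (error found in the background) or BUS_MCEERR_AR (the
task touched the bad page), and that si_addr carries the address. The
helper names and the two bookkeeping variables are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional */
#endif

/* Hypothetical bookkeeping, filled in by the handler. */
static volatile sig_atomic_t last_bus_code;
static void *volatile last_bus_addr;

/* Returns 1 if si_code marks a hardware memory error, 0 otherwise. */
int is_mce_sigbus(int si_code)
{
	return si_code == BUS_MCEERR_AR || si_code == BUS_MCEERR_AO;
}

/* Only async-signal-safe work here: record the event and return. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	last_bus_code = info->si_code;
	last_bus_addr = info->si_addr;
}

/* Install the handler; returns 0 on success, -1 on error. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

After install_sigbus_handler(), the main loop can check last_bus_code
with is_mce_sigbus() and drop whatever object covers last_bus_addr.
Simply returning is only reasonable for BUS_MCEERR_AO; for BUS_MCEERR_AR
the faulting access would run again, so a real handler has to jump away
or exit.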
			
 
+---
+
+There are two (actually three) modes memory failure recovery can be in:
+
+vm.memory_failure_recovery sysctl set to zero:
+	All memory failures cause a panic. Do not attempt recovery.
+	(on x86 this can also be affected by the tolerant level of the
+	MCE subsystem)
+
+early kill
+	(can be controlled globally and per process)
+	Send SIGBUS to the application as soon as the error is detected.
+	This allows applications that can process memory errors in a gentle
+	way (e.g. drop the affected object) to do so.
+	This is the mode used by KVM qemu.
+
+late kill
+	Send SIGBUS when the application runs into the corrupted page.
+	This is best for applications unaware of memory errors and is the
+	default.
+	Note some pages are always handled as late kill.
+
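The three modes can be read back from the two sysctls,
vm.memory_failure_recovery and vm.memory_failure_early_kill. The sketch
below assumes they are exposed under /proc/sys/vm as usual; the enum and
function names are hypothetical, and any per-process override is not
taken into account.

```c
#include <stdio.h>

enum mf_mode { MF_PANIC, MF_EARLY_KILL, MF_LATE_KILL };

/* Map the two sysctl values to the mode names used above. */
enum mf_mode mf_mode_from_sysctls(int recovery, int early_kill)
{
	if (recovery == 0)
		return MF_PANIC;	/* no recovery attempted */
	return early_kill ? MF_EARLY_KILL : MF_LATE_KILL;
}

/* Read one integer from a /proc/sys file; returns -1 if unavailable. */
int read_sysctl_int(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Global default mode, or -1 if the kernel lacks the sysctls. */
int mf_global_mode(void)
{
	int recovery = read_sysctl_int("/proc/sys/vm/memory_failure_recovery");
	int early = read_sysctl_int("/proc/sys/vm/memory_failure_early_kill");

	if (recovery < 0 || early < 0)
		return -1;
	return mf_mode_from_sysctls(recovery, early);
}
```

A return value of -1 from mf_global_mode() most likely means the kernel
was built without memory failure support.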
			
 
+---
+
+User control:
+
+vm.memory_failure_recovery
+	See sysctl.txt
+
+vm.memory_failure_early_kill
+	Enable early kill mode globally
+
+PR_MCE_KILL
+	Set early/late kill mode/revert to system default
+	arg1: PR_MCE_KILL_CLEAR: Revert to system default
+	arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
+		PR_MCE_KILL_EARLY: Early kill
+		PR_MCE_KILL_LATE:  Late kill
+		PR_MCE_KILL_DEFAULT: Use system global default
+PR_MCE_KILL_GET
+	return current mode
+
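A short sketch of the per-thread controls through prctl(2). Note that
"arg1" and "arg2" above are the second and third arguments of the
prctl() call, after the option itself, and the remaining arguments are
passed as zero. The wrapper names are hypothetical.

```c
#include <sys/prctl.h>

/* Ask for early kill on the calling thread; 0 on success, -1 on error. */
int mce_kill_set_early(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Ask for late kill on the calling thread; 0 on success, -1 on error. */
int mce_kill_set_late(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_LATE, 0, 0);
}

/* Revert to the system default; 0 on success, -1 on error. */
int mce_kill_clear(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR, 0, 0, 0);
}

/* Current mode: PR_MCE_KILL_EARLY, _LATE or _DEFAULT; -1 on error. */
int mce_kill_get(void)
{
	return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

A program that wants to clean up early would call mce_kill_set_early()
once at startup and then handle SIGBUS itself.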
			
 
+
+---
+
+Testing:
+
+madvise(MADV_HWPOISON, ....)
+	(as root)
+	Poison a page in the process for testing
+
+
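A sketch of the madvise() test path. The advice value is named
MADV_HWPOISON in <sys/mman.h>. The function names are hypothetical, and
the self test is only meant for a machine where losing one page until
reboot is acceptable.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from the generic Linux mman header */
#endif

/*
 * Poison the pages backing [addr, addr + len). Needs root and a kernel
 * with memory failure support. Returns 0, or -1 with errno set.
 */
int hwpoison_range(void *addr, size_t len)
{
	return madvise(addr, len, MADV_HWPOISON);
}

/*
 * Map one anonymous page, touch it so that it is backed by real
 * memory, then poison it. Returns the result of the madvise() call.
 */
int hwpoison_self_test(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *p;
	int rc;

	if (psz <= 0)
		return -1;
	p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = 1;	/* fault the page in */
	rc = hwpoison_range(p, (size_t)psz);
	/* The page is not touched again: that could raise SIGBUS. */
	munmap(p, (size_t)psz);
	return rc;
}
```

Depending on the kill mode, a SIGBUS may arrive as soon as
hwpoison_self_test() poisons the page, so a test program would install
its SIGBUS handler first.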
			
 
+hwpoison-inject module through debugfs
+	/sys/kernel/debug/hwpoison/corrupt-pfn
+
+Inject hwpoison fault at PFN echoed into this file.
+
+
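The injection step is a plain write of the page frame number to the
corrupt-pfn file. The sketch below assumes a decimal number followed by
a newline is accepted; the function name is hypothetical, and the path
is passed in by the caller because it depends on where debugfs is
mounted.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write a page frame number, in decimal, to an injection file such as
 * corrupt-pfn. Returns 0 on success, -1 on error.
 */
int hwpoison_inject_pfn(const char *path, unsigned long long pfn)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%llu\n", pfn);
	ssize_t written;
	int fd;

	fd = open(path, O_WRONLY);	/* the file must already exist */
	if (fd < 0)
		return -1;
	written = write(fd, buf, (size_t)len);
	close(fd);
	return written == (ssize_t)len ? 0 : -1;
}
```

This does the same as echoing the number into the file from a root
shell, with the hwpoison-inject module loaded.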
			
 
+Architecture specific MCE injector
+
+x86 has mce-inject, mce-test
+
+Some portable hwpoison test programs in mce-test, see below.
+
+---
+
+References:
+
+http://halobates.de/mce-lc09-2.pdf
+	Overview presentation from LinuxCon 09
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
+	Test suite (hwpoison specific portable tests in tsrc)
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
+	x86 specific injector
+
+
+---
+
+Limitations:
+
+- Not all page types are supported and never will be. Most kernel internal
+  objects cannot be recovered, only LRU pages for now.
+- Right now hugepage support is missing.
+
+---
+Andi Kleen, Oct 2009
+