16 years ago · 31983a04d6
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
 
				 7 Dec 2005
			
 
				 17 Jul 2007	Updated
			
 
				 
			
 
				+(c) Mauro Carvalho Chehab <mchehab@redhat.com>
			
 
				+05 Aug 2009	Nehalem interface
			
 
				 
			
 
				 EDAC is maintained and written by:
			
 
				 
			
@@ -717,3 +719,111 @@ unique drivers for their hardware systems.
 
				 The 'test_device_edac' sample driver is located at the
			
 
				 bluesmoke.sourceforge.net project site for EDAC.
			
 
				 
			
 
				+=======================================================================
			
 
				+NEHALEM USAGE OF EDAC APIs
			
 
				+
			
 
				+This chapter documents some EXPERIMENTAL mappings of the EDAC API for the
			
 
				+Nehalem EDAC driver. They will likely change in future versions
			
 
				+of the driver.
			
 
				+
			
 
				+Due to the way Nehalem exports Memory Controller data, some adjustments
			
 
				+were made in the i7core_edac driver. This chapter covers those differences.
			
 
				+
			
 
				+1) On Nehalem, there is one Memory Controller per QuickPath Interconnect
			
 
				+   (QPI). In the driver, the term "socket" means one QPI. It should also be
			
 
				+   associated with the CPU physical socket.
			
 
				+
			
 
				+   Each MC has 3 physical read channels, 3 physical write channels and
			
 
				+   3 logical channels. The driver currently sees them as just 3 channels.
			
 
				+   Each channel can have up to 3 DIMMs.
			
 
				+
			
 
				+   The smallest known unit is the DIMM. There is no information about csrows.
			
 
				+   As the EDAC API treats the csrow as the smallest unit, the driver exports one
			
 
				+   DIMM per csrow.
			
 
				+
			
 
				+   Currently, it also exports all the memory controllers as just one. This
			
 
				+   limitation will be removed in future versions of the driver.
			
 
				+
			
 
				+2) The Nehalem MC has the ability to generate errors. The driver implements this
			
 
				+   functionality via some error injection nodes:
			
 
				+
			
 
				+   To inject a memory error, use the sysfs nodes under
			
 
				+   /sys/devices/system/edac/mc/mc0/:
			
 
				+
			
 
				+   inject_addrmatch:
			
 
				+      Controls the error injection mask register. It is possible to specify
			
 
				+      several characteristics of the address to match an error code:
			
 
				+         dimm = the affected dimm. Numbers are relative to a channel;
			
 
				+         rank = the memory rank;
			
 
				+         channel = the channel that will generate an error;
			
 
				+         bank = the affected bank;
			
 
				+         page = the page address;
			
 
				+         column (or col) = the address column.
			
 
				+      Each of the above values can be set to "any" to match any valid value.
			
 
				+
			
 
				+      At driver init, all values are set to any.
			
 
				+
			
 
				+      For example, to generate an error at rank 1 of dimm 2, for any channel,
			
 
				+      any bank, any page, any column:
			
 
				+		echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
			
 
				+
			
 
				+	To return to the default behaviour of matching any, you can do:
			
 
				+		echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
			
 
				+
			
 
				+   inject_eccmask:
			
 
				+       specifies which bits will be affected by the error.
			
 
				+
			
 
				+   inject_section:
			
 
				+       specifies which ECC cache section will get the error:
			
 
				+		3 for both
			
 
				+		2 for the highest
			
 
				+		1 for the lowest
			
 
				+
			
 
				+   inject_socket:
			
 
				+       specifies which QPI (or processor socket) will generate the error.
			
 
				+          on Xeon 35xx, it should be 0.
			
 
				+          on Xeon 55xx, it should be 0 or 1.
			
 
				+
			
 
				+   inject_type:
			
 
				+       specifies the type of error, being a combination of the following bits:
			
 
				+		bit 0 - repeat
			
 
				+		bit 1 - ecc
			
 
				+		bit 2 - parity
			
 
				+
			
 
				+   inject_enable:
			
 
				+       starts the error generation when a non-zero value is written.
			
 
				+
			
 
				+   All inject vars can be read. Root permission is needed to write them.
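Since inject_type is a combination of bits, the value to write can be computed from flag names. The following is only a sketch: the function name inject_type_value is hypothetical and not part of the driver; the bit positions are the ones listed for inject_type (repeat = bit 0, ecc = bit 1, parity = bit 2).

```shell
# Hypothetical helper (not part of the driver): combine inject_type flag
# names into the numeric value expected by the inject_type node.
# Bit positions follow the documented list: repeat=bit 0, ecc=bit 1,
# parity=bit 2.
inject_type_value() {
    value=0
    for flag in "$@"; do
        case "$flag" in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$value"
}
```

For instance, "inject_type_value repeat ecc" prints 3, which could then be written to inject_type.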
			
 
				+
			
 
				+   The datasheet states that the error will only be generated after a write to an
			
 
				+   address that matches inject_addrmatch. It seems, however, that reading will
			
 
				+   also produce an error.
			
 
				+
			
 
				+   For example, the following code will generate an error for any write access
			
 
				+   at socket 0, on any DIMM/address on channel 2:
			
 
				+
			
 
				+   echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
			
 
				+   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
			
 
				+   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
			
 
				+   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
			
 
				+   echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
			
 
				+   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
			
 
				+   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
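The sequence of writes above can be wrapped in a small function so that the same steps are repeated consistently. This is a sketch under assumptions: the names edac_inject and EDAC_MC are hypothetical, and the function only performs the same echo writes shown above, in the same order. It does not trigger the memory access itself.

```shell
# Hypothetical wrapper around the sysfs writes shown above.
# EDAC_MC is an assumed variable name; it defaults to the mc0 directory
# used throughout this chapter and can be overridden for dry runs.
EDAC_MC="${EDAC_MC:-/sys/devices/system/edac/mc/mc0}"

# usage: edac_inject <addrmatch> <type> <eccmask> <section> <socket>
edac_inject() {
    if [ "$#" -ne 5 ]; then
        echo "usage: edac_inject addrmatch type eccmask section socket" >&2
        return 1
    fi
    echo "$1" > "$EDAC_MC/inject_addrmatch" || return 1
    echo "$2" > "$EDAC_MC/inject_type"      || return 1
    echo "$3" > "$EDAC_MC/inject_eccmask"   || return 1
    echo "$4" > "$EDAC_MC/inject_section"   || return 1
    echo "$5" > "$EDAC_MC/inject_socket"    || return 1
    # Writing a non-zero value to inject_enable arms the injection.
    echo 1 > "$EDAC_MC/inject_enable"
}
```

With root permission, "edac_inject channel:2 2 64 3 0" would perform the same writes as the example. Pointing EDAC_MC at a scratch directory allows checking the values without touching the hardware.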
			
 
				+
			
 
				+   The generated error message will look like:
			
 
				+
			
 
				+   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
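When checking an injection, it can help to pull individual fields out of a message like the one above. The helper below is a sketch only: edac_field is a hypothetical name, and it assumes the "name=value" layout of the sample message (fields written with spaces around "=", such as addr, are not handled).

```shell
# Hypothetical helper: print the value of one "name=value" field from an
# EDAC message formatted like the sample above.
# $1 = field name (case-sensitive), $2 = the log line.
edac_field() {
    printf '%s\n' "$2" | sed -n "s/.*[ (]$1=\([^,) ]*\).*/\1/p"
}

# Example with the sample message from this chapter.
line='EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))'
edac_field Channel "$line"
edac_field syndrome "$line"
```

In the sample, the Channel field is 2, matching the "channel:2" written to inject_addrmatch, and the syndrome 0x00000040 is 64 in decimal, the value written to inject_eccmask.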
			
 
				+
			
 
				+3) Nehalem specific Corrected Error memory counters
			
 
				+
			
 
				+   Nehalem has some registers to count memory errors, reporting them in a
			
 
				+   way that differs from what the EDAC API allows. Because of that, a
			
 
				+   separate sysfs node was created to handle such counters.
			
 
				+
			
 
				+   They can be read by looking at the contents of "corrected_error_counts"
			
 
				+   counter:
			
 
				+
			
 
				+	$ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
			
 
				+	dimm0: 15866
			
 
				+	dimm1: 0
			
 
				+	dimm2: 27285
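The per-DIMM counters can be summarized with a short script. This is a sketch, assuming the "dimmN: count" layout shown above; the function name edac_ce_summary is hypothetical, and the default path is the mc0 node used in this chapter.

```shell
# Hypothetical helper: list the DIMMs with a non-zero corrected error
# count and print the total. Assumes one "dimmN: count" entry per line,
# as in the sample output above.
# usage: edac_ce_summary [file]
edac_ce_summary() {
    awk '
        $1 ~ /^dimm[0-9]+:$/ {
            total += $2
            if ($2 > 0) print $1 " " $2
        }
        END { print "total: " (total + 0) }
    ' "${1:-/sys/devices/system/edac/mc/mc0/corrected_error_counts}"
}
```

Run against the sample output above, it lists dimm0 and dimm2 and reports a total of 43151.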