@@ -730,25 +730,41 @@ Due to the way Nehalem exports Memory Controller data, some adjustments
were done at i7core_edac driver. This chapter will cover those differences

1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect
- (QPI). At the driver, the term "socket" means one QPI. It should also be
- associated with the CPU physical socket.
+ (QPI). At the driver, the term "socket" means one QPI. This is
+ associated with a physical CPU socket.

Each MC have 3 physical read channels, 3 physical write channels and
3 logic channels. The driver currenty sees it as just 3 channels.
Each channel can have up to 3 DIMMs.

The minimum known unity is DIMMs. There are no information about csrows.
- As EDAC API maps the minimum unity is csrows, the driver exports one
+ As the EDAC API's minimum unit is the csrow, the driver sequentially
+ maps channel/dimm into different csrows.
+
+ For example, supposing the following layout:
+ Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
+ dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
+ dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
+ dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+ The driver will map it as:
+ csrow0: channel 0, dimm0
+ csrow1: channel 0, dimm1
+ csrow2: channel 1, dimm0
+ csrow3: channel 2, dimm0
+
+ In short, the driver exports one
DIMM per csrow.
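
The sequential channel/dimm to csrow mapping described above can be sketched in a few lines of Python. This is only an illustration of the numbering rule, not the driver's code: the function name map_csrows and the dictionary used to describe the populated slots are assumptions made for the example.

```python
def map_csrows(populated):
    """Assign csrow numbers sequentially to populated (channel, dimm) slots.

    `populated` is a hypothetical description of the machine: a dict that
    maps a channel number to the list of DIMM slots present on that channel.
    The list index of each returned pair is its csrow number.
    """
    csrows = []
    for channel in sorted(populated):
        for dimm in sorted(populated[channel]):
            csrows.append((channel, dimm))
    return csrows


# Layout from the example: two DIMMs on channel 0, one each on channels 1 and 2.
layout = {0: [0, 1], 1: [0], 2: [0]}
for csrow, (channel, dimm) in enumerate(map_csrows(layout)):
    print(f"csrow{csrow}: channel {channel}, dimm{dimm}")
```

With the layout from the example, the loop prints the same four csrow lines listed in the mapping above.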

- Currently, it also exports the several memory controllers as just one. This
- limit will be removed on future versions of the driver.
+ Each QPI is exported as a different memory controller.

2) Nehalem MC has the hability to generate errors. The driver implements this
functionality via some error injection nodes:

For injecting a memory error, there are some sysfs nodes, under
- /sys/devices/system/edac/mc/mc0/:
+ /sys/devices/system/edac/mc/mc?/:

inject_addrmatch:
Controls the error injection mask register. It is possible to specify
@@ -779,11 +795,6 @@ were done at i7core_edac driver. This chapter will cover those differences
2 for the highest
1 for the lowest

- inject_socket:
- specifies what QPI (or processor socket) will generate the error.
- on Xeon 35xx, it should be 0.
- on Xeon 55xx, it should be 0 or 1.
-
inject_type:
specifies the type of error, being a combination of the following bits:
bit 0 - repeat
@@ -806,10 +817,12 @@ were done at i7core_edac driver. This chapter will cover those differences
echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
- echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

+ For socket 1, replace "mc0" with "mc1" in the above
+ commands.
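
Since the socket is now selected by the mc? directory rather than by a separate inject_socket node, the same command sequence can be generated for any controller. The Python sketch below only prints the commands (a dry run) and does not touch sysfs. The helper name injection_writes is hypothetical, and the settings list covers only the nodes and values visible in this hunk; any earlier commands of the example are not reproduced.

```python
def injection_writes(mc, settings):
    """Return (sysfs path, value) pairs for memory controller number `mc`."""
    base = f"/sys/devices/system/edac/mc/mc{mc}"
    return [(f"{base}/{node}", str(value)) for node, value in settings]


# Nodes and values taken from the example commands shown in this hunk.
SETTINGS = [
    ("inject_type", 2),
    ("inject_eccmask", 64),
    ("inject_section", 3),
    ("inject_enable", 1),
]

# Dry run for socket 1: print the equivalent shell commands.
for path, value in injection_writes(1, SETTINGS):
    print(f"echo {value} >{path}")
```

Actually writing these values would require root privileges and a machine on which the driver exposes the injection nodes.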
+
The generated error message will look like:

EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
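
When checking injection results from a log, the numeric fields of such a message can be pulled out with a small Python sketch like the one below. The function name parse_edac_details is hypothetical, and the field list is taken only from the sample message above; other kernel versions may format the line differently.

```python
import re

# Fields that carry a plain decimal or 0x-prefixed value in the sample message.
FIELD_RE = re.compile(
    r"\b(addr|socket|Dimm|Channel|syndrome|count)\s*=\s*(0x[0-9a-fA-F]+|\d+)"
)


def parse_edac_details(message):
    """Return the numeric fields of an i7core EDAC error line as integers."""
    return {key: int(value, 0) for key, value in FIELD_RE.findall(message)}


sample = (
    'EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL '
    "(addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, "
    "count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))"
)
details = parse_edac_details(sample)
print(details["Channel"], details["Dimm"], hex(details["syndrome"]))
```

The Err field is deliberately left out, because its value is a pair of unprefixed hexadecimal numbers rather than a single integer.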
@@ -821,9 +834,36 @@ were done at i7core_edac driver. This chapter will cover those differences
separate sysfs note were created to handle such counters.

They can be read by looking at the contents of "corrected_error_counts"
- counter:
+ counter. Due to hardware limits, the output is different on machines
+ with unregistered memories and machines with registered ones.
+
+ With unregistered memories, it outputs:

$ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
- dimm0: 15866
- dimm1: 0
- dimm2: 27285
+ all channels UDIMM0: 0 UDIMM1: 0 UDIMM2: 0
+
+ Here, errors on different csrows that share the same dimm number
+ increment the same counter.
+ So, in this memory mapping:
+ csrow0: channel 0, dimm0
+ csrow1: channel 0, dimm1
+ csrow2: channel 1, dimm0
+ csrow3: channel 2, dimm0
+ The hardware will increment UDIMM0 for an error at either csrow0, csrow2
+ or csrow3.
+
+ With registered memories, it outputs:
+
+ $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
+ channel 0 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0
+ channel 1 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0
+ channel 2 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0
+
+ So, with registered memories, there's a direct map between a csrow and a
+ counter.
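
A script that consumes corrected_error_counts has to cope with both layouts. The Python sketch below is one possible way to do that, assuming the text looks like the two samples above (the exact whitespace of the real file may differ). The function name parse_corrected_counts is hypothetical, and a channel of None stands for the shared "all channels" counters reported with unregistered memories.

```python
import re


def parse_corrected_counts(text):
    """Map (channel, dimm) to a corrected error count.

    `channel` is None for the "all channels" line printed with unregistered
    DIMMs, where the hardware keeps one counter per dimm number.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r"channel (\d+)", line)
        channel = int(match.group(1)) if match else None
        for dimm, count in re.findall(r"[UR]DIMM(\d+):\s*(\d+)", line):
            counts[(channel, int(dimm))] = int(count)
    return counts


udimm_text = "all channels UDIMM0: 0 UDIMM1: 0 UDIMM2: 0"
print(parse_corrected_counts(udimm_text))
```

With registered memories the keys carry the real channel number, so each key corresponds to a single csrow, as described above.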
+
+4) Standard error counters
+
+ The standard error counters are generated when an mcelog error is received
+ by the driver. Since they are counted by software, it is possible that some
+ errors could be lost.