123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475 |
- Configurable sysfs parameters for the x86-64 machine check code.
- Machine checks report internal hardware error conditions detected
- by the CPU. Uncorrected errors typically cause a machine check
- (often with panic), corrected ones cause a machine check log entry.
- Machine checks are organized in banks (normally associated with
- a hardware subsystem) and subevents in a bank. The exact meaning
- of the banks and subevent is CPU specific.
- mcelog knows how to decode them.
- When you see the "Machine check errors logged" message in the system
- log then mcelog should run to collect and decode machine check entries
- from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
- Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
- (N = CPU number)
- The directory contains some configurable entries:
- Entries:
- bankNctl
- (N bank number)
- 64bit Hex bitmask enabling/disabling specific subevents for bank N
- When a bit in the bitmask is zero then the respective
- subevent will not be reported.
- By default all events are enabled.
- Note that BIOS maintain another mask to disable specific events
- per bank. This is not visible here
- The following entries appear for each CPU, but they are truly shared
- between all CPUs.
- check_interval
- How often to poll for corrected machine check errors, in seconds
- (Note output is hexademical). Default 5 minutes. When the poller
- finds MCEs it triggers an exponential speedup (poll more often) on
- the polling interval. When the poller stops finding MCEs, it
- triggers an exponential backoff (poll less often) on the polling
- interval. The check_interval variable is both the initial and
- maximum polling interval.
- tolerant
- Tolerance level. When a machine check exception occurs for a non
- corrected machine check the kernel can take different actions.
- Since machine check exceptions can happen any time it is sometimes
- risky for the kernel to kill a process because it defies
- normal kernel locking rules. The tolerance level configures
- how hard the kernel tries to recover even at some risk of deadlock.
- 0: always panic,
- 1: panic if deadlock possible,
- 2: try to avoid panic,
- 3: never panic or exit (for testing only)
- Default: 1
- Note this only makes a difference if the CPU allows recovery
- from a machine check exception. Current x86 CPUs generally do not.
- trigger
- Program to run when a machine check event is detected.
- This is an alternative to running mcelog regularly from cron
- and allows to detect events faster.
- TBD document entries for AMD threshold interrupt configuration
- For more details about the x86 machine check architecture
- see the Intel and AMD architecture manuals from their developer websites.
- For more details about the architecture see
- see http://one.firstfloor.org/~andi/mce.pdf
|