12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758 |
- Using RCU's CPU Stall Detector
- The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables
- RCU's CPU stall detector, which detects conditions that unduly delay
- RCU grace periods. The stall detector's idea of what constitutes
- "unduly delayed" is controlled by a pair of C preprocessor macros:
- RCU_SECONDS_TILL_STALL_CHECK
- This macro defines the period of time that RCU will wait from
- the beginning of a grace period until it issues an RCU CPU
- stall warning. It is normally ten seconds.
- RCU_SECONDS_TILL_STALL_RECHECK
- This macro defines the period of time that RCU will wait after
- issuing a stall warning until it issues another stall warning.
- It is normally set to thirty seconds.
- RCU_STALL_RAT_DELAY
- The CPU stall detector tries to make the offending CPU rat on itself,
- as this often gives better-quality stack traces. However, if
- the offending CPU does not detect its own stall in the number
- of jiffies specified by RCU_STALL_RAT_DELAY, then other CPUs will
- complain. This is normally set to two jiffies.
- The following problems can result in an RCU CPU stall warning:
- o A CPU looping in an RCU read-side critical section.
-
- o A CPU looping with interrupts disabled.
- o A CPU looping with preemption disabled.
- o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
- without invoking schedule().
- o A bug in the RCU implementation.
- o A hardware failure. This is quite unlikely, but has occurred
- at least once in a former life. A CPU failed in a running system,
- becoming unresponsive, but not causing an immediate crash.
- This resulted in a series of RCU CPU stall warnings, eventually
- leading the realization that the CPU had failed.
- The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
- SRCU does not do so directly, but its calls to synchronize_sched() will
- result in RCU-sched detecting any CPU stalls that might be occurring.
- To diagnose the cause of the stall, inspect the stack traces. The offending
- function will usually be near the top of the stack. If you have a series
- of stall warnings from a single extended stall, comparing the stack traces
- can often help determine where the stall is occurring, which will usually
- be in the function nearest the top of the stack that stays the same from
- trace to trace.
- RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE.
|