PCI Bus EEH Error Recovery
--------------------------
Linas Vepstas
<linas@austin.ibm.com>
12 January 2005

Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions. These features
go under the name of "EEH", for "Extended Error Handling". The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to corruption of both user data and kernel data,
hung/unresponsive adapters, or system crashes/lockups. Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.

Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card. Catching this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA. A number of device driver
bugs have been found and fixed in this way over the past few
years. Other possible causes of EEH errors include data or
address line parity errors (for example, from poor electrical
connectivity caused by a poorly seated card), and PCI-X
split-completion errors (due to software, device firmware, or
device PCI hardware bugs).

The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.

Detection and Recovery
----------------------
In the following discussion, a generic overview of how to detect
and recover from EEH errors will be presented. This is followed
by an overview of how the current implementation in the Linux
kernel does it. The actual implementation is subject to change,
and some of the finer points are still being debated. These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card. Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
Isolation applies to accesses to PCI memory, I/O space, and PCI
config space. Interrupts, however, will continue to be delivered.
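
The isolation behaviour described above can be modelled in a few
lines of C. This is only a user-space sketch: struct model_slot and
the model_* functions are hypothetical names invented for the
illustration, not kernel or firmware interfaces, and a single 32-bit
register stands in for the whole device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot; not a kernel structure. */
struct model_slot {
    bool isolated;   /* true once the PHB has frozen the slot */
    uint32_t reg;    /* a single 32-bit device register */
};

/* Writes are silently dropped while the slot is isolated. */
static void model_write32(struct model_slot *slot, uint32_t value)
{
    if (!slot->isolated)
        slot->reg = value;
}

/* Reads of any width return all ones while the slot is isolated. */
static uint32_t model_read(const struct model_slot *slot,
                           unsigned int width_bits)
{
    uint32_t mask = (width_bits >= 32)
        ? UINT32_MAX
        : ((UINT32_C(1) << width_bits) - 1);

    if (slot->isolated)
        return mask;             /* 0xff, 0xffff or 0xffffffff */
    return slot->reg & mask;
}
```

Note that a write issued while the slot is isolated is lost for good;
clearing the isolated flag afterwards exposes the old register value.
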

Detection and recovery are performed with the aid of ppc64
firmware. The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services). The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card. Recovery normally
would consist of resetting the PCI device (asserting the PCI #RST
line for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on). This is followed by a
reinitialization of the device driver. In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots. In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
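
The reset, reconfigure, reinitialize cycle with a bounded number of
attempts might be sketched as follows. Everything here is an
assumption made for illustration: struct recovery_ops, its hooks,
the fake_* device and the limit of three attempts (the text says
"three or four") are hypothetical and do not correspond to kernel
APIs.

```c
#include <stdbool.h>

#define MAX_RESET_ATTEMPTS 3   /* assumed; the text says "three or four" */

/* Hypothetical per-device hooks; the names are not kernel APIs. */
struct recovery_ops {
    void (*reset_slot)(void *dev);      /* pulse #RST on the slot */
    void (*restore_config)(void *dev);  /* rewrite BARs, latency timer, ... */
    bool (*reinit_driver)(void *dev);   /* true if the device responds again */
};

/*
 * Returns the number of resets it took to revive the device,
 * or -1 if it is still dead after MAX_RESET_ATTEMPTS.
 */
static int recover_device(const struct recovery_ops *ops, void *dev)
{
    for (int attempt = 1; attempt <= MAX_RESET_ATTEMPTS; attempt++) {
        ops->reset_slot(dev);
        ops->restore_config(dev);
        if (ops->reinit_driver(dev))
            return attempt;
    }
    return -1;   /* caller should report a dead card to the sysadmin */
}

/* Fake device for demonstration: revives after resets_needed resets. */
struct fake_dev {
    int resets_seen;
    int resets_needed;
};

static void fake_reset(void *d)
{
    ((struct fake_dev *)d)->resets_seen++;
}

static void fake_restore(void *d)
{
    (void)d;     /* nothing to restore in the fake */
}

static bool fake_reinit(void *d)
{
    struct fake_dev *f = d;

    return f->resets_seen >= f->resets_needed;
}

static const struct recovery_ops fake_ops = {
    fake_reset, fake_restore, fake_reinit
};
```

With these definitions, recover_device(&fake_ops, &dev) returns 2 for
a fake device that needs two resets, and -1 for one that never comes
back.
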

Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery. This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the hotplug/udev
infrastructure. Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged. The former is performed
by eeh_init() in arch/ppc64/kernel/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 hardware can run with it disabled. Effectively,
EEH can no longer be turned off. PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error. Given an arbitrary address, the routine
pci_get_device_by_addr() will find the pci device associated
with that address (if any).

The default include/asm-ppc64/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error. If it is not, processing continues as normal. The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change). Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
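
The detection path in the I/O macros can be sketched as below. This
is a user-space imitation, not the code in io.h: sim_slot,
sim_check_failure() and sim_readb() are hypothetical names, and a
plain flag plays the role of the firmware query that
eeh_dn_check_failure() makes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a slot plus the firmware's view of it. */
struct sim_slot {
    bool frozen;                    /* what "firmware" would report */
    uint8_t reg;                    /* register value when healthy */
    unsigned long false_positives;  /* 0xff reads that were real data */
};

/* Plays the role of the firmware query; counts false alarms. */
static bool sim_check_failure(struct sim_slot *slot)
{
    if (slot->frozen)
        return true;
    slot->false_positives++;        /* 0xff was legitimate data */
    return false;
}

/* Sets *failed to true if the slot turned out to be frozen. */
static uint8_t sim_readb(struct sim_slot *slot, bool *failed)
{
    uint8_t val = slot->frozen ? 0xff : slot->reg;

    *failed = false;
    if (val == 0xff)                /* only suspicious values are checked */
        *failed = sim_check_failure(slot);
    return val;
}
```

The point of the design is that the comparatively expensive firmware
query is made only when a read returns all ones, so ordinary reads
cost one extra comparison.
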

If a frozen slot is detected, code in arch/ppc64/kernel/eeh.c will
print a stack trace to syslog (/var/log/messages). This stack trace
has proven to be very useful to device-driver authors for finding
out at what point the EEH error was detected, as the error itself
usually occurs slightly beforehand.

Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure. Device
drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events. The event will include a pointer to the pci device, the
device node and some state info. Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
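
The notifier idea itself, a list of callbacks that each receive the
event, can be imitated in user space as follows. This is not the
kernel's struct notifier_block nor eeh_register_notifier(); the
sketch_* names and the seen counter are hypothetical and exist only
to make the example self-contained.

```c
#include <stddef.h>

/* User-space imitation of a notifier chain; names are illustrative. */
struct sketch_notifier {
    int (*call)(struct sketch_notifier *self,
                unsigned long event, void *data);
    struct sketch_notifier *next;
    int seen;                    /* counts events, for demonstration */
};

static struct sketch_notifier *sketch_chain;   /* head of the list */

/* Add a listener at the head of the chain. */
static void sketch_register(struct sketch_notifier *nb)
{
    nb->next = sketch_chain;
    sketch_chain = nb;
}

/* Deliver one event to every registered listener; returns how many ran. */
static int sketch_notify(unsigned long event, void *data)
{
    int delivered = 0;

    for (struct sketch_notifier *nb = sketch_chain; nb != NULL;
         nb = nb->next) {
        nb->call(nb, event, data);
        delivered++;
    }
    return delivered;
}

/* Example listener: just records that it saw an event. */
static int sketch_count_event(struct sketch_notifier *self,
                              unsigned long event, void *data)
{
    (void)event;
    (void)data;
    self->seen++;
    return 0;
}
```

A listener is a struct sketch_notifier whose call member points at a
function such as sketch_count_event(); once passed to
sketch_register(), it runs on every sketch_notify().
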

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the pci slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.

A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes hotplug events to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on. This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).
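
The order of operations in the handler can be written out as a short
outline. The functions below are hypothetical stubs that only record
their names; they stand in for the real calls named in the text,
whose actual signatures are not given here.

```c
#include <string.h>

/* Trace buffer so the order of steps can be inspected afterwards. */
static char step_trace[160];

/* Append one step name to the comma-separated trace. */
static void step(const char *name)
{
    if (step_trace[0] != '\0')
        strcat(step_trace, ",");
    strcat(step_trace, name);
}

/* Hypothetical stand-ins for the real calls named in the text. */
static void save_bars(void)          { step("save_bars"); }
static void unconfig_adapter(void)   { step("unconfig_adapter"); }
static void wait_for_userspace(void) { step("wait_for_userspace"); }
static void reset_slot(void)         { step("reset_slot"); }
static void restore_bars(void)       { step("restore_bars"); }
static void configure_bridges(void)  { step("configure_bridges"); }
static void enable_slot(void)        { step("enable_slot"); }

/* Outline of the handler: each call corresponds to one step in the text. */
static void handle_event_sketch(void)
{
    step_trace[0] = '\0';
    save_bars();
    unconfig_adapter();    /* stops the driver; hotplug events go out */
    wait_for_userspace();  /* the real handler sleeps for 5 seconds */
    reset_slot();
    restore_bars();
    configure_bridges();
    enable_slot();         /* restarts the driver; more events go out */
}
```

After handle_event_sketch() returns, step_trace holds the step names
in the order they ran.
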

Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of events that cause a device driver
close function to be called during the first phase of an EEH reset.
The example uses the pcnet32 device driver.

    rpaphp_unconfig_pci_adapter (struct slot *) // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove() // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close() // in /net/core/dev.c
                        {
                          calls dev->stop();
                          which is just pcnet32_close() // in pcnet32.c
                          {
                            which does what you wanted
                            to stop the device
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
    }}}}}}}}

    in drivers/pci/pci_driver.c,
    struct device_driver->remove() is just pci_device_remove()
    which calls struct pci_driver->remove() which is pcnet32_remove_one()
    which calls unregister_netdev() (in net/core/dev.c)
    which calls dev_close() (in net/core/dev.c)
    which calls dev->stop() which is pcnet32_close()
    which then does the appropriate shutdown.

---

Following is the analogous stack trace for events sent to user-space
when the pci device is unconfigured.

    rpaphp_unconfig_pci_adapter() { // in rpaphp_pci.c
      calls
      pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
        calls
        pci_destroy_dev (struct pci_dev *) {
          calls
          device_unregister (&dev->dev) { // in /drivers/base/core.c
            calls
            device_del(struct device * dev) { // in /drivers/base/core.c
              calls
              kobject_del() { // in /lib/kobject.c
                calls
                kobject_hotplug() { // in /lib/kobject.c
                  calls
                  kset_hotplug() { // in /lib/kobject.c
                    calls
                    kset->hotplug_ops->hotplug() which is really just
                    a call to
                    dev_hotplug() { // in /drivers/base/core.c
                      calls
                      dev->bus->hotplug() which is really just a call to
                      pci_hotplug () { // in drivers/pci/hotplug.c
                        which prints device name, etc....
                      }
                    }
                    then kset_hotplug() calls
                    call_usermodehelper () with
                      argv[0]=hotplug_path[] which is "/sbin/hotplug"
                    --> event to userspace,
                  }
                }
                kobject_del() then calls sysfs_remove_dir(), which would
                trigger any user-space daemon that was watching /sysfs,
                and would notice the delete event.
    }}}}}}

Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions. But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the pci
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems. Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped. Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets. These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer. It would be very natural to add an EEH
   reset into this chain of events.

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.
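
The suggestion in the second item, appending an EEH reset to the
existing chain of SCSI resets, amounts to an escalation loop with one
extra level. The sketch below shows only that idea; it is not the
SCSI error handler, and enum reset_level, reset_fn, escalate_resets()
and succeeds_at() are hypothetical names.

```c
#include <stdbool.h>

/*
 * Escalation levels, mildest first. The first three mirror the resets
 * the text attributes to the SCSI layer; the last is the proposed
 * addition.
 */
enum reset_level {
    RESET_DEVICE,
    RESET_BUS,
    RESET_HBA,
    RESET_EEH_SLOT,
    RESET_LEVELS
};

/* Hypothetical callback: attempt one kind of reset, true on success. */
typedef bool (*reset_fn)(enum reset_level level, void *ctx);

/* Try each level in turn; return the level that worked, or -1. */
static int escalate_resets(reset_fn try_reset, void *ctx)
{
    for (int level = RESET_DEVICE; level < RESET_LEVELS; level++) {
        if (try_reset((enum reset_level)level, ctx))
            return level;
    }
    return -1;
}

/* Demo callback: succeeds only at the level stored in *ctx. */
static bool succeeds_at(enum reset_level level, void *ctx)
{
    return (int)level == *(const int *)ctx;
}
```

Because the EEH reset is the last level, it is reached only after
the cheaper SCSI-level resets have failed, and the block layer above
sees none of it.
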

Conclusions
-----------
There's forward progress ...