PCI Error Recovery
------------------
May 31, 2005

Current document maintainer:
Linas Vepstas <linas@austin.ibm.com>

Some PCI bus controllers are able to detect certain "hard" PCI errors
on the bus, such as parity errors on the data and address buses, as
well as SERR and PERR errors. These chipsets are then able to disable
I/O to/from the affected device, so that, for example, a bad DMA
address doesn't end up corrupting system memory. These same chipsets
are also able to reset the affected PCI device, and return it to
working condition. This document describes a generic API for
performing error recovery.

The core idea is that after a PCI error has been detected, there must
be a way for the kernel to coordinate with all affected device drivers
so that the PCI card can be made operational again, possibly after
performing a full electrical #RST of the PCI card. The API below
lets device drivers be notified of PCI errors, and be notified of,
and respond to, a reset sequence.

Preliminary sketch of API, cut-n-pasted-n-modified email from
Ben Herrenschmidt, circa 5 April 2005

The error recovery API support is exposed to the driver in the form of
a structure of function pointers pointed to by a new field in struct
pci_driver. The absence of this pointer in pci_driver denotes a
"non-aware" driver; behaviour for such drivers is platform dependent.
Platforms like ppc64 can try to simulate PCI hotplug remove/add.

The definition of "pci_error_token" is not covered here. It is based on
Seto's work on the synchronous error detection. We still need to define
functions for extracting information out of an opaque error token. This
is separate from this API.

This structure has the form:

	struct pci_error_handlers
	{
		int (*error_detected)(struct pci_dev *dev, pci_error_token error);
		int (*mmio_enabled)(struct pci_dev *dev);
		int (*resume)(struct pci_dev *dev);
		int (*link_reset)(struct pci_dev *dev);
		int (*slot_reset)(struct pci_dev *dev);
	};

A driver doesn't have to implement all of these callbacks. The
only mandatory one is error_detected(). If a callback is not
implemented, the corresponding feature is considered unsupported.
For example, if mmio_enabled() and resume() aren't there, then the
driver is assumed not to do any direct recovery and to require
a reset. If link_reset() is not implemented, the card is assumed not
to care about link resets, in which case, if recovery is supported,
the core can try to recover (but it will not call slot_reset() unless
it really did reset the slot). If slot_reset() is not supported,
link_reset() can be called instead on a slot reset.

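The "missing callback means unsupported feature" rule above can be
sketched as a small helper that inspects which pointers are set. This is
only an illustration: the stand-in definitions of struct pci_dev and
pci_error_token, the DEMO_CAP_* bits, and every demo_* name are
assumptions made so the example builds outside the kernel. Treating a
structure without error_detected() as "non-aware" is likewise an
assumption, not something the API text specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch builds outside the kernel (assumptions). */
struct pci_dev { int unused; };
typedef int pci_error_token;

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*resume)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
};

/* Hypothetical capability bits; not part of the API described here. */
enum {
	DEMO_CAP_AWARE      = 1 << 0,	/* error_detected() present */
	DEMO_CAP_DIRECT     = 1 << 1,	/* mmio_enabled() and resume() present */
	DEMO_CAP_LINK_RESET = 1 << 2,	/* link_reset() present */
	DEMO_CAP_SLOT_RESET = 1 << 3	/* slot_reset() present */
};

/* Map the set of non-NULL callbacks to the features a driver supports. */
int demo_handler_caps(const struct pci_error_handlers *h)
{
	int caps = 0;

	if (h == NULL || h->error_detected == NULL)
		return 0;	/* treated here as a "non-aware" driver */
	caps |= DEMO_CAP_AWARE;
	if (h->mmio_enabled != NULL && h->resume != NULL)
		caps |= DEMO_CAP_DIRECT;
	if (h->link_reset != NULL)
		caps |= DEMO_CAP_LINK_RESET;
	if (h->slot_reset != NULL)
		caps |= DEMO_CAP_SLOT_RESET;
	return caps;
}

/* Example driver that can only be recovered through a slot reset. */
static int demo_error_detected(struct pci_dev *dev, pci_error_token error)
{
	(void)dev;
	(void)error;
	return 0;	/* placeholder result for the sketch */
}

static int demo_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return 0;	/* placeholder result for the sketch */
}

const struct pci_error_handlers demo_handlers = {
	.error_detected = demo_error_detected,
	.slot_reset     = demo_slot_reset
};
```

With these definitions, demo_handler_caps(&demo_handlers) reports the
"aware" and "slot reset" bits only, so a core following the rule above
would skip the direct-recovery path for this driver.
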
At first, the call will always be:

1) error_detected()

Error detected. This is sent once after an error has been detected. At
this point, the device might not be accessible anymore, depending on the
platform (the slot will be isolated on ppc64). The driver may already
have "noticed" the error because of a failing I/O, but this is the
proper "synchronisation point": it gives the driver a chance to clean
up and to wait for pending work (timers and so on) to complete. The
driver can take semaphores, schedule, etc.; it can do anything except
touch the device. Within this function and after it returns, the driver
shouldn't do any new I/Os. Called in task context. This is sort of a
"quiesce" point. See the note about interrupts at the end of this doc.

Result codes:
	- PCIERR_RESULT_CAN_RECOVER:
	  Driver returns this if it thinks it might be able to recover
	  the HW by just banging I/Os or if it wants to be given
	  a chance to extract some diagnostic information (see
	  below).
	- PCIERR_RESULT_NEED_RESET:
	  Driver returns this if it thinks it can't recover unless the
	  slot is reset.
	- PCIERR_RESULT_DISCONNECT:
	  Driver returns this if it thinks it won't recover at all.
	  (Whether this detaches the driver or just leaves it
	  dangling is still to be decided.)

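A minimal sketch of what a driver's error_detected() might look like,
following the description above: block new I/O, leave the device alone,
and pick one of the result codes. The numeric values of the result
codes, the stand-in struct pci_dev and pci_error_token, and all demo_*
names and fields are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

typedef int pci_error_token;	/* opaque in the text; int is a stand-in */

/* Result codes named in the text; the numeric values are assumptions. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED
};

/* Hypothetical per-device driver state. */
struct demo_priv {
	int io_blocked;		/* set once an error has been reported */
	int has_diag_regs;	/* device exposes diagnostic registers */
	int submitted;		/* count of I/O requests accepted */
};

/* Stand-in for the kernel type, reduced to what the sketch needs. */
struct pci_dev { struct demo_priv *priv; };

/* Quiesce: stop issuing new I/O, do not touch the device, pick a result. */
int demo_error_detected(struct pci_dev *dev, pci_error_token error)
{
	struct demo_priv *p = dev->priv;

	(void)error;
	assert(p != NULL);
	p->io_blocked = 1;
	return p->has_diag_regs ? PCIERR_RESULT_CAN_RECOVER
				: PCIERR_RESULT_NEED_RESET;
}

/* Normal I/O path: refuses work while error handling is in progress. */
int demo_submit_io(struct demo_priv *p)
{
	if (p->io_blocked)
		return -1;
	p->submitted++;
	return 0;
}
```

The io_blocked flag is one simple way to honour "no new I/Os within this
function and after it returns"; a real driver would also have to drain
timers and other pending work, which this sketch leaves out.
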
So at this point, we have called error_detected() for all drivers
on the segment that had the error. On ppc64, the slot is isolated. What
happens now typically depends on the results from the drivers. If all
drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would
re-enable I/Os on the slot (or do nothing special if the platform
doesn't isolate slots) and call 2). If not, and we can reset slots, we
go to 4). If neither is possible, we have a dead slot. If it's a hotplug
slot, we might "simulate" a reset by triggering HW unplug/replug though.

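The decision just described can be written down as a small function over
the results collected from every driver on the segment. This is a sketch
of the stated policy only: the numeric result-code values and the demo_*
names are assumptions, and the hotplug "simulated reset" fallback is
deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Result codes named in the text; the numeric values are assumptions. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED
};

/* Hypothetical names for the three outcomes described in the text. */
enum demo_next_step {
	DEMO_STEP_MMIO_ENABLED,	/* go to 2) */
	DEMO_STEP_SLOT_RESET,	/* go to 4) */
	DEMO_STEP_DEAD_SLOT	/* nothing more can be done */
};

/*
 * Choose the next step from the error_detected() results of all
 * drivers on the affected segment.
 */
enum demo_next_step
demo_after_error_detected(const enum pcierr_result *results, size_t n,
			  int can_reset_slot)
{
	size_t i;

	assert(results != NULL && n > 0);
	for (i = 0; i < n; i++) {
		if (results[i] != PCIERR_RESULT_CAN_RECOVER)
			return can_reset_slot ? DEMO_STEP_SLOT_RESET
					      : DEMO_STEP_DEAD_SLOT;
	}
	return DEMO_STEP_MMIO_ENABLED;	/* every driver can try to recover */
}
```

A single driver that does not return PCIERR_RESULT_CAN_RECOVER is enough
to move the whole segment off the direct-recovery path, which is why the
loop returns at the first such result.
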
>>> The current ppc64 implementation assumes that a device driver will
>>> *not* schedule or take a semaphore in this routine; the current
>>> ppc64 implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.

>>> The current ppc64 implementation doesn't much care if the device
>>> attempts I/O at this point, or not. I/Os will fail, returning
>>> a value of 0xff on read, and writes will be dropped. If the device
>>> driver attempts more than 10K I/Os to a frozen adapter, the platform
>>> will assume that the device driver has gone into an infinite loop,
>>> and it will panic the kernel.

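Given the ppc64 behaviour noted above (reads from a frozen adapter
return 0xff, and too many I/Os lead to a panic), a driver's polling
loops are better off bounded and able to recognise the all-ones value.
The following is a hedged sketch with a mock register read; the ready
bit, the poll limit, and all demo_* names are assumptions, not part of
any real driver or of the ppc64 code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_POLLS	100	/* assumption: far below the 10K limit */
#define DEMO_READY_BIT	0x01	/* assumption: bit 0 means "ready" */

static unsigned int demo_reads;	/* counts mock register reads */

/* Mock read from a frozen adapter: every read returns 0xff. */
uint8_t demo_frozen_readb(void)
{
	demo_reads++;
	return 0xff;
}

/* Mock read from a healthy adapter that is already ready. */
uint8_t demo_ready_readb(void)
{
	demo_reads++;
	return DEMO_READY_BIT;
}

/*
 * Poll a status register until the device is ready.
 * Returns 0 when ready, -1 if the adapter looks frozen,
 * -2 if the poll limit was reached.
 */
int demo_wait_ready(uint8_t (*readb)(void))
{
	int i;

	assert(readb != 0);
	for (i = 0; i < DEMO_MAX_POLLS; i++) {
		uint8_t status = readb();

		if (status == 0xff)
			return -1;	/* all ones: stop touching the device */
		if (status & DEMO_READY_BIT)
			return 0;
	}
	return -2;
}
```

Checking for 0xff before testing individual bits matters here, since an
all-ones read would otherwise look like "ready" to the bit test.
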
2) mmio_enabled()

This is the "early recovery" call. I/Os are allowed again, but DMA is
not (hrm... to be discussed, I prefer not), with some restrictions. This
is NOT a callback for the driver to start operations again, only to
peek/poke at the device, extract diagnostic information, if any, and
eventually do things like trigger a device local reset or some such,
but not restart operations. This is sent if all drivers on a segment
agree that they can try to recover and no automatic link reset was
performed by the HW. If the platform can't just re-enable I/Os without
a slot reset or a link reset, it doesn't call this callback and goes
directly to 3) or 4). All I/Os should be done _synchronously_ from
within this callback. Errors triggered by them will be returned via
the normal pci_check_whatever() API; no new error_detected() callback
will be issued due to an error happening here. However, such an error
might cause I/Os to be re-blocked for the whole segment, and thus
invalidate the recovery that other devices on the same segment might
have done, forcing the whole segment into one of the next states,
that is, link reset or slot reset.

Result codes:
	- PCIERR_RESULT_RECOVERED
	  Driver returns this if it thinks the device is fully
	  functional and thinks it is ready to start
	  normal driver operations again. There is no
	  guarantee that the driver will actually be
	  allowed to proceed, as another driver on the
	  same segment might have failed and thus triggered a
	  slot reset on platforms that support it.
	- PCIERR_RESULT_NEED_RESET
	  Driver returns this if it thinks the device is not
	  recoverable in its current state and it needs a slot
	  reset to proceed.
	- PCIERR_RESULT_DISCONNECT
	  Same as above. Total failure, no recovery even after
	  reset; driver dead. (To be defined more precisely.)

>>> The current ppc64 implementation does not implement this callback.

3) link_reset()

This is called after the link has been reset. This is typically
a PCI Express specific state at this point and is done whenever a
non-fatal error has been detected that can be "solved" by resetting
the link. This call informs the driver of the reset and the driver
should check if the device appears to be in working condition.
This function acts a bit like 2) mmio_enabled(), in that the driver
is not supposed to restart normal driver I/O operations right away.
Instead, it should just "probe" the device to check its recoverability
status. If all is right, then the core will call resume() once all
drivers have ack'd link_reset().

Result codes:
	(identical to mmio_enabled)

>>> The current ppc64 implementation does not implement this callback.

4) slot_reset()

This is called after the slot has been soft or hard reset by the
platform. A soft reset consists of asserting the adapter #RST line
and then restoring the PCI BARs and PCI configuration header. If the
platform supports PCI hotplug, then it might instead perform a hard
reset by toggling power on the slot off/on. This call gives drivers
the chance to re-initialize the hardware (re-download firmware, etc.),
but drivers shouldn't restart normal I/O processing operations at
this point. (See note about interrupts; interrupts aren't guaranteed
to be delivered until the resume() callback has been called.) If all
device drivers report success on this callback, the platform will call
resume() to complete the error handling and let the driver restart
normal I/O processing.

A driver can still return a critical failure for this function if
it can't get the device operational after reset. If the platform
previously tried a soft reset, it might now try a hard reset (power
cycle) and then call slot_reset() again. If the device still can't
be recovered, there is nothing more that can be done; the platform
will typically report a "permanent failure" in such a case. The
device will be considered "dead" in this case.

Result codes:
	- PCIERR_RESULT_DISCONNECT
	  Same as above.

>>> The current ppc64 implementation does not try a power-cycle reset
>>> if the driver returned PCIERR_RESULT_DISCONNECT. However, it should.

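The soft-then-hard escalation described for slot_reset() could look
roughly like this on the platform side. It follows the sequence in the
text, not the current ppc64 behaviour. The reset helpers are mocks that
only count calls, struct pci_dev is a stand-in, the numeric result-code
values are assumptions, and the sketch assumes that success is reported
as PCIERR_RESULT_RECOVERED.

```c
#include <assert.h>
#include <stddef.h>

/* Result codes named in the text; the numeric values are assumptions. */
enum pcierr_result {
	PCIERR_RESULT_CAN_RECOVER = 1,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT,
	PCIERR_RESULT_RECOVERED
};

/* Stand-in device: records what the mock platform did to the slot. */
struct pci_dev {
	int soft_resets;	/* #RST assertions */
	int power_cycles;	/* slot power toggled off/on */
};

typedef int (*demo_slot_reset_fn)(struct pci_dev *dev);

/* Mock platform operations (hypothetical). */
static void demo_soft_reset(struct pci_dev *dev)  { dev->soft_resets++; }
static void demo_power_cycle(struct pci_dev *dev) { dev->power_cycles++; }

/*
 * Try a soft reset first; if the driver still reports failure and the
 * slot supports hotplug, power-cycle it and call slot_reset() again.
 */
int demo_recover_slot(struct pci_dev *dev, demo_slot_reset_fn slot_reset,
		      int has_hotplug)
{
	assert(dev != NULL && slot_reset != NULL);

	demo_soft_reset(dev);
	if (slot_reset(dev) == PCIERR_RESULT_RECOVERED)
		return PCIERR_RESULT_RECOVERED;

	if (has_hotplug) {
		demo_power_cycle(dev);
		if (slot_reset(dev) == PCIERR_RESULT_RECOVERED)
			return PCIERR_RESULT_RECOVERED;
	}
	return PCIERR_RESULT_DISCONNECT;	/* "permanent failure" */
}

/* Example driver callback: the device only comes back after a power cycle. */
int demo_slot_reset(struct pci_dev *dev)
{
	return dev->power_cycles > 0 ? PCIERR_RESULT_RECOVERED
				     : PCIERR_RESULT_DISCONNECT;
}
```

On a hotplug-capable slot, demo_recover_slot() with demo_slot_reset()
ends up performing one soft reset and one power cycle before reporting
success; without hotplug it reports the device as dead after the soft
reset alone.
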
5) resume()

This is called if all drivers on the segment have returned
PCIERR_RESULT_RECOVERED from one of the 3 previous callbacks.
That basically tells the driver to restart activity, and that
everything is back and running. No result code is taken into account
here. If a new error happens, it will restart a new error handling
process.

That's it. I think this covers all the possibilities. The way those
callbacks are called is platform policy. A platform with no slot reset
capability, for example, may want to just "ignore" drivers that can't
recover (disconnect them) and try to let other cards on the same segment
recover. Keep in mind that in most real life cases, though, there will
be only one driver per segment.

Now, there is a note about interrupts. If you get an interrupt and your
device is dead or has been isolated, there is a problem :)

After much thinking, I decided to leave that to the platform. That is,
the recovery API only specifies that:

 - There is no guarantee that interrupt delivery can proceed from any
   device on the segment starting from the error detection and until the
   resume() callback is sent, at which point interrupts are expected to
   be fully operational.

 - There is no guarantee that interrupt delivery is stopped. That is, a
   driver that gets an interrupt after detecting an error, or that
   detects an error within the interrupt handler such that it prevents
   proper ack'ing of the interrupt (and thus removal of the source),
   should just return IRQ_NONE (i.e. "not handled"). It's up to the
   platform to deal with that condition, typically by masking the irq
   source during the duration of the error handling. It is expected that
   the platform "knows" which interrupts are routed to error-management
   capable slots and can deal with temporarily disabling that irq number
   during error processing (this isn't terribly complex). That means
   some IRQ latency for other devices sharing the interrupt, but there
   is simply no other way. High end platforms aren't supposed to share
   interrupts between many devices anyway :)

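The second rule above can be illustrated with a mock interrupt handler
that declines the interrupt whenever the device is known, or found, to
be in an error state. The return values use the kernel's spelling for
"not handled" and "handled" (IRQ_NONE and IRQ_HANDLED), redefined
locally so the example builds in user space. The device structure, the
all-ones "frozen" status value, and the demo_* names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in using the kernel's names for handler return values. */
enum demo_irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

/* Assumption: a frozen device reads back as all ones. */
#define DEMO_STATUS_FROZEN 0xffffffffu

/* Hypothetical device state, with a mock interrupt status register. */
struct demo_dev {
	int in_error;		/* set by error_detected() or by the handler */
	uint32_t status;	/* pending interrupt causes */
	int acked;		/* number of interrupts acknowledged */
};

enum demo_irqreturn demo_irq_handler(struct demo_dev *d)
{
	assert(d != 0);

	if (d->in_error)
		return IRQ_NONE;	/* error handling in progress */

	if (d->status == DEMO_STATUS_FROZEN) {
		/* Error found inside the handler: cannot ack the source. */
		d->in_error = 1;
		return IRQ_NONE;
	}

	if (d->status == 0)
		return IRQ_NONE;	/* shared line, not our interrupt */

	d->status = 0;			/* ack: clear the pending causes */
	d->acked++;
	return IRQ_HANDLED;
}
```

Returning "not handled" leaves the source asserted, which is exactly the
condition the text says the platform must cope with, typically by
masking the irq until error handling completes.
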
Revised: 31 May 2005 Linas Vepstas <linas@austin.ibm.com>