I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer, see the random notes section below.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finish before issuing the barrier request.  Also,
it defers requests following the barrier until the barrier request is
finished.  Older SCSI controllers/drives and SATA drives fall in this
category.

iii. Devices which have queue depth of 1.  This is a degenerate case
of ii.  Just keeping issue order suffices.  Ancient SCSI
controllers/drives and IDE drives are in this category.

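The ordering property can be stated as a check over a request trace.
The following is only an illustrative sketch, not block layer code:
struct req_trace and barrier_order_ok() are hypothetical names, and
the timestamps are assumed to come from some external trace of when
the device started each request and when its data reached the medium.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical trace record: one entry per request, in queueing order. */
struct req_trace {
	bool is_barrier;
	long start;	/* time the device started processing the request */
	long finish;	/* time the data made it to the physical medium */
};

/*
 * Return true if the trace respects every barrier in it: each request
 * queued before a barrier finished before the barrier started, and
 * each request queued after it started only after the barrier finished.
 */
static bool barrier_order_ok(const struct req_trace *t, size_t n)
{
	for (size_t b = 0; b < n; b++) {
		if (!t[b].is_barrier)
			continue;
		assert(t[b].start <= t[b].finish);

		for (size_t i = 0; i < b; i++)
			if (t[i].finish > t[b].start)
				return false;	/* preceding request passed the barrier */

		for (size_t i = b + 1; i < n; i++)
			if (t[i].start < t[b].finish)
				return false;	/* following request started too early */
	}
	return true;
}
```

Note that requests on the same side of a barrier may still start and
finish in any order relative to each other; only crossing the barrier
is rejected.
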
2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other event abruptly stops the drive from operating and possibly makes
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases.

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  Devices of this kind can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.

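The four cases can be summarized as a small decision function.  This
is a sketch for illustration only: struct cache_caps, struct
flush_plan and plan_flushes() are made-up names and do not exist in
the block layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a device's cache features. */
struct cache_caps {
	bool write_back;	/* device has a write-back cache */
	bool flush;		/* device has a cache flush operation */
	bool fua;		/* device supports FUA writes */
};

/* What has to happen around one barrier write. */
struct flush_plan {
	bool supported;		/* false: device can't do I/O barriers */
	bool preflush;		/* flush cache before the barrier */
	bool postflush;		/* flush cache after the barrier */
	bool fua_write;		/* issue the barrier itself as a FUA write */
};

static struct flush_plan plan_flushes(struct cache_caps c)
{
	struct flush_plan p = { .supported = true };

	if (!c.write_back)
		return p;		/* case i: ordering alone is enough */

	if (!c.flush) {
		p.supported = false;	/* case ii: no way to commit in order */
		return p;
	}

	p.preflush = true;		/* cases iii and iv both need this */
	if (c.fua)
		p.fua_write = true;	/* case iv: FUA replaces the post-flush */
	else
		p.postflush = true;	/* case iii: two flushes */

	/* A FUA barrier never needs a post-barrier flush as well. */
	assert(!(p.fua_write && p.postflush));
	return p;
}
```
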
How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside block layer proper.  All low level
drivers have to do is implement their prepare_flush_fn and use the
following function to indicate what barrier type they support and how
to prepare flush requests.  Note that the term 'ordered' is used to
indicate the whole sequence of performing barrier requests including
draining and flushing.

typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);

int blk_queue_ordered(struct request_queue *q, unsigned ordered,
                      prepare_flush_fn *prepare_flush_fn);

@q                  : the queue in question
@ordered            : the ordered mode the driver/device supports
@prepare_flush_fn   : this function should prepare @rq such that it
                      flushes cache to physical medium when executed

For example, SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
	rq->cmd_len = 10;
}

The following seven ordered modes are supported.  The following table
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

               write-back cache   ordered tag   flush   FUA
-----------------------------------------------------------------------
NONE           yes/no             N/A           no      N/A
DRAIN          no                 no            N/A     N/A
DRAIN_FLUSH    yes                no            yes     no
DRAIN_FUA      yes                no            yes     yes
TAG            no                 yes           N/A     N/A
TAG_FLUSH      yes                yes           yes     no
TAG_FUA        yes                yes           yes     yes

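Read row by row, the table amounts to the following selection logic.
This is a sketch under assumptions: the ORDERED_* enum is a local
stand-in for the QUEUE_ORDERED_* modes with arbitrary values, and
struct dev_features and pick_ordered_mode() are hypothetical.  A
driver may also choose NONE simply because barriers aren't needed,
which this function does not model.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the QUEUE_ORDERED_* modes; values are arbitrary. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Hypothetical feature set, one field per table column. */
struct dev_features {
	bool write_back_cache;
	bool ordered_tag;
	bool flush;
	bool fua;
};

static enum ordered_mode pick_ordered_mode(struct dev_features f)
{
	enum ordered_mode mode;

	if (!f.write_back_cache)
		/* no cache to flush; flush and FUA columns are N/A */
		mode = f.ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;
	else if (!f.flush)
		/* write-back cache without flush: barriers unsupported */
		mode = ORDERED_NONE;
	else if (f.ordered_tag)
		mode = f.fua ? ORDERED_TAG_FUA : ORDERED_TAG_FLUSH;
	else
		mode = f.fua ? ORDERED_DRAIN_FUA : ORDERED_DRAIN_FLUSH;

	/* FUA modes are only picked for devices that report FUA. */
	assert((mode != ORDERED_TAG_FUA && mode != ORDERED_DRAIN_FUA) || f.fua);
	return mode;
}
```

For instance, a device with a write-back cache and a flush operation
but neither ordered tags nor FUA ends up with ORDERED_DRAIN_FLUSH.
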
QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushings are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed.  By using FUA on barrier
	request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushings are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed.  By using FUA on barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier

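The sequences above follow a regular pattern, which the sketch below
makes explicit by generating each "Sequence:" line from its mode.  It
is illustrative only: the ORDERED_* enum is a local stand-in for the
QUEUE_ORDERED_* modes with arbitrary values, and ordered_sequence() is
a hypothetical helper, not a block layer function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins for the QUEUE_ORDERED_* modes; values are arbitrary. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Write the step sequence for @mode into @buf, e.g. "drain => barrier". */
static void ordered_sequence(enum ordered_mode mode, char *buf, size_t len)
{
	bool drain = mode == ORDERED_DRAIN || mode == ORDERED_DRAIN_FLUSH ||
		     mode == ORDERED_DRAIN_FUA;
	bool pre = mode == ORDERED_DRAIN_FLUSH || mode == ORDERED_DRAIN_FUA ||
		   mode == ORDERED_TAG_FLUSH || mode == ORDERED_TAG_FUA;
	bool post = mode == ORDERED_DRAIN_FLUSH || mode == ORDERED_TAG_FLUSH;
	/* '=>' waits for completion of each step, '->' only for its issue */
	const char *sep = drain ? " => " : " -> ";

	assert(len > 0);
	if (mode == ORDERED_NONE) {
		snprintf(buf, len, "N/A");
		return;
	}

	snprintf(buf, len, "%s%s%s%s%s%s%s",
		 drain ? "drain" : "", drain ? sep : "",
		 pre ? "preflush" : "", pre ? sep : "",
		 "barrier",
		 post ? sep : "", post ? "postflush" : "");
}
```

The FUA modes differ from the FLUSH modes only in dropping the final
postflush step, and the TAG modes differ from the DRAIN modes in
dropping the drain step and relaxing '=>' to '->'.
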
Random notes/caveats
--------------------

* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that the SCSI
midlayer request dispatch function is not atomic.  It releases the
queue lock and switches to the SCSI host lock during issue, and in
that window requests can, and in time likely will, change their
relative positions.  Once this problem is solved, TAG ordering can be
enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by
block layer until the previous I/O barrier is complete.  This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to low level *might* be helpful if they are very frequent.  Well, this
certainly is a non-issue.  I'm writing this just to make clear that no
two I/O barriers are ever passed to a low-level driver at the same
time.

* Completion order.  Requests in an ordered sequence are issued in
order but are not required to finish in order.  The barrier
implementation can handle out-of-order completion of an ordered
sequence.  IOW, the requests MUST be processed in order but the
hardware/software completion paths are allowed to reorder completion
notifications - e.g. current SCSI midlayer doesn't preserve completion
order during error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after they have removed it from the request queue with
blkdev_dequeue_request().  As a barrier sequence should be kept in
order when requeued, generic elevator code takes care of putting
requests in order around the barrier.  See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence.  Currently, no
error checking is done against this.

* Error handling.  Currently, block layer will report error to upper
layer if any of requests in an ordered sequence fails.  Unfortunately,
this doesn't seem to be enough.  Look at the following request flow.
QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
                                          still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid.  Consider the following sequence.

 i.   Requests [0] ~ [post] leave the request queue and enter the
      low-level driver.
 ii.  After a while, unfortunately, something goes wrong and the
      drive fails [2].  Note that any of [0], [1] and [3] could have
      completed by this time, but [pre] couldn't have been finished
      as the drive must process it in order and it failed before
      processing that command.
 iii. Error handling kicks in and determines that the error is
      unrecoverable and fails [2], and resumes operation.
 iv.  [pre] [barrier] [post] get processed.
 v.   *BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower-level drivers resume operation on error only after
block layer tells them to do so.

As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably overkill.  But, still,
it's there.

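The gap between the current behaviour and the proposed fix can be
shown with a toy model of the request flow above.  Everything here is
hypothetical (struct req, the barrier_depends flag and both
functions); no such interface exists in the block layer, and the
sketch leaves out the second half of the fix, holding low-level
drivers until block layer tells them to resume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-request record for one dispatched batch. */
struct req {
	bool in_ordered_seq;	/* one of [pre] [barrier] [post] */
	bool barrier_depends;	/* filesystem says the barrier vouches for it */
	bool failed;
};

/* Current behaviour: only failures inside the ordered sequence count. */
static bool barrier_ok_current(const struct req *r, size_t n)
{
	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (r[i].in_ordered_seq && r[i].failed)
			return false;
	return true;
}

/* Proposed behaviour: a failed dependency also fails the barrier. */
static bool barrier_ok_with_deps(const struct req *r, size_t n)
{
	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if ((r[i].in_ordered_seq || r[i].barrier_depends) &&
		    r[i].failed)
			return false;
	return true;
}
```

With [2] and [3] marked as dependencies and [2] failed, the first
check still reports the barrier as successful while the second one
does not.
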
* In previous drafts of the barrier implementation, there was a
fallback mechanism such that, if FUA or ordered TAG fails, a less
fancy ordered mode can be selected and the failed barrier request is
retried automatically.  The rationale for this feature was that as FUA
is pretty new in ATA world and ordered tag was never used widely,
there could be devices which report to support those features but
choke when actually given such requests.

This was removed for two reasons: 1. it's overkill; 2. it's
impossible to implement properly when TAG ordering is used, as low
level drivers resume after an error automatically.  If it's ever
needed, adding it back and modifying low level drivers accordingly
shouldn't be difficult.
  203. shouldn't be difficult.