Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.

If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users, don't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.

Also, users with hardware RAID controllers doing striping may find
highly variable performance results when using as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.

However, setting antic_expire to zero (see tunable parameters below) produces
behavior very similar to that of the deadline IO scheduler.

Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.

Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies
to determine when an IO request is dispatched to the disk controller.
The policies are outlined here, in order of application.

1. one-way Elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request behind the elevator is less than half the seek distance to
the request in front of the elevator, then the request behind can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.
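
The backward-seek rule above can be sketched as a small decision function.
This is only an illustration of the rule as worded here, not the kernel
code: the function name pick_next and its arguments are hypothetical, and
only MAXBACK and the "less than half" comparison come from the text.

```python
MAXBACK = 1024 * 1024  # maximum backward seek in sectors, as stated above


def pick_next(head, ahead, behind):
    """Return the sector of the request to service next.

    head   -- current elevator position (sector)
    ahead  -- sector of the nearest request in front of the elevator
    behind -- sector of the nearest request behind the elevator
    """
    forward_dist = ahead - head
    backward_dist = head - behind
    # The request behind is chosen only if it lies within MAXBACK sectors
    # and its seek distance is less than half the forward seek distance.
    if backward_dist <= MAXBACK and 2 * backward_dist < forward_dist:
        return behind
    return ahead
```

For example, with the elevator at sector 1000, a request at 900 is preferred
over one at 2000, but not over one at 1100.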

2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
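
A minimal sketch of such an expiration FIFO follows, assuming one list per
direction and a caller-supplied clock in milliseconds. The class ExpiryFifo
and its methods are hypothetical names used for illustration; the real
scheduler keeps this state inside its own request structures.

```python
from collections import deque


class ExpiryFifo:
    """Requests in arrival order, each stamped with a deadline."""

    def __init__(self, expire_ms):
        self.expire_ms = expire_ms      # read_expire or write_expire
        self._queue = deque()           # (deadline_ms, request), oldest first

    def add(self, request, now_ms):
        self._queue.append((now_ms + self.expire_ms, request))

    def expired_head(self, now_ms):
        """Return the oldest request if its deadline has passed, else None."""
        if self._queue and self._queue[0][0] <= now_ms:
            return self._queue[0][1]
        return None

    def pop_head(self):
        """Remove and return the oldest request."""
        return self._queue.popleft()[1]
```

Only the head needs checking: requests are appended in arrival order with
the same expiration time, so the oldest request always expires first.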

3. Read and write request batching

A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.
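
The condition for continuing a read batch can be sketched as below. This is
a simplification under stated assumptions: the function name and arguments
are hypothetical, the elapsed time is assumed to be measured from the moment
writes started competing, and the gradual shortening of write batches is not
modeled.

```python
def read_batch_continues(pending_reads, pending_writes,
                         ms_with_writes_waiting, read_batch_expire):
    """Decide whether the current read batch may dispatch another read."""
    if pending_reads == 0:
        return False            # nothing left to submit
    if pending_writes == 0:
        return True             # the time limit is not counting down
    # Writes are competing: continue only until the batch limit is exceeded.
    return ms_with_writes_waiting <= read_batch_expire
```

A write batch would follow the same shape with write_batch_expire, minus the
"counts down only when competing requests are pending" condition, which the
text states only for read batches.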

The read and write fifo expiration times described in policy 2 above
are checked only when scheduling IO of a batch for the corresponding
(read/write) type. So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).
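
The recommendation above amounts to two comparisons between tunables. A
small checker might look like the following sketch; the function name
batch_warnings is hypothetical, and the dictionary keys are the tunable
names described later in this document, with all values in milliseconds.

```python
def batch_warnings(tunables):
    """Return warnings for batch times that outlast the other FIFO's expiry."""
    warnings = []
    # Write deadlines are not checked during a read batch, so a read batch
    # longer than write_expire lets writes sit past their expiration time.
    if tunables["read_batch_expire"] > tunables["write_expire"]:
        warnings.append("read_batch_expire exceeds write_expire")
    # Likewise, read deadlines are not checked during a write batch.
    if tunables["write_batch_expire"] > tunables["read_expire"]:
        warnings.append("write_batch_expire exceeds read_expire")
    return warnings
```

The values passed in are whatever the administrator intends to set; no
defaults are assumed here.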

When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is on the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.

4. Read anticipation.

Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined. If it seems
likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to (antic_expire)
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.
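
The decision made when a read completes can be sketched as follows. The
text does not give numeric thresholds, so close_sectors (what counts as
"very close"), near_sectors (what counts as "near"), and the comparison of
mean think time against antic_expire are all assumptions made for
illustration, as are the names ProcStats and after_read_completes.

```python
from dataclasses import dataclass


@dataclass
class ProcStats:
    mean_think_ms: float       # average time between the process's reads
    mean_seek_sectors: float   # average seek distance of its reads


def after_read_completes(done_pid, done_sector, next_pid, next_sector,
                         stats, antic_expire, close_sectors, near_sectors):
    """Return ("dispatch", 0) or ("wait", max_wait_ms).

    stats describes the process whose read just completed.
    """
    if antic_expire == 0:
        return ("dispatch", 0)          # anticipation disabled
    same_process = next_pid == done_pid
    very_close = abs(next_sector - done_sector) <= close_sectors
    if same_process or very_close:
        return ("dispatch", 0)
    # Assumed heuristic: wait only if the process usually comes back
    # within the anticipation window and usually reads nearby.
    likely_soon = stats.mean_think_ms <= antic_expire
    likely_near = stats.mean_seek_sectors <= near_sectors
    if likely_soon and likely_near:
        return ("wait", antic_expire)
    return ("dispatch", 0)
```

A "wait" result means dispatching stops for at most antic_expire
milliseconds; a suitable new request from that process ends the wait early.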

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process. Note that these statistics are kept per
process, not per IO device. So for example, if a process is doing IO
on several file systems on separate devices, the statistics will be
a combination of IO behavior from all those devices.
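
The per-process (rather than per-device) bookkeeping can be illustrated
with a table keyed only by process ID. This sketch uses a plain arithmetic
mean, which is an assumption: the text does not say how the scheduler
weights its samples. The names ProcessIoStats and record_read are
hypothetical.

```python
from collections import defaultdict


class ProcessIoStats:
    """Running totals for one process, across all devices it reads from."""

    def __init__(self):
        self.samples = 0
        self.total_think_ms = 0.0
        self.total_seek_sectors = 0.0

    def record(self, think_ms, seek_sectors):
        self.samples += 1
        self.total_think_ms += think_ms
        self.total_seek_sectors += seek_sectors

    def mean_think_ms(self):
        return self.total_think_ms / self.samples

    def mean_seek_sectors(self):
        return self.total_seek_sectors / self.samples


stats_by_pid = defaultdict(ProcessIoStats)


def record_read(pid, device, think_ms, seek_sectors):
    # The device is deliberately ignored: samples from every device the
    # process touches land in the same per-process entry.
    stats_by_pid[pid].record(think_ms, seek_sectors)
```

A process that reads sequentially from one device and randomly from another
therefore ends up with a blended mean seek distance that describes neither
workload well.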

Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler, there are 5 parameters under
/sys/block/*/queue/iosched/. All are in units of milliseconds.

The parameters are:
* read_expire
Controls how long until a read request becomes "expired". It also controls the
interval between which expired requests are served, so if set to 50, a request
might take anywhere up to 100ms to be serviced _if_ it is the next on the
expired list. Obviously request expiration strategies won't make the disk
go faster. The result basically equates to the timeslice a single reader
gets in the presence of other IO. 100/((seek time / read_expire) + 1) is
very roughly the % streaming read efficiency your disk should get with
multiple readers.

* read_batch_expire
Controls how much time a batch of reads is given before pending writes are
served. A higher value is more efficient. This might be set below read_expire
if writes are to be given higher priority than reads, but reads are to be
as efficient as possible when there are no writes. Generally though, it
should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
Controls the maximum amount of time we can anticipate a good read (one
with a short seek distance from the most recently completed request) before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. It should be a bit higher
for devices with long seek times, though the correspondence is not linear:
most processes have only a few ms of think time.
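
As a worked example of the read_expire efficiency estimate, the sketch
below treats each reader's timeslice as read_expire milliseconds of
streaming plus one seek, so the useful fraction is
100 / ((seek time / read_expire) + 1). That reading of the estimate is an
interpretation, and the function name is hypothetical.

```python
def streaming_read_efficiency(seek_ms, read_expire):
    """Very rough % of disk time spent streaming with multiple readers."""
    return 100.0 / ((seek_ms / read_expire) + 1)
```

With a 10 ms seek time, a read_expire of 40 gives about 80%, and raising it
to 90 gives about 90%: longer timeslices amortize each seek over more
streaming, at the cost of longer waits for the other readers.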

In addition to the tunables above there is a read-only file named est_time
which, when read, will show:

- The probability of a task exiting without a cooperating task
  submitting an anticipated IO.

- The current mean think time.

- The seek distance used to determine if an incoming IO is better.
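
The five tunables can be read back from sysfs with a few lines of script.
This is a sketch that assumes each tunable file holds a single integer in
milliseconds and that the device is currently using the 'as' scheduler. The
function name and the sys_block argument (present so the code can be tried
against a scratch directory) are hypothetical. est_time is not parsed here
because its exact output format is not described above.

```python
from pathlib import Path

AS_TUNABLES = (
    "read_expire",
    "read_batch_expire",
    "write_expire",
    "write_batch_expire",
    "antic_expire",
)


def read_as_tunables(device, sys_block="/sys/block"):
    """Return {tunable name: value in ms} for one block device."""
    iosched = Path(sys_block) / device / "queue" / "iosched"
    values = {}
    for name in AS_TUNABLES:
        # Assumption: the file contains one integer followed by a newline.
        values[name] = int((iosched / name).read_text().strip())
    return values
```

For example, read_as_tunables("sda") would return a dictionary with one
entry per tunable for the device sda, and raise FileNotFoundError if the
iosched directory does not expose these files.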