Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation used to support RDS over TCP as well
as IB. Work is in progress to support RDS over iWARP, and using DCE to
guarantee no dropped packets on Ethernet, it may be possible to use RDS over
UDP in the future.

The high-level semantics of RDS from the application's point of view are:

 *  Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 *  Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 *  sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds.

Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        These constants haven't been assigned yet, because RDS isn't in
        mainline yet. Currently, the kernel module assigns some constant
        and publishes it to user space through two sysctl files
                /proc/sys/net/rds/pf_rds
                /proc/sys/net/rds/sol_rds

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDBUF bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVBUF option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.
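
As a rough illustration of the buffer limits described above, the helper
below sets both queue limits on a socket. It is a minimal sketch that assumes
the limits are the standard SO_SNDBUF and SO_RCVBUF options at the SOL_SOCKET
level; the name set_queue_limits() is hypothetical and not part of RDS.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: set the send and receive queue limits of a socket.
 * Returns 0 on success, -1 on failure (errno is left as set by setsockopt).
 */
int set_queue_limits(int fd, int snd_bytes, int rcv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &snd_bytes, sizeof(snd_bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcv_bytes, sizeof(rcv_bytes)) < 0)
        return -1;
    return 0;
}
```

Since the receive limit is soft, an application would typically choose it
large enough that peers rarely see the port marked congested.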

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport.
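
The socket() and bind() calls above might be combined as in the sketch
below. It assumes a system whose headers already define PF_RDS; the fallback
value 21 is the number Linux headers use, whereas the text above describes
reading it from /proc/sys/net/rds/pf_rds. The function name rds_open_bound()
is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef PF_RDS
#define PF_RDS 21   /* assumption: value found in Linux headers */
#endif

/*
 * Hypothetical helper: create an RDS socket and bind it to a local IPv4
 * address and port. Returns the socket fd, or -1 on any failure.
 */
int rds_open_bound(const char *local_ip, uint16_t port)
{
    struct sockaddr_in sin;
    int fd;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    if (inet_pton(AF_INET, local_ip, &sin.sin_addr) != 1)
        return -1;                  /* not a valid IPv4 address */

    fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
    if (fd < 0)
        return -1;                  /* e.g. RDS support not available */

    /* Binding also selects the transport that serves this address. */
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```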

  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDBUF will
        return EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDBUF threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.
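
The three error cases above call for different reactions from the sender. The
sketch below maps a sendmsg() result to an action; the enum and function
names are hypothetical, and only the errno values come from the text.

```c
#include <errno.h>

/* Hypothetical classification of a sendmsg() outcome on an RDS socket. */
enum send_action {
    SEND_OK,              /* message was queued */
    SEND_TOO_BIG,         /* EMSGSIZE: message exceeds the send buffer size */
    SEND_QUEUE_FULL,      /* EAGAIN: send queue is full, wait for POLLOUT */
    SEND_DEST_CONGESTED,  /* ENOBUFS: wait for a congestion update instead */
    SEND_FAILED           /* any other error */
};

/* rc is the return value of sendmsg(), err the errno seen right after it. */
enum send_action classify_send_result(long rc, int err)
{
    if (rc >= 0)
        return SEND_OK;

    switch (err) {
    case EMSGSIZE:
        return SEND_TOO_BIG;
    case EAGAIN:
        return SEND_QUEUE_FULL;
    case ENOBUFS:
        return SEND_DEST_CONGESTED;
    default:
        return SEND_FAILED;
    }
}
```

Keeping SEND_QUEUE_FULL and SEND_DEST_CONGESTED apart matters because only
the first one is resolved by waiting for the socket to become writable.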

  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVBUF, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrives, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in the manpages.

  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.
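
Waiting for POLLIN, which per the text covers both queued messages and
pending notifications, could look like the sketch below. The helper name
wait_readable() is hypothetical; it uses only the standard poll() call and
works on any file descriptor.

```c
#include <poll.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if POLLIN is set, 0 on timeout, -1 on error or if poll()
 * reported only an error or hangup condition.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A sender that got ENOBUFS would wait this way for the congestion
notification rather than spin on POLLOUT, which stays set while the send
queue has room.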

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        This allows the application to cancel outstanding messages if
        it detects a timeout. For instance, if it tried to send a message,
        and the remote host is unreachable, RDS will keep trying forever.
        The application may decide it's not worth it, and cancel the
        operation. In this case, it would use RDS_CANCEL_SENT_TO to
        nuke any pending messages.
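
A thin wrapper around the call above might look like this. It is a sketch:
the fallback values for SOL_RDS and RDS_CANCEL_SENT_TO are assumptions taken
from the Linux headers (the text above describes reading SOL_RDS from
/proc/sys/net/rds/sol_rds), and the wrapper name is hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SOL_RDS
#define SOL_RDS 276             /* assumption: value found in Linux headers */
#endif
#ifndef RDS_CANCEL_SENT_TO
#define RDS_CANCEL_SENT_TO 1    /* assumption: value found in linux/rds.h */
#endif

/*
 * Hypothetical wrapper: discard every message this socket still has
 * queued for the given destination. Returns 0 on success, -1 on error.
 */
int rds_cancel_sent_to(int fd, const struct sockaddr_in *dest)
{
    return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                      dest, sizeof(*dest));
}
```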

RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)

Congestion Notifications
========================

  see rds(7) manpage

RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:
      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          CONG_BITMAP - this is a congestion update bitmap
          ACK_REQUIRED - receiver must ack this packet
          RETRANSMITTED - packet has previously been sent
      h_credit:
          indicate to other end of connection that
          it has more credits available (i.e. there is
          more send room)
      h_padding[4]:
          unused, for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.
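
For orientation, the field list above could be laid out as the following C
struct. This is an illustrative sketch, not the definition from rds.h: the
field widths, the flag bit values and the size of h_exthdr are assumptions,
and the struct and helper names are hypothetical. On the wire the
multi-byte fields would be in network byte order.

```c
#include <stdint.h>

/* Assumed flag bit values; consult rds.h for the real definitions. */
#define RDS_FLAG_CONG_BITMAP    0x01
#define RDS_FLAG_ACK_REQUIRED   0x02
#define RDS_FLAG_RETRANSMITTED  0x04

/* Assumed size of the extension header area. */
#define RDS_HEADER_EXT_SPACE    16

struct rds_header_sketch {
    uint64_t h_sequence;    /* per-packet sequence number */
    uint64_t h_ack;         /* piggybacked ack of last packet received */
    uint32_t h_len;         /* payload length, header not included */
    uint16_t h_sport;       /* source port */
    uint16_t h_dport;       /* destination port */
    uint8_t  h_flags;       /* RDS_FLAG_* bits */
    uint8_t  h_credit;      /* send credits granted to the peer */
    uint8_t  h_padding[4];  /* unused, for future use */
    uint16_t h_csum;        /* header checksum */
    uint8_t  h_exthdr[RDS_HEADER_EXT_SPACE]; /* optional, e.g. RDMA info */
};

/* Returns nonzero if the sender asked for this packet to be acked. */
int hdr_needs_ack(const struct rds_header_sketch *h)
{
    return (h->h_flags & RDS_FLAG_ACK_REQUIRED) != 0;
}
```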

  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received. The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory. This creates a potential message loss if the HCA is
      disabled for any reason between when it sends the ack and before
      the message is DMAed and processed. This is only a potential issue
      if another HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from their send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets. Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.
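
The idea behind credit-based flow control can be sketched as a counter that
the sender draws from before posting data and that the peer tops up, for
example through the h_credit header field. The code below is a generic
illustration with hypothetical names, not the IB transport's implementation.

```c
#include <stdint.h>

/* Hypothetical per-connection credit state kept by the sender. */
struct send_credits {
    uint32_t avail;     /* receive buffers the peer is known to have free */
};

/*
 * Reserve up to `wanted` credits before sending.
 * Returns how many were granted, which may be fewer than asked for or 0.
 */
uint32_t credits_grab(struct send_credits *c, uint32_t wanted)
{
    uint32_t got = wanted < c->avail ? wanted : c->avail;

    c->avail -= got;
    return got;
}

/* The peer advertised more receive room, e.g. in an incoming h_credit. */
void credits_add(struct send_credits *c, uint8_t advertised)
{
    c->avail += advertised;
}
```

A sender that is granted zero credits simply holds its data until the peer
advertises more, so nothing is posted that the receiver cannot buffer.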

  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value. Only
      the payload bytes in the message are accounted for. If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested. All sends attempted to this socket's address
      should block or return -EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs. An application encountering this
      "back-pressure" is considered a bug.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested. As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up. This
      avoids allocation in the interrupt handling path which queues
      messages on sockets. The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently. This
      is much easier to implement than some finer-grained
      communication of per-port congestion. The sender does a very
      inexpensive bit test to test if the port it's about to send to
      is congested or not.
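
A dense per-address bitmap with one bit per 16-bit port, as described above,
can be sketched as follows. The struct and function names are hypothetical
and the bit ordering is an assumption; the real map is wrapped by
struct rds_cong_map.

```c
#include <stdint.h>

#define CONG_PORTS 65536    /* one bit for every 16-bit port number */

/* Hypothetical congestion map for one bound address (8 KB). */
struct cong_map_sketch {
    uint8_t bits[CONG_PORTS / 8];
};

/* Mark a port congested (receive queue reached its limit). */
void cong_set(struct cong_map_sketch *m, uint16_t port)
{
    m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Mark a port uncongested again. */
void cong_clear(struct cong_map_sketch *m, uint16_t port)
{
    m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* The sender's cheap check before queueing a message to `port`. */
int cong_test(const struct cong_map_sketch *m, uint16_t port)
{
    return (m->bits[port >> 3] >> (port & 7)) & 1;
}
```

With this layout a full map is 8 KB, which is why sending the whole bitmap on
each change stays reasonably cheap.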

RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.

RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.
  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.
  struct rds_socket
    per-socket information
  struct rds_connection
    per-connection information
  struct rds_transport
    pointers to transport-specific functions
  struct rds_statistics
    non-transport-specific statistics
  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.

The send path
=============

  rds_sendmsg()
    struct rds_message built from incoming data
    CMSGs parsed (e.g. RDMA ops)
    transport connection alloced and connected if not already
    rds_message placed on send queue
    send worker awoken
  rds_send_worker()
    calls rds_send_xmit() until queue is empty
  rds_send_xmit()
    transmits congestion map if one is pending
    may set ACK_REQUIRED
    calls transport to send either non-RDMA or RDMA message
    (RDMA ops never retransmitted)
  rds_ib_xmit()
    allocs work requests from send ring
    adds any new send credits available to peer (h_credit)
    maps the rds_message's sg list
    piggybacks ack
    populates work requests
    post send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    looks at write completions
    unmaps recv buffer from device
    no errors, call rds_ib_process_recv()
    refill recv ring
  rds_ib_process_recv()
    validate header checksum
    copy header to rds_ib_incoming struct if start of a new datagram
    add to ibinc's fraglist
    if completed datagram:
      update cong map if datagram was cong update
      call rds_recv_incoming() otherwise
      note if ack is required
  rds_recv_incoming()
    drop duplicate packets
    respond to pings
    find the sock associated with this datagram
    add to sock queue
    wake up sock
    do some congestion calculations
  rds_recvmsg()
    copy data into user iovec
    handle CMSGs
    return to application
  283. copy data into user iovec
  284. handle CMSGs
  285. return to application