Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well as
IB. Work is in progress to support RDS over iWARP, and, by using DCE to
guarantee no dropped packets on Ethernet, it may become possible to run
RDS over UDP in the future.

The high-level semantics of RDS from the application's point of view are:

 * Addressing
    RDS uses IPv4 addresses and 16-bit port numbers to identify
    the end point of a connection. Socket operations that involve
    passing addresses between kernel and user space generally
    use a struct sockaddr_in.

    The fact that IPv4 addresses are used does not mean the underlying
    transport has to be IP-based. In fact, RDS over IB uses a
    reliable IB connection; the IP address is used exclusively to
    locate the remote node's GID (by ARPing for the given IP).

    The port space is entirely independent of UDP, TCP or any other
    protocol.

 * Socket interface
    RDS sockets work *mostly* as you would expect from a BSD
    socket. The next section will cover the details. At any rate,
    all I/O is performed through the standard BSD socket API.
    Some additions like zerocopy support are implemented through
    control messages, while other extensions use the getsockopt/
    setsockopt calls.

    Sockets must be bound before you can send or receive data.
    This is needed because binding also selects a transport and
    attaches it to the socket. Once bound, the transport assignment
    does not change. RDS will tolerate IPs moving around (e.g. in
    an active-active HA scenario), but only as long as the address
    doesn't move to a different transport.

 * sysctls
    RDS supports a number of sysctls in /proc/sys/net/rds.


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
      These constants haven't been assigned yet, because RDS isn't in
      mainline yet. Currently, the kernel module assigns these constants
      itself and publishes them to user space through two sysctl files:
          /proc/sys/net/rds/pf_rds
          /proc/sys/net/rds/sol_rds

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
      This creates a new, unbound RDS socket.

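      Since the constants are published through sysctl, a program can
      look them up at run time. A minimal sketch, assuming a helper
      named read_sysctl_int (ours) and abbreviated error handling:

          #include <stdio.h>
          #include <sys/socket.h>

          /* read one integer from a sysctl file */
          static int read_sysctl_int(const char *path)
          {
              FILE *f = fopen(path, "r");
              int val = -1;

              if (f) {
                  if (fscanf(f, "%d", &val) != 1)
                      val = -1;
                  fclose(f);
              }
              return val;
          }

          int main(void)
          {
              int pf_rds = read_sysctl_int("/proc/sys/net/rds/pf_rds");
              int fd = socket(pf_rds, SOCK_SEQPACKET, 0);

              if (fd < 0) {
                  perror("socket");
                  return 1;
              }
              /* fd is a new, unbound RDS socket */
              return 0;
          }
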
  setsockopt(SOL_SOCKET): send and receive buffer size
      RDS honors the send and receive buffer size socket options.
      You are not allowed to queue more than SO_SNDBUF bytes to
      a socket. A message is queued when sendmsg is called, and
      it leaves the queue when the remote system acknowledges
      its arrival.

      The SO_RCVBUF option controls the maximum receive queue length.
      This is a soft limit rather than a hard limit - RDS will
      continue to accept and queue incoming messages, even if that
      takes the queue length over the limit. However, it will also
      mark the port as "congested" and send a congestion update to
      the source node. The source node is supposed to throttle any
      processes sending to this congested port.

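      For example (a sketch; the 1MB figure is arbitrary):

          int bufsize = 1024 * 1024;

          /* bound the bytes queued by sendmsg */
          setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
          /* set the receive-queue congestion threshold */
          setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
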
  bind(fd, &sockaddr_in, ...)
      This binds the socket to a local IP address and port, and a
      transport.

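      A sketch of a typical bind; the address and port are illustrative,
      and the usual <string.h>, <netinet/in.h> and <arpa/inet.h> headers
      are assumed:

          struct sockaddr_in sin;

          memset(&sin, 0, sizeof(sin));
          sin.sin_family = AF_INET;            /* RDS uses IPv4 addressing */
          sin.sin_addr.s_addr = inet_addr("10.0.0.1");
          sin.sin_port = htons(4000);          /* RDS port space, not TCP/UDP */

          if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
              perror("bind");
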
  sendmsg(fd, ...)
      Sends a message to the indicated recipient. The kernel will
      transparently establish the underlying reliable connection
      if it isn't up yet.

      An attempt to send a message that exceeds SO_SNDBUF will
      fail with EMSGSIZE.

      An attempt to send a message that would take the total number
      of queued bytes over the SO_SNDBUF threshold will return
      EAGAIN.

      An attempt to send a message to a destination that is marked
      as "congested" will return ENOBUFS.

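      A sketch of a single send that maps these cases to errno values;
      the function name rds_send is ours, and <errno.h> plus the usual
      socket headers are assumed:

          static int rds_send(int fd, struct sockaddr_in *dest,
                              void *payload, size_t len)
          {
              struct iovec iov = { .iov_base = payload, .iov_len = len };
              struct msghdr msg = {
                  .msg_name    = dest,          /* datagram destination */
                  .msg_namelen = sizeof(*dest),
                  .msg_iov     = &iov,
                  .msg_iovlen  = 1,
              };

              if (sendmsg(fd, &msg, 0) < 0) {
                  /* EMSGSIZE: message larger than SO_SNDBUF
                   * EAGAIN:   queued bytes would exceed SO_SNDBUF
                   * ENOBUFS:  destination port is marked congested */
                  return -errno;
              }
              return 0;
          }
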
  recvmsg(fd, ...)
      Receives a message that was queued to this socket. The socket's
      recv queue accounting is adjusted, and if the queue length
      drops below SO_RCVBUF, the port is marked uncongested, and
      a congestion update is sent to all peers.

      Applications can ask the RDS kernel module to receive
      notifications via control messages (for instance, there is a
      notification when a congestion update arrives, or when an RDMA
      operation completes). These notifications are received through
      the msg.msg_control buffer of struct msghdr. The format of the
      messages is described in the manpages.

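      A sketch of a receive that leaves room for notifications; sol_rds
      is the value read from /proc/sys/net/rds/sol_rds, and the function
      name is ours:

          static ssize_t rds_recv(int fd, int sol_rds, void *buf, size_t len)
          {
              char cbuf[256];                /* space for control messages */
              struct sockaddr_in from;
              struct iovec iov = { .iov_base = buf, .iov_len = len };
              struct msghdr msg = {
                  .msg_name       = &from,
                  .msg_namelen    = sizeof(from),
                  .msg_iov        = &iov,
                  .msg_iovlen     = 1,
                  .msg_control    = cbuf,
                  .msg_controllen = sizeof(cbuf),
              };
              struct cmsghdr *cmsg;
              ssize_t n = recvmsg(fd, &msg, 0);

              if (n < 0)
                  return -errno;

              for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
                   cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                  if (cmsg->cmsg_level != sol_rds)
                      continue;
                  /* cmsg->cmsg_type identifies the notification, e.g. a
                   * congestion update or RDMA completion; see the manpages */
              }
              return n;
          }
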
  poll(fd)
      RDS supports the poll interface to allow the application
      to implement async I/O.

      POLLIN handling is pretty straightforward. When there's an
      incoming message queued to the socket, or a pending notification,
      we signal POLLIN.

      POLLOUT is a little harder. Since you can essentially send
      to any destination, RDS will always signal POLLOUT as long as
      there's room on the send queue (i.e. the number of bytes queued
      is less than the sendbuf size).

      However, the kernel will refuse to accept messages to
      a destination marked congested - in this case you will loop
      forever if you rely on poll to tell you what to do.
      This isn't a trivial problem, but applications can deal with
      it - by using congestion notifications, and by checking for
      ENOBUFS errors returned by sendmsg.

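      One way to cope, sketched below with the rds_send sketch from
      above and <poll.h>: after EAGAIN or ENOBUFS, wait in poll rather
      than spinning, since a congestion update for the destination
      shows up as an incoming event when notifications are enabled:

          for (;;) {
              if (rds_send(fd, &dest, buf, len) == 0)
                  break;                      /* sent */
              if (errno == EAGAIN || errno == ENOBUFS) {
                  struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };

                  poll(&pfd, 1, -1);          /* wait for send room or a
                                               * congestion notification */
                  continue;
              }
              break;                          /* hard error */
          }
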
  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
      This allows the application to discard all messages queued to a
      specific destination on this particular socket.

      In particular, the application can use this to cancel outstanding
      messages if it detects a timeout. For instance, if it tried to send
      a message, and the remote host is unreachable, RDS will keep trying
      forever. The application may decide it's not worth it, and cancel
      the operation. In this case, it would use RDS_CANCEL_SENT_TO to
      nuke any pending messages.

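      For example (a sketch; sol_rds and RDS_CANCEL_SENT_TO are published
      by the module as described above, dest is a struct sockaddr_in):

          /* give up on everything still queued to this unreachable peer */
          setsockopt(fd, sol_rds, RDS_CANCEL_SENT_TO, &dest, sizeof(dest));
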

RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:
        h_sequence:
            per-packet sequence number
        h_ack:
            piggybacked acknowledgment of the last packet received
        h_len:
            length of data, not including header
        h_sport:
            source port
        h_dport:
            destination port
        h_flags:
            CONG_BITMAP - this is a congestion update bitmap
            ACK_REQUIRED - receiver must ack this packet
            RETRANSMITTED - packet has previously been sent
        h_credit:
            indicates to the other end of the connection that
            it has more credits available (i.e. there is
            more send room)
        h_padding[4]:
            reserved for future use
        h_csum:
            header checksum
        h_exthdr:
            optional data can be passed here. This is currently used for
            passing RDMA-related information.

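    In C, the layout implied by this list looks roughly like the sketch
    below; rds.h is authoritative, and the size of the extension area is
    our assumption:

        struct rds_header {
            __be64  h_sequence;   /* per-packet sequence number */
            __be64  h_ack;        /* piggybacked ack */
            __be32  h_len;        /* payload length, header excluded */
            __be16  h_sport;      /* source port */
            __be16  h_dport;      /* destination port */
            u8      h_flags;      /* CONG_BITMAP, ACK_REQUIRED, RETRANSMITTED */
            u8      h_credit;     /* send credits newly granted to the peer */
            u8      h_padding[4]; /* reserved for future use */
            __sum16 h_csum;       /* header checksum */
            u8      h_exthdr[16]; /* optional extension data, e.g. RDMA info */
        };
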
  ACK and retransmit handling

    One might think that with reliable IB connections you wouldn't need
    to ack messages that have been received. The problem is that IB
    hardware generates an ack message before it has DMAed the message
    into memory. This creates a potential message loss if the HCA is
    disabled for any reason between when it sends the ack and before
    the message is DMAed and processed. This is only a potential issue
    if another HCA is available for fail-over.

    Sending an ack immediately would allow the sender to free the sent
    message from its send queue quickly, but could cause excessive
    traffic to be used for acks. RDS piggybacks acks on sent data
    packets. Ack-only packets are reduced by only allowing one to be
    in flight at a time, and by the sender only asking for acks when
    its send buffers start to fill up. All retransmissions are also
    acked.

  Flow Control

    RDS's IB transport uses a credit-based mechanism to verify that
    there is space in the peer's receive buffers for more data. This
    eliminates the need for hardware retries on the connection.

  Congestion

    Messages waiting in the receive queue on the receiving socket
    are accounted against the socket's SO_RCVBUF option value. Only
    the payload bytes in the message are accounted for. If the
    number of bytes queued equals or exceeds rcvbuf then the socket
    is congested. All sends attempted to this socket's address
    should either block or return EWOULDBLOCK.

    Applications are expected to be reasonably tuned such that this
    situation very rarely occurs. An application encountering this
    "back-pressure" is considered a bug.

    This is implemented by having each node maintain bitmaps which
    indicate which ports on bound addresses are congested. As the
    bitmap changes it is sent through all the connections which
    terminate in the local address of the bitmap which changed.

    The bitmaps are allocated as connections are brought up. This
    avoids allocation in the interrupt handling path which queues
    messages on sockets. The dense bitmaps let transports send the
    entire bitmap on any bitmap change reasonably efficiently. This
    is much easier to implement than some finer-grained
    communication of per-port congestion. The sender does a very
    inexpensive bit test to test if the port it's about to send to
    is congested or not.

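    The arithmetic is simple: one bit per port covers the entire 16-bit
    port space in an 8KB map. A sketch of the test (names are ours;
    <stdint.h> assumed):

        #define RDS_CONG_MAP_BYTES  (65536 / 8)   /* 8KB covers all ports */

        static inline int rds_port_congested(const unsigned char *map,
                                             uint16_t port)
        {
            return map[port / 8] & (1 << (port % 8));
        }
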

RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
      aka "rds_outgoing": the generic RDS layer copies data to
      be sent and sets header fields as needed, based on the socket API.
      This is then queued for the individual connection and sent by the
      connection's transport.
  struct rds_incoming
      a generic struct referring to incoming data that can be handed from
      the transport to the general code and queued by the general code
      while the socket is awakened. It is then passed back to the transport
      code to handle the actual copy-to-user.
  struct rds_socket
      per-socket information
  struct rds_connection
      per-connection information
  struct rds_transport
      pointers to transport-specific functions
  struct rds_statistics
      non-transport-specific statistics
  struct rds_cong_map
      wraps the raw congestion bitmap, contains rbnode, waitq, etc.

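  Conceptually, struct rds_transport is a table of operations the general
  code calls through. The sketch below is an illustrative subset with
  simplified signatures; the real set of hooks lives in rds.h:

      struct rds_transport {
          int  (*conn_connect)(struct rds_connection *conn);
          void (*conn_shutdown)(struct rds_connection *conn);
          int  (*xmit)(struct rds_connection *conn, struct rds_message *rm);
          int  (*recv)(struct rds_connection *conn);
          /* ... plus transport-specific setup and teardown hooks */
      };
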
Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
      struct rds_message built from incoming data
      CMSGs parsed (e.g. RDMA ops)
      transport connection allocated and connected if not already
      rds_message placed on send queue
      send worker awoken
  rds_send_worker()
      calls rds_send_xmit() until queue is empty
  rds_send_xmit()
      transmits congestion map if one is pending
      may set ACK_REQUIRED
      calls transport to send either non-RDMA or RDMA message
      (RDMA ops never retransmitted)
  rds_ib_xmit()
      allocates work requests from send ring
      adds any new send credits available to peer (h_credit)
      maps the rds_message's sg list
      piggybacks ack
      populates work requests
      posts send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
      looks at recv completions
      unmaps recv buffer from device
      if no errors, calls rds_ib_process_recv()
      refills recv ring
  rds_ib_process_recv()
      validates header checksum
      copies header to rds_ib_incoming struct if start of a new datagram
      adds to ibinc's fraglist
      if completed datagram:
          updates cong map if datagram was a cong update
          calls rds_recv_incoming() otherwise
          notes if ack is required
  rds_recv_incoming()
      drops duplicate packets
      responds to pings
      finds the sock associated with this datagram
      adds to sock queue
      wakes up sock
      does some congestion calculations
  rds_recvmsg()
      copies data into user iovec
      handles CMSGs
      returns to application