- relayfs - a high-speed data relay filesystem
- ============================================
- relayfs is a filesystem designed to provide an efficient mechanism for
- tools and facilities to relay large and potentially sustained streams
- of data from kernel space to user space.
- The main abstraction of relayfs is the 'channel'. A channel consists
- of a set of per-cpu kernel buffers each represented by a file in the
- relayfs filesystem. Kernel clients write into a channel using
- efficient write functions which automatically log to the current cpu's
- channel buffer. User space applications mmap() the per-cpu files and
- retrieve the data as it becomes available.
- The format of the data logged into the channel buffers is completely
- up to the relayfs client; relayfs does however provide hooks which
- allow clients to impose some structure on the buffer data. Nor does
- relayfs implement any form of data filtering - this also is left to
- the client. The purpose is to keep relayfs as simple as possible.
- This document provides an overview of the relayfs API. The details of
- the function parameters are documented along with the functions in the
- filesystem code - please see that for details.
- Semantics
- =========
- Each relayfs channel has one buffer per CPU, each buffer has one or
- more sub-buffers. Messages are written to the first sub-buffer until
- it is too full to contain a new message, in which case the message is
- written to the next (if available). Messages are never split across
- sub-buffers. At this point, userspace can be notified so it empties
- the first sub-buffer, while the kernel continues writing to the next.
- When notified that a sub-buffer is full, the kernel knows how many
- bytes of it are padding i.e. unused. Userspace can use this knowledge
- to copy only valid data.
- After copying it, userspace can notify the kernel that a sub-buffer
- has been consumed.
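The sub-buffer rules above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: struct toy_buf,
toy_write() and toy_valid_bytes() are hypothetical names, the sizes are
arbitrary, and the model covers neither wrap-around nor consumption. It
is not the relayfs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16   /* hypothetical sub-buffer size in bytes */
#define N_SUBBUFS   4    /* hypothetical number of sub-buffers */

/* Toy model of one per-cpu buffer; not the kernel's struct rchan_buf. */
struct toy_buf {
	unsigned char data[N_SUBBUFS][SUBBUF_SIZE];
	size_t padding[N_SUBBUFS]; /* unused bytes at the end of each full sub-buffer */
	size_t cur;                /* index of the sub-buffer being written */
	size_t offset;             /* write offset within the current sub-buffer */
};

/*
 * Append one message. A message that does not fit in the space left is
 * written whole to the next sub-buffer, and the space left behind is
 * recorded as padding. Returns 0 on success, -1 if it cannot be stored.
 */
static int toy_write(struct toy_buf *b, const void *msg, size_t len)
{
	if (len > SUBBUF_SIZE)
		return -1;              /* can never fit in one sub-buffer */
	if (b->offset + len > SUBBUF_SIZE) {
		if (b->cur + 1 == N_SUBBUFS)
			return -1;      /* no next sub-buffer in this model */
		b->padding[b->cur] = SUBBUF_SIZE - b->offset;
		b->cur++;
		b->offset = 0;
	}
	memcpy(&b->data[b->cur][b->offset], msg, len);
	b->offset += len;
	return 0;
}

/* Number of valid bytes a reader should copy out of a full sub-buffer. */
static size_t toy_valid_bytes(const struct toy_buf *b, size_t subbuf)
{
	assert(subbuf < N_SUBBUFS);
	return SUBBUF_SIZE - b->padding[subbuf];
}
```

Writing two 10-byte messages into a zeroed toy_buf leaves the first
message in sub-buffer 0 with 6 bytes of padding and places the second,
unsplit, at the start of sub-buffer 1.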
- relayfs can operate in a mode where it overwrites data not yet
- collected by userspace instead of waiting for userspace to consume it.
- relayfs itself does not provide a way to communicate buffer state
- (full sub-buffers, padding, consumption) between userspace and kernel,
- which keeps the kernel side simple and avoids imposing a single
- interface on userspace. It does provide a set of examples and a
- separate helper though, described below.
- klog and relay-apps example code
- ================================
- relayfs itself is ready to use, but to make things easier, a couple
- of simple utility functions and a set of examples are provided.
- The relay-apps example tarball, available on the relayfs sourceforge
- site, contains a set of self-contained examples, each consisting of a
- pair of .c files containing boilerplate code for each of the user and
- kernel sides of a relayfs application. Combined, these two sets of
- boilerplate code provide glue to easily stream data to disk, without
- having to bother with mundane housekeeping chores.
- The 'klog debugging functions' patch (klog.patch in the relay-apps
- tarball) provides a couple of high-level logging functions to the
- kernel which allow writing formatted text or raw data to a channel,
- regardless of whether a channel to write into exists or not, or
- whether relayfs is compiled into the kernel or is configured as a
- module. These functions allow you to put unconditional 'trace'
- statements anywhere in the kernel or kernel modules; only when there
- is a 'klog handler' registered will data actually be logged (see the
- klog and kleak examples for details).
- It is of course possible to use relayfs from scratch i.e. without
- using any of the relay-apps example code or klog, but you'll have to
- implement communication between userspace and kernel, allowing both to
- convey the state of buffers (full, empty, amount of padding).
- klog and the relay-apps examples can be found in the relay-apps
- tarball on http://relayfs.sourceforge.net
- The relayfs user space API
- ==========================
- relayfs implements basic file operations for user space access to
- relayfs channel buffer data. Here are the file operations that are
- available and some comments regarding their behavior:
- open()   enables user to open an _existing_ buffer.
- mmap()   results in channel buffer being mapped into the caller's
-          memory space. Note that you can't do a partial mmap - you must
-          map the entire file, which is NRBUF * SUBBUFSIZE.
- read()   read the contents of a channel buffer. The bytes read are
-          'consumed' by the reader i.e. they won't be available again
-          to subsequent reads. If the channel is being used in
-          no-overwrite mode (the default), it can be read at any time
-          even if there's an active kernel writer. If the channel is
-          being used in overwrite mode and there are active channel
-          writers, results may be unpredictable - users should make
-          sure that all logging to the channel has ended before using
-          read() with overwrite mode.
- poll()   POLLIN/POLLRDNORM/POLLERR supported. User applications are
-          notified when sub-buffer boundaries are crossed.
- close()  decrements the channel buffer's refcount. When the refcount
-          reaches 0 i.e. when no process or kernel client has the buffer
-          open, the channel buffer is freed.
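As an illustration of the mmap() rule above, a user-space reader might
map a per-cpu file along the following lines. This is a sketch, not
code from relay-apps: map_relay_buf() and subbuf_addr() are
hypothetical helpers, the choice of MAP_SHARED is an assumption, and
the caller is assumed to know the n_subbufs and subbuf_size values the
kernel client passed to relay_open().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map an entire per-cpu buffer file read-only. Partial mappings are not
 * allowed, so the length is always n_subbufs * subbuf_size.
 * Returns MAP_FAILED on error, like mmap() itself.
 */
static void *map_relay_buf(int fd, size_t n_subbufs, size_t subbuf_size)
{
	size_t len = n_subbufs * subbuf_size;

	return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}

/* Address of sub-buffer number idx inside a mapping made as above. */
static const unsigned char *subbuf_addr(const void *map, size_t subbuf_size,
					size_t idx)
{
	return (const unsigned char *)map + idx * subbuf_size;
}
```

The descriptor would come from open() on one of the per-cpu files under
the relayfs mount point, and the same descriptor can be passed to
poll() to wait for sub-buffer boundaries.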
- In order for a user application to make use of relayfs files, the
- relayfs filesystem must be mounted. For example,
-     mount -t relayfs relayfs /mnt/relay
- NOTE: relayfs doesn't need to be mounted for kernel clients to create
- or use channels - it only needs to be mounted when user space
- applications need access to the buffer data.
- The relayfs kernel API
- ======================
- Here's a summary of the API relayfs provides to in-kernel clients:
- channel management functions:
-     relay_open(base_filename, parent, subbuf_size, n_subbufs,
-                callbacks)
-     relay_close(chan)
-     relay_flush(chan)
-     relay_reset(chan)
-     relayfs_create_dir(name, parent)
-     relayfs_remove_dir(dentry)
-     relayfs_create_file(name, parent, mode, fops, data)
-     relayfs_remove_file(dentry)
- channel management typically called on instigation of userspace:
-     relay_subbufs_consumed(chan, cpu, subbufs_consumed)
- write functions:
-     relay_write(chan, data, length)
-     __relay_write(chan, data, length)
-     relay_reserve(chan, length)
- callbacks:
-     subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
-     buf_mapped(buf, filp)
-     buf_unmapped(buf, filp)
-     create_buf_file(filename, parent, mode, buf, is_global)
-     remove_buf_file(dentry)
- helper functions:
-     relay_buf_full(buf)
-     subbuf_start_reserve(buf, length)
- Creating a channel
- ------------------
- relay_open() is used to create a channel, along with its per-cpu
- channel buffers. Each channel buffer will have an associated file
- created for it in the relayfs filesystem, which can be opened and
- mmapped from user space if desired. The files are named
- basename0...basenameN-1 where N is the number of online cpus, and by
- default will be created in the root of the filesystem. If you want a
- directory structure to contain your relayfs files, you can create it
- with relayfs_create_dir() and pass the parent directory to
- relay_open(). Clients are responsible for cleaning up any directory
- structure they create when the channel is closed - use
- relayfs_remove_dir() for that.
- The total size of each per-cpu buffer is calculated by multiplying the
- number of sub-buffers by the sub-buffer size passed into relay_open().
- The idea behind sub-buffers is that they're basically an extension of
- double-buffering to N buffers, and they also allow applications to
- easily implement random-access-on-buffer-boundary schemes, which can
- be important for some high-volume applications. The number and size
- of sub-buffers is completely dependent on the application and even for
- the same application, different conditions will warrant different
- values for these parameters at different times. Typically, the right
- values to use are best decided after some experimentation; in general,
- though, it's safe to assume that having only 1 sub-buffer is a bad
- idea - you're guaranteed to either overwrite data or lose events
- depending on the channel mode being used.
- Channel 'modes'
- ---------------
- relayfs channels can be used in either of two modes - 'overwrite' or
- 'no-overwrite'. The mode is entirely determined by the implementation
- of the subbuf_start() callback, as described below. In 'overwrite'
- mode, also known as 'flight recorder' mode, writes continuously cycle
- around the buffer and will never fail, but will unconditionally
- overwrite old data regardless of whether it's actually been consumed.
- In no-overwrite mode, writes will fail, i.e. data will be lost, if the
- number of unconsumed sub-buffers equals the total number of
- sub-buffers in the channel. It should be clear that if there is no
- consumer or if the consumer can't consume sub-buffers fast enough,
- data will be lost in either case; the only difference is whether data
- is lost from the beginning or the end of a buffer.
- As explained above, a relayfs channel is made up of one or more
- per-cpu channel buffers, each implemented as a circular buffer
- subdivided into one or more sub-buffers. Messages are written into
- the current sub-buffer of the channel's current per-cpu buffer via the
- write functions described below. Whenever a message can't fit into
- the current sub-buffer, because there's no room left for it, the
- client is notified via the subbuf_start() callback that a switch to a
- new sub-buffer is about to occur. The client uses this callback to 1)
- initialize the next sub-buffer if appropriate 2) finalize the previous
- sub-buffer if appropriate and 3) return a boolean value indicating
- whether or not to actually go ahead with the sub-buffer switch.
- To implement 'no-overwrite' mode, the kernel client would provide
- an implementation of the subbuf_start() callback something like the
- following:
-     static int subbuf_start(struct rchan_buf *buf,
-                             void *subbuf,
-                             void *prev_subbuf,
-                             unsigned int prev_padding)
-     {
-             if (prev_subbuf)
-                     *((unsigned *)prev_subbuf) = prev_padding;
-             if (relay_buf_full(buf))
-                     return 0;
-             subbuf_start_reserve(buf, sizeof(unsigned int));
-             return 1;
-     }
- If the current buffer is full i.e. all sub-buffers remain unconsumed,
- the callback returns 0 to indicate that the buffer switch should not
- occur yet i.e. until the consumer has had a chance to read the current
- set of ready sub-buffers. For the relay_buf_full() function to make
- sense, the consumer is responsible for notifying relayfs when
- sub-buffers have been consumed via relay_subbufs_consumed(). Any
- subsequent attempts to write into the buffer will again invoke the
- subbuf_start() callback with the same parameters; only when the
- consumer has consumed one or more of the ready sub-buffers will
- relay_buf_full() return 0, in which case the buffer switch can
- continue.
- The implementation of the subbuf_start() callback for 'overwrite' mode
- would be very similar:
-     static int subbuf_start(struct rchan_buf *buf,
-                             void *subbuf,
-                             void *prev_subbuf,
-                             unsigned int prev_padding)
-     {
-             if (prev_subbuf)
-                     *((unsigned *)prev_subbuf) = prev_padding;
-             subbuf_start_reserve(buf, sizeof(unsigned int));
-             return 1;
-     }
- In this case, the relay_buf_full() check is meaningless and the
- callback always returns 1, causing the buffer switch to occur
- unconditionally. It's also meaningless for the client to use the
- relay_subbufs_consumed() function in this mode, as it's never
- consulted.
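The difference between the two modes can be sketched with a user-space
model of the sub-buffer accounting. Everything here is illustrative:
struct toy_state and the toy_*() functions are hypothetical names, and
the model only assumes what the text states, namely that a buffer is
full when the number of unconsumed sub-buffers equals the total number
of sub-buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for per-cpu buffer accounting; not struct rchan_buf. */
struct toy_state {
	size_t n_subbufs;
	size_t produced;  /* sub-buffers filled by the writer so far */
	size_t consumed;  /* sub-buffers the reader has reported consumed */
	int stalled;      /* last switch was refused and not yet retried ok */
};

/* Full when every sub-buffer holds unconsumed data. */
static int toy_buf_full(const struct toy_state *s)
{
	return s->produced - s->consumed >= s->n_subbufs;
}

/* 'no-overwrite' policy: refuse the switch while the buffer is full. */
static int toy_start_no_overwrite(const struct toy_state *s)
{
	return toy_buf_full(s) ? 0 : 1;
}

/* 'overwrite' policy: always allow the switch. */
static int toy_start_overwrite(const struct toy_state *s)
{
	(void)s;
	return 1;
}

/*
 * Called when the current sub-buffer cannot hold the next message.
 * Returns 1 if writing may continue in the next sub-buffer, 0 if the
 * message is dropped. A refused switch is retried on the next call
 * without counting the same sub-buffer twice.
 */
static int toy_switch(struct toy_state *s,
		      int (*start)(const struct toy_state *))
{
	if (!s->stalled)
		s->produced++;  /* the sub-buffer just filled is now ready */
	s->stalled = !start(s);
	return !s->stalled;
}

/* Reader side: report n sub-buffers as consumed. */
static void toy_subbufs_consumed(struct toy_state *s, size_t n)
{
	assert(s->consumed + n <= s->produced);
	s->consumed += n;
}
```

With two sub-buffers and the no-overwrite policy, the second switch is
refused until toy_subbufs_consumed() is called; with the overwrite
policy every switch succeeds and the consumed count is never consulted.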
- The default subbuf_start() implementation, used if the client doesn't
- define any callbacks, or doesn't define the subbuf_start() callback,
- implements the simplest possible 'no-overwrite' mode i.e. it does
- nothing but return 0.
- Header information can be reserved at the beginning of each sub-buffer
- by calling the subbuf_start_reserve() helper function from within the
- subbuf_start() callback. This reserved area can be used to store
- whatever information the client wants. In the example above, room is
- reserved in each sub-buffer to store the padding count for that
- sub-buffer. This is filled in for the previous sub-buffer in the
- subbuf_start() implementation; the padding value for the previous
- sub-buffer is passed into the subbuf_start() callback along with a
- pointer to the previous sub-buffer, since the padding value isn't
- known until a sub-buffer is filled. The subbuf_start() callback is
- also called for the first sub-buffer when the channel is opened, to
- give the client a chance to reserve space in it. In this case the
- previous sub-buffer pointer passed into the callback will be NULL, so
- the client should check the value of the prev_subbuf pointer before
- writing into the previous sub-buffer.
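On the consuming side, a reader of sub-buffers laid out as in the
example callbacks could locate the valid data roughly as follows. This
is a sketch that assumes exactly that layout (an unsigned int padding
count at the start of each sub-buffer, filled in once the sub-buffer is
complete); subbuf_payload() is a hypothetical helper, not part of
relayfs or relay-apps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Locate the payload of one completed sub-buffer laid out as:
 *   [unsigned int padding][messages ...][padding bytes]
 * Stores the start of the messages in *payload and returns their total
 * length. Returns 0 if the sub-buffer is empty or the header does not
 * make sense for the given subbuf_size.
 */
static size_t subbuf_payload(const unsigned char *subbuf, size_t subbuf_size,
			     const unsigned char **payload)
{
	unsigned int padding;

	if (subbuf_size < sizeof(padding))
		return 0;
	/* memcpy avoids assuming the mapping is suitably aligned */
	memcpy(&padding, subbuf, sizeof(padding));
	if (padding > subbuf_size - sizeof(padding))
		return 0;
	*payload = subbuf + sizeof(padding);
	return subbuf_size - sizeof(padding) - padding;
}
```

The reader copies only the returned number of bytes starting at
*payload and skips the rest of the sub-buffer.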
- Writing to a channel
- --------------------
- Kernel clients write data into the current cpu's channel buffer using
- relay_write() or __relay_write(). relay_write() is the main logging
- function - it uses local_irq_save() to protect the buffer and should be
- used if you might be logging from interrupt context. If you know
- you'll never be logging from interrupt context, you can use
- __relay_write(), which only disables preemption. These functions
- don't return a value, so you can't determine whether or not they
- failed - the assumption is that you wouldn't want to check a return
- value in the fast logging path anyway, and that they'll always succeed
- unless the buffer is full and no-overwrite mode is being used, in
- which case you can detect a failed write in the subbuf_start()
- callback by calling the relay_buf_full() helper function.
- relay_reserve() is used to reserve a slot in a channel buffer which
- can be written to later. This would typically be used in applications
- that need to write directly into a channel buffer without having to
- stage data in a temporary buffer beforehand. Because the actual write
- may not happen immediately after the slot is reserved, applications
- using relay_reserve() can keep a count of the number of bytes actually
- written, either in space reserved in the sub-buffers themselves or as
- a separate array. See the 'reserve' example in the relay-apps tarball
- at http://relayfs.sourceforge.net for an example of how this can be
- done. Because the write is under control of the client and is
- separated from the reserve, relay_reserve() doesn't protect the buffer
- at all - it's up to the client to provide the appropriate
- synchronization when using relay_reserve().
- Closing a channel
- -----------------
- The client calls relay_close() when it's finished using the channel.
- The channel and its associated buffers are destroyed when there are no
- longer any references to any of the channel buffers. relay_flush()
- forces a sub-buffer switch on all the channel buffers, and can be used
- to finalize and process the last sub-buffers before the channel is
- closed.
- Creating non-relay files
- ------------------------
- relay_open() automatically creates files in the relayfs filesystem to
- represent the per-cpu kernel buffers; it's often useful for
- applications to be able to create their own files alongside the relay
- files in the relayfs filesystem as well e.g. 'control' files much like
- those created in /proc or debugfs for similar purposes, used to
- communicate control information between the kernel and user sides of a
- relayfs application. For this purpose the relayfs_create_file() and
- relayfs_remove_file() API functions exist. For relayfs_create_file(),
- the caller passes in a set of user-defined file operations to be used
- for the file and an optional void * to a user-specified data item,
- which will be accessible via inode->u.generic_ip (see the relay-apps
- tarball for examples). The file_operations are a required parameter
- to relayfs_create_file() and thus the semantics of these files are
- completely defined by the caller.
- See the relay-apps tarball at http://relayfs.sourceforge.net for
- examples of how these non-relay files are meant to be used.
- Creating relay files in other filesystems
- -----------------------------------------
- By default of course, relay_open() creates relay files in the relayfs
- filesystem. Because relay_file_operations is exported, however, it's
- also possible to create and use relay files in other pseudo-filesystems
- such as debugfs.
- For this purpose, two callback functions are provided,
- create_buf_file() and remove_buf_file(). create_buf_file() is called
- once for each per-cpu buffer from relay_open() to allow the client to
- create a file to be used to represent the corresponding buffer; if
- this callback is not defined, the default implementation will create
- and return a file in the relayfs filesystem to represent the buffer.
- The callback should return the dentry of the file created to represent
- the relay buffer. Note that the parent directory passed to
- relay_open() (and passed along to the callback), if specified, must
- exist in the same filesystem the new relay file is created in. If
- create_buf_file() is defined, remove_buf_file() must also be defined;
- it's responsible for deleting the file(s) created in create_buf_file()
- and is called during relay_close().
- The create_buf_file() implementation can also be defined in such a way
- as to allow the creation of a single 'global' buffer instead of the
- default per-cpu set. This can be useful for applications interested
- mainly in seeing the relative ordering of system-wide events without
- the need to bother with saving explicit timestamps for the purpose of
- merging/sorting per-cpu files in a postprocessing step.
- To have relay_open() create a global buffer, the create_buf_file()
- implementation should set the value of the is_global outparam to a
- non-zero value in addition to creating the file that will be used to
- represent the single buffer. In the case of a global buffer,
- create_buf_file() and remove_buf_file() will be called only once. The
- normal channel-writing functions e.g. relay_write() can still be used
- - writes from any cpu will transparently end up in the global buffer -
- but since it is a global buffer, callers should make sure they use the
- proper locking for such a buffer, either by wrapping writes in a
- spinlock, or by copying a write function from relayfs_fs.h and
- creating a local version that internally does the proper locking.
- See the 'exported-relayfile' examples in the relay-apps tarball for
- examples of creating and using relay files in debugfs.
- Misc
- ----
- Some applications may want to keep a channel around and re-use it
- rather than open and close a new channel for each use. relay_reset()
- can be used for this purpose - it resets a channel to its initial
- state without reallocating channel buffer memory or destroying
- existing mappings. It should however only be called when it's safe to
- do so i.e. when the channel isn't currently being written to.
- Finally, there are a couple of utility callbacks that can be used for
- different purposes. buf_mapped() is called whenever a channel buffer
- is mmapped from user space and buf_unmapped() is called when it's
- unmapped. The client can use this notification to trigger actions
- within the kernel application, such as enabling/disabling logging to
- the channel.
- Resources
- =========
- For news, example code, mailing list, etc. see the relayfs homepage:
- http://relayfs.sourceforge.net
- Credits
- =======
- The ideas and specs for relayfs came about as a result of discussions
- on tracing involving the following:
- Michel Dagenais <michel.dagenais@polymtl.ca>
- Richard Moore <richardj_moore@uk.ibm.com>
- Bob Wisniewski <bob@watson.ibm.com>
- Karim Yaghmour <karim@opersys.com>
- Tom Zanussi <zanussi@us.ibm.com>
- Also thanks to Hubertus Franke for a lot of useful suggestions and bug
- reports.