|
@@ -0,0 +1,955 @@
|
|
|
+ Lockless Ring Buffer Design
|
|
|
+ ===========================
|
|
|
+
|
|
|
+Copyright 2009 Red Hat Inc.
|
|
|
+ Author: Steven Rostedt <srostedt@redhat.com>
|
|
|
+ License: The GNU Free Documentation License, Version 1.2
|
|
|
+ (dual licensed under the GPL v2)
|
|
|
+Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
|
|
|
+ and Frederic Weisbecker.
|
|
|
+
|
|
|
+
|
|
|
+Written for: 2.6.31
|
|
|
+
|
|
|
+Terminology used in this Document
|
|
|
+---------------------------------
|
|
|
+
|
|
|
+tail - where new writes happen in the ring buffer.
|
|
|
+
|
|
|
+head - where new reads happen in the ring buffer.
|
|
|
+
|
|
|
+producer - the task that writes into the ring buffer (same as writer)
|
|
|
+
|
|
|
+writer - same as producer
|
|
|
+
|
|
|
+consumer - the task that reads from the buffer (same as reader)
|
|
|
+
|
|
|
+reader - same as consumer.
|
|
|
+
|
|
|
+reader_page - A page outside the ring buffer used solely (for the most part)
|
|
|
+ by the reader.
|
|
|
+
|
|
|
+head_page - a pointer to the page that the reader will use next
|
|
|
+
|
|
|
+tail_page - a pointer to the page that will be written to next
|
|
|
+
|
|
|
+commit_page - a pointer to the page with the last finished non nested write.
|
|
|
+
|
|
|
+cmpxchg - hardware assisted atomic transaction that performs the following:
|
|
|
+
|
|
|
+ A = B iff previous A == C
|
|
|
+
|
|
|
+ R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
|
|
|
+ current A is equal to C, and we put the old (current) A into R
|
|
|
+
|
|
|
+ R gets the previous A regardless if A is updated with B or not.
|
|
|
+
|
|
|
+ To see if the update was successful a compare of R == C may be used.
|
|
|
+
|
|
|
+The Generic Ring Buffer
|
|
|
+-----------------------
|
|
|
+
|
|
|
+The ring buffer can be used in either an overwrite mode or in
|
|
|
+producer/consumer mode.
|
|
|
+
|
|
|
+Producer/consumer mode is where the producer were to fill up the
|
|
|
+buffer before the consumer could free up anything, the producer
|
|
|
+will stop writing to the buffer. This will lose most recent events.
|
|
|
+
|
|
|
+Overwrite mode is where the produce were to fill up the buffer
|
|
|
+before the consumer could free up anything, the producer will
|
|
|
+overwrite the older data. This will lose the oldest events.
|
|
|
+
|
|
|
+No two writers can write at the same time (on the same per cpu buffer),
|
|
|
+but a writer may interrupt another writer, but it must finish writing
|
|
|
+before the previous writer may continue. This is very important to the
|
|
|
+algorithm. The writers act like a "stack". The way interrupts works
|
|
|
+enforces this behavior.
|
|
|
+
|
|
|
+
|
|
|
+ writer1 start
|
|
|
+ <preempted> writer2 start
|
|
|
+ <preempted> writer3 start
|
|
|
+ writer3 finishes
|
|
|
+ writer2 finishes
|
|
|
+ writer1 finishes
|
|
|
+
|
|
|
+This is very much like a writer being preempted by an interrupt and
|
|
|
+the interrupt doing a write as well.
|
|
|
+
|
|
|
+Readers can happen at any time. But no two readers may run at the
|
|
|
+same time, nor can a reader preempt/interrupt another reader. A reader
|
|
|
+can not preempt/interrupt a writer, but it may read/consume from the
|
|
|
+buffer at the same time as a writer is writing, but the reader must be
|
|
|
+on another processor to do so. A reader may read on its own processor
|
|
|
+and can be preempted by a writer.
|
|
|
+
|
|
|
+A writer can preempt a reader, but a reader can not preempt a writer.
|
|
|
+But a reader can read the buffer at the same time (on another processor)
|
|
|
+as a writer.
|
|
|
+
|
|
|
+The ring buffer is made up of a list of pages held together by a link list.
|
|
|
+
|
|
|
+At initialization a reader page is allocated for the reader that is not
|
|
|
+part of the ring buffer.
|
|
|
+
|
|
|
+The head_page, tail_page and commit_page are all initialized to point
|
|
|
+to the same page.
|
|
|
+
|
|
|
+The reader page is initialized to have its next pointer pointing to
|
|
|
+the head page, and its previous pointer pointing to a page before
|
|
|
+the head page.
|
|
|
+
|
|
|
+The reader has its own page to use. At start up time, this page is
|
|
|
+allocated but is not attached to the list. When the reader wants
|
|
|
+to read from the buffer, if its page is empty (like it is on start up)
|
|
|
+it will swap its page with the head_page. The old reader page will
|
|
|
+become part of the ring buffer and the head_page will be removed.
|
|
|
+The page after the inserted page (old reader_page) will become the
|
|
|
+new head page.
|
|
|
+
|
|
|
+Once the new page is given to the reader, the reader could do what
|
|
|
+it wants with it, as long as a writer has left that page.
|
|
|
+
|
|
|
+A sample of how the reader page is swapped: Note this does not
|
|
|
+show the head page in the buffer, it is for demonstrating a swap
|
|
|
+only.
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |
|
|
|
+ +------+
|
|
|
+ +---+ +---+ +---+
|
|
|
+ | |-->| |-->| |
|
|
|
+ | |<--| |<--| |
|
|
|
+ +---+ +---+ +---+
|
|
|
+ ^ | ^ |
|
|
|
+ | +-------------+ |
|
|
|
+ +-----------------+
|
|
|
+
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |-------------------+
|
|
|
+ +------+ v
|
|
|
+ | +---+ +---+ +---+
|
|
|
+ | | |-->| |-->| |
|
|
|
+ | | |<--| |<--| |<-+
|
|
|
+ | +---+ +---+ +---+ |
|
|
|
+ | ^ | ^ | |
|
|
|
+ | | +-------------+ | |
|
|
|
+ | +-----------------+ |
|
|
|
+ +------------------------------------+
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |-------------------+
|
|
|
+ +------+ <---------------+ v
|
|
|
+ | ^ +---+ +---+ +---+
|
|
|
+ | | | |-->| |-->| |
|
|
|
+ | | | | | |<--| |<-+
|
|
|
+ | | +---+ +---+ +---+ |
|
|
|
+ | | | ^ | |
|
|
|
+ | | +-------------+ | |
|
|
|
+ | +-----------------------------+ |
|
|
|
+ +------------------------------------+
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |buffer| RING BUFFER
|
|
|
+ |page |-------------------+
|
|
|
+ +------+ <---------------+ v
|
|
|
+ | ^ +---+ +---+ +---+
|
|
|
+ | | | | | |-->| |
|
|
|
+ | | New | | | |<--| |<-+
|
|
|
+ | | Reader +---+ +---+ +---+ |
|
|
|
+ | | page ----^ | |
|
|
|
+ | | | |
|
|
|
+ | +-----------------------------+ |
|
|
|
+ +------------------------------------+
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+It is possible that the page swapped is the commit page and the tail page,
|
|
|
+if what is in the ring buffer is less than what is held in a buffer page.
|
|
|
+
|
|
|
+
|
|
|
+ reader page commit page tail page
|
|
|
+ | | |
|
|
|
+ v | |
|
|
|
+ +---+ | |
|
|
|
+ | |<----------+ |
|
|
|
+ | |<------------------------+
|
|
|
+ | |------+
|
|
|
+ +---+ |
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+This case is still valid for this algorithm.
|
|
|
+When the writer leaves the page, it simply goes into the ring buffer
|
|
|
+since the reader page still points to the next location in the ring
|
|
|
+buffer.
|
|
|
+
|
|
|
+
|
|
|
+The main pointers:
|
|
|
+
|
|
|
+ reader page - The page used solely by the reader and is not part
|
|
|
+ of the ring buffer (may be swapped in)
|
|
|
+
|
|
|
+ head page - the next page in the ring buffer that will be swapped
|
|
|
+ with the reader page.
|
|
|
+
|
|
|
+ tail page - the page where the next write will take place.
|
|
|
+
|
|
|
+ commit page - the page that last finished a write.
|
|
|
+
|
|
|
+The commit page only is updated by the outer most writer in the
|
|
|
+writer stack. A writer that preempts another writer will not move the
|
|
|
+commit page.
|
|
|
+
|
|
|
+When data is written into the ring buffer, a position is reserved
|
|
|
+in the ring buffer and passed back to the writer. When the writer
|
|
|
+is finished writing data into that position, it commits the write.
|
|
|
+
|
|
|
+Another write (or a read) may take place at anytime during this
|
|
|
+transaction. If another write happens it must finish before continuing
|
|
|
+with the previous write.
|
|
|
+
|
|
|
+
|
|
|
+ Write reserve:
|
|
|
+
|
|
|
+ Buffer page
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+ <--- given back to writer (current commit)
|
|
|
+ |reserved |
|
|
|
+ +---------+ <--- tail pointer
|
|
|
+ | empty |
|
|
|
+ +---------+
|
|
|
+
|
|
|
+ Write commit:
|
|
|
+
|
|
|
+ Buffer page
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+ <--- next positon for write (current commit)
|
|
|
+ | empty |
|
|
|
+ +---------+
|
|
|
+
|
|
|
+
|
|
|
+ If a write happens after the first reserve:
|
|
|
+
|
|
|
+ Buffer page
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+ <-- current commit
|
|
|
+ |reserved |
|
|
|
+ +---------+ <--- given back to second writer
|
|
|
+ |reserved |
|
|
|
+ +---------+ <--- tail pointer
|
|
|
+
|
|
|
+ After second writer commits:
|
|
|
+
|
|
|
+
|
|
|
+ Buffer page
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+ <--(last full commit)
|
|
|
+ |reserved |
|
|
|
+ +---------+
|
|
|
+ |pending |
|
|
|
+ |commit |
|
|
|
+ +---------+ <--- tail pointer
|
|
|
+
|
|
|
+ When the first writer commits:
|
|
|
+
|
|
|
+ Buffer page
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+
|
|
|
+ |written |
|
|
|
+ +---------+ <--(last full commit and tail pointer)
|
|
|
+
|
|
|
+
|
|
|
+The commit pointer points to the last write location that was
|
|
|
+committed without preempting another write. When a write that
|
|
|
+preempted another write is committed, it only becomes a pending commit
|
|
|
+and will not be a full commit till all writes have been committed.
|
|
|
+
|
|
|
+The commit page points to the page that has the last full commit.
|
|
|
+The tail page points to the page with the last write (before
|
|
|
+committing).
|
|
|
+
|
|
|
+The tail page is always equal to or after the commit page. It may
|
|
|
+be several pages ahead. If the tail page catches up to the commit
|
|
|
+page then no more writes may take place (regardless of the mode
|
|
|
+of the ring buffer: overwrite and produce/consumer).
|
|
|
+
|
|
|
+The order of pages are:
|
|
|
+
|
|
|
+ head page
|
|
|
+ commit page
|
|
|
+ tail page
|
|
|
+
|
|
|
+Possible scenario:
|
|
|
+ tail page
|
|
|
+ head page commit page |
|
|
|
+ | | |
|
|
|
+ v v v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+There is a special case that the head page is after either the commit page
|
|
|
+and possibly the tail page. That is when the commit (and tail) page has been
|
|
|
+swapped with the reader page. This is because the head page is always
|
|
|
+part of the ring buffer, but the reader page is not. When ever there
|
|
|
+has been less than a full page that has been committed inside the ring buffer,
|
|
|
+and a reader swaps out a page, it will be swapping out the commit page.
|
|
|
+
|
|
|
+
|
|
|
+ reader page commit page tail page
|
|
|
+ | | |
|
|
|
+ v | |
|
|
|
+ +---+ | |
|
|
|
+ | |<----------+ |
|
|
|
+ | |<------------------------+
|
|
|
+ | |------+
|
|
|
+ +---+ |
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+ ^
|
|
|
+ |
|
|
|
+ head page
|
|
|
+
|
|
|
+
|
|
|
+In this case, the head page will not move when the tail and commit
|
|
|
+move back into the ring buffer.
|
|
|
+
|
|
|
+The reader can not swap a page into the ring buffer if the commit page
|
|
|
+is still on that page. If the read meets the last commit (real commit
|
|
|
+not pending or reserved), then there is nothing more to read.
|
|
|
+The buffer is considered empty until another full commit finishes.
|
|
|
+
|
|
|
+When the tail meets the head page, if the buffer is in overwrite mode,
|
|
|
+the head page will be pushed ahead one. If the buffer is in producer/consumer
|
|
|
+mode, the write will fail.
|
|
|
+
|
|
|
+Overwrite mode:
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+ ^
|
|
|
+ |
|
|
|
+ head page
|
|
|
+
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+ ^
|
|
|
+ |
|
|
|
+ head page
|
|
|
+
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+ ^
|
|
|
+ |
|
|
|
+ head page
|
|
|
+
|
|
|
+Note, the reader page will still point to the previous head page.
|
|
|
+But when a swap takes place, it will use the most recent head page.
|
|
|
+
|
|
|
+
|
|
|
+Making the Ring Buffer Lockless:
|
|
|
+--------------------------------
|
|
|
+
|
|
|
+The main idea behind the lockless algorithm is to combine the moving
|
|
|
+of the head_page pointer with the swapping of pages with the reader.
|
|
|
+State flags are placed inside the pointer to the page. To do this,
|
|
|
+each page must be aligned in memory by 4 bytes. This will allow the 2
|
|
|
+least significant bits of the address to be used as flags. Since
|
|
|
+they will always be zero for the address. To get the address,
|
|
|
+simply mask out the flags.
|
|
|
+
|
|
|
+ MASK = ~3
|
|
|
+
|
|
|
+ address & MASK
|
|
|
+
|
|
|
+Two flags will be kept by these two bits:
|
|
|
+
|
|
|
+ HEADER - the page being pointed to is a head page
|
|
|
+
|
|
|
+ UPDATE - the page being pointed to is being updated by a writer
|
|
|
+ and was or is about to be a head page.
|
|
|
+
|
|
|
+
|
|
|
+ reader page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+
|
|
|
+ | |------+
|
|
|
+ +---+ |
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-H->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+
|
|
|
+The above pointer "-H->" would have the HEADER flag set. That is
|
|
|
+the next page is the next page to be swapped out by the reader.
|
|
|
+This pointer means the next page is the head page.
|
|
|
+
|
|
|
+When the tail page meets the head pointer, it will use cmpxchg to
|
|
|
+change the pointer to the UPDATE state:
|
|
|
+
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-H->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+"-U->" represents a pointer in the UPDATE state.
|
|
|
+
|
|
|
+Any access to the reader will need to take some sort of lock to serialize
|
|
|
+the readers. But the writers will never take a lock to write to the
|
|
|
+ring buffer. This means we only need to worry about a single reader,
|
|
|
+and writes only preempt in "stack" formation.
|
|
|
+
|
|
|
+When the reader tries to swap the page with the ring buffer, it
|
|
|
+will also use cmpxchg. If the flag bit in the pointer to the
|
|
|
+head page does not have the HEADER flag set, the compare will fail
|
|
|
+and the reader will need to look for the new head page and try again.
|
|
|
+Note, the flag UPDATE and HEADER are never set at the same time.
|
|
|
+
|
|
|
+The reader swaps the reader page as follows:
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |
|
|
|
+ +------+
|
|
|
+ +---+ +---+ +---+
|
|
|
+ | |--->| |--->| |
|
|
|
+ | |<---| |<---| |
|
|
|
+ +---+ +---+ +---+
|
|
|
+ ^ | ^ |
|
|
|
+ | +---------------+ |
|
|
|
+ +-----H-------------+
|
|
|
+
|
|
|
+The reader sets the reader page next pointer as HEADER to the page after
|
|
|
+the head page.
|
|
|
+
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |-------H-----------+
|
|
|
+ +------+ v
|
|
|
+ | +---+ +---+ +---+
|
|
|
+ | | |--->| |--->| |
|
|
|
+ | | |<---| |<---| |<-+
|
|
|
+ | +---+ +---+ +---+ |
|
|
|
+ | ^ | ^ | |
|
|
|
+ | | +---------------+ | |
|
|
|
+ | +-----H-------------+ |
|
|
|
+ +--------------------------------------+
|
|
|
+
|
|
|
+It does a cmpxchg with the pointer to the previous head page to make it
|
|
|
+point to the reader page. Note that the new pointer does not have the HEADER
|
|
|
+flag set. This action atomically moves the head page forward.
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |-------H-----------+
|
|
|
+ +------+ v
|
|
|
+ | ^ +---+ +---+ +---+
|
|
|
+ | | | |-->| |-->| |
|
|
|
+ | | | |<--| |<--| |<-+
|
|
|
+ | | +---+ +---+ +---+ |
|
|
|
+ | | | ^ | |
|
|
|
+ | | +-------------+ | |
|
|
|
+ | +-----------------------------+ |
|
|
|
+ +------------------------------------+
|
|
|
+
|
|
|
+After the new head page is set, the previous pointer of the head page is
|
|
|
+updated to the reader page.
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |reader| RING BUFFER
|
|
|
+ |page |-------H-----------+
|
|
|
+ +------+ <---------------+ v
|
|
|
+ | ^ +---+ +---+ +---+
|
|
|
+ | | | |-->| |-->| |
|
|
|
+ | | | | | |<--| |<-+
|
|
|
+ | | +---+ +---+ +---+ |
|
|
|
+ | | | ^ | |
|
|
|
+ | | +-------------+ | |
|
|
|
+ | +-----------------------------+ |
|
|
|
+ +------------------------------------+
|
|
|
+
|
|
|
+ +------+
|
|
|
+ |buffer| RING BUFFER
|
|
|
+ |page |-------H-----------+ <--- New head page
|
|
|
+ +------+ <---------------+ v
|
|
|
+ | ^ +---+ +---+ +---+
|
|
|
+ | | | | | |-->| |
|
|
|
+ | | New | | | |<--| |<-+
|
|
|
+ | | Reader +---+ +---+ +---+ |
|
|
|
+ | | page ----^ | |
|
|
|
+ | | | |
|
|
|
+ | +-----------------------------+ |
|
|
|
+ +------------------------------------+
|
|
|
+
|
|
|
+Another important point. The page that the reader page points back to
|
|
|
+by its previous pointer (the one that now points to the new head page)
|
|
|
+never points back to the reader page. That is because the reader page is
|
|
|
+not part of the ring buffer. Traversing the ring buffer via the next pointers
|
|
|
+will always stay in the ring buffer. Traversing the ring buffer via the
|
|
|
+prev pointers may not.
|
|
|
+
|
|
|
+Note, the way to determine a reader page is simply by examining the previous
|
|
|
+pointer of the page. If the next pointer of the previous page does not
|
|
|
+point back to the original page, then the original page is a reader page:
|
|
|
+
|
|
|
+
|
|
|
+ +--------+
|
|
|
+ | reader | next +----+
|
|
|
+ | page |-------->| |<====== (buffer page)
|
|
|
+ +--------+ +----+
|
|
|
+ | | ^
|
|
|
+ | v | next
|
|
|
+ prev | +----+
|
|
|
+ +------------->| |
|
|
|
+ +----+
|
|
|
+
|
|
|
+The way the head page moves forward:
|
|
|
+
|
|
|
+When the tail page meets the head page and the buffer is in overwrite mode
|
|
|
+and more writes take place, the head page must be moved forward before the
|
|
|
+writer may move the tail page. The way this is done is that the writer
|
|
|
+performs a cmpxchg to convert the pointer to the head page from the HEADER
|
|
|
+flag to have the UPDATE flag set. Once this is done, the reader will
|
|
|
+not be able to swap the head page from the buffer, nor will it be able to
|
|
|
+move the head page, until the writer is finished with the move.
|
|
|
+
|
|
|
+This eliminates any races that the reader can have on the writer. The reader
|
|
|
+must spin, and this is why the reader can not preempt the writer.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-H->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The following page will be made into the new head page.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+After the new head page has been set, we can set the old head page
|
|
|
+pointer back to NORMAL.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+After the head page has been moved, the tail page may now move forward.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+
|
|
|
+The above are the trivial updates. Now for the more complex scenarios.
|
|
|
+
|
|
|
+
|
|
|
+As stated before, if enough writes preempt the first write, the
|
|
|
+tail page may make it all the way around the buffer and meet the commit
|
|
|
+page. At this time, we must start dropping writes (usually with some kind
|
|
|
+of warning to the user). But what happens if the commit was still on the
|
|
|
+reader page? The commit page is not part of the ring buffer. The tail page
|
|
|
+must account for this.
|
|
|
+
|
|
|
+
|
|
|
+ reader page commit page
|
|
|
+ | |
|
|
|
+ v |
|
|
|
+ +---+ |
|
|
|
+ | |<----------+
|
|
|
+ | |
|
|
|
+ | |------+
|
|
|
+ +---+ |
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-H->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+ ^
|
|
|
+ |
|
|
|
+ tail page
|
|
|
+
|
|
|
+If the tail page were to simply push the head page forward, the commit when
|
|
|
+leaving the reader page would not be pointing to the correct page.
|
|
|
+
|
|
|
+The solution to this is to test if the commit page is on the reader page
|
|
|
+before pushing the head page. If it is, then it can be assumed that the
|
|
|
+tail page wrapped the buffer, and we must drop new writes.
|
|
|
+
|
|
|
+This is not a race condition, because the commit page can only be moved
|
|
|
+by the outter most writer (the writer that was preempted).
|
|
|
+This means that the commit will not move while a writer is moving the
|
|
|
+tail page. The reader can not swap the reader page if it is also being
|
|
|
+used as the commit page. The reader can simply check that the commit
|
|
|
+is off the reader page. Once the commit page leaves the reader page
|
|
|
+it will never go back on it unless a reader does another swap with the
|
|
|
+buffer page that is also the commit page.
|
|
|
+
|
|
|
+
|
|
|
+Nested writes
|
|
|
+-------------
|
|
|
+
|
|
|
+In the pushing forward of the tail page we must first push forward
|
|
|
+the head page if the head page is the next page. If the head page
|
|
|
+is not the next page, the tail page is simply updated with a cmpxchg.
|
|
|
+
|
|
|
+Only writers move the tail page. This must be done atomically to protect
|
|
|
+against nested writers.
|
|
|
+
|
|
|
+ temp_page = tail_page
|
|
|
+ next_page = temp_page->next
|
|
|
+ cmpxchg(tail_page, temp_page, next_page)
|
|
|
+
|
|
|
+The above will update the tail page if it is still pointing to the expected
|
|
|
+page. If this fails, a nested write pushed it forward, the the current write
|
|
|
+does not need to push it.
|
|
|
+
|
|
|
+
|
|
|
+ temp page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+Nested write comes in and moves the tail page forward:
|
|
|
+
|
|
|
+ tail page (moved by nested writer)
|
|
|
+ temp page |
|
|
|
+ | |
|
|
|
+ v v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The above would fail the cmpxchg, but since the tail page has already
|
|
|
+been moved forward, the writer will just try again to reserve storage
|
|
|
+on the new tail page.
|
|
|
+
|
|
|
+But the moving of the head page is a bit more complex.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-H->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The write converts the head page pointer to UPDATE.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+But if a nested writer preempts here. It will see that the next
|
|
|
+page is a head page, but it is also nested. It will detect that
|
|
|
+it is nested and will save that information. The detection is the
|
|
|
+fact that it sees the UPDATE flag instead of a HEADER or NORMAL
|
|
|
+pointer.
|
|
|
+
|
|
|
+The nested writer will set the new head page pointer.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+But it will not reset the update back to normal. Only the writer
|
|
|
+that converted a pointer from HEAD to UPDATE will convert it back
|
|
|
+to NORMAL.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+After the nested writer finishes, the outer most writer will convert
|
|
|
+the UPDATE pointer to NORMAL.
|
|
|
+
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+
|
|
|
+It can be even more complex if several nested writes came in and moved
|
|
|
+the tail page ahead several pages:
|
|
|
+
|
|
|
+
|
|
|
+(first writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-H->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The write converts the head page pointer to UPDATE.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+Next writer comes in, and sees the update and sets up the new
|
|
|
+head page.
|
|
|
+
|
|
|
+(second writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The nested writer moves the tail page forward. But does not set the old
|
|
|
+update page to NORMAL because it is not the outer most writer.
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+Another writer preempts and sees the page after the tail page is a head page.
|
|
|
+It changes it from HEAD to UPDATE.
|
|
|
+
|
|
|
+(third writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-U->| |--->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The writer will move the head page forward:
|
|
|
+
|
|
|
+
|
|
|
+(third writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-U->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+But now that the third writer did change the HEAD flag to UPDATE it
|
|
|
+will convert it to normal:
|
|
|
+
|
|
|
+
|
|
|
+(third writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+
|
|
|
+Then it will move the tail page, and return back to the second writer.
|
|
|
+
|
|
|
+
|
|
|
+(second writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+
|
|
|
+The second writer will fail to move the tail page because it was already
|
|
|
+moved, so it will try again and add its data to the new tail page.
|
|
|
+It will return to the first writer.
|
|
|
+
|
|
|
+
|
|
|
+(first writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+The first writer can not know atomically test if the tail page moved
|
|
|
+while it updates the HEAD page. It will then update the head page to
|
|
|
+what it thinks is the new head page.
|
|
|
+
|
|
|
+
|
|
|
+(first writer)
|
|
|
+
|
|
|
+ tail page
|
|
|
+ |
|
|
|
+ v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+Since the cmpxchg returns the old value of the pointer the first writer
|
|
|
+will see it succeeded in updating the pointer from NORMAL to HEAD.
|
|
|
+But as we can see, this is not good enough. It must also check to see
|
|
|
+if the tail page is either where it use to be or on the next page:
|
|
|
+
|
|
|
+
|
|
|
+(first writer)
|
|
|
+
|
|
|
+ A B tail page
|
|
|
+ | | |
|
|
|
+ v v v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |-H->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+If tail page != A and tail page does not equal B, then it must reset the
|
|
|
+pointer back to NORMAL. The fact that it only needs to worry about
|
|
|
+nested writers, it only needs to check this after setting the HEAD page.
|
|
|
+
|
|
|
+
|
|
|
+(first writer)
|
|
|
+
|
|
|
+ A B tail page
|
|
|
+ | | |
|
|
|
+ v v v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |-U->| |--->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|
|
|
+Now the writer can update the head page. This is also why the head page must
|
|
|
+remain in UPDATE and only reset by the outer most writer. This prevents
|
|
|
+the reader from seeing the incorrect head page.
|
|
|
+
|
|
|
+
|
|
|
+(first writer)
|
|
|
+
|
|
|
+ A B tail page
|
|
|
+ | | |
|
|
|
+ v v v
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+<---| |--->| |--->| |--->| |-H->
|
|
|
+--->| |<---| |<---| |<---| |<---
|
|
|
+ +---+ +---+ +---+ +---+
|
|
|
+
|