[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.00.0906122203100.16299@gandalf.stny.rr.com>
Date: Fri, 12 Jun 2009 22:16:45 -0400 (EDT)
From: Steven Rostedt <rostedt@...dmis.org>
To: Frederic Weisbecker <fweisbec@...il.com>
cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Theodore Tso <tytso@....edu>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Mathieu Desnoyers <compudj@...stal.dyndns.org>,
Lai Jiangshan <laijs@...fujitsu.com>,
"Martin J. Bligh" <mbligh@...igh.org>,
Christoph Hellwig <hch@...radead.org>,
Li Zefan <lizf@...fujitsu.com>,
Huang Ying <ying.huang@...el.com>,
"H. Peter Anvin" <hpa@...or.com>,
Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
Masami Hiramatsu <mhiramat@...hat.com>
Subject: Re: [PATCH 3/3] ring-buffer: add design document
On Sat, 13 Jun 2009, Frederic Weisbecker wrote:
> On Wed, Jun 10, 2009 at 03:53:14PM -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <srostedt@...hat.com>
> >
> > This adds the design document for the ring buffer and also
> > explains how it is designed to have lockless writes.
> >
> > Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> > ---
> > Documentation/trace/ring-buffer-design.txt | 949 ++++++++++++++++++++++++++++
> > 1 files changed, 949 insertions(+), 0 deletions(-)
> > create mode 100644 Documentation/trace/ring-buffer-design.txt
> >
> > diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
> > new file mode 100644
> > index 0000000..cca290b
> > --- /dev/null
> > +++ b/Documentation/trace/ring-buffer-design.txt
> > @@ -0,0 +1,949 @@
> > + Lockless Ring Buffer Design
> > + ===========================
> > +
> > +Copyright 2009 Red Hat Inc.
> > + Author: Steven Rostedt <srostedt@...hat.com>
> > + License: The GNU Free Documentation License, Version 1.2
> > + (dual licensed under the GPL v2)
> > +
> > +Written for: 2.6.31
> > +
> > +Terminology used in this Document
> > +---------------------------------
> > +
> > +tail - where new writes happen in the ring buffer.
> > +
> > +head - where new reads happen in the ring buffer.
> > +
> > +producer - the task that writes into the ring buffer (same as writer)
> > +
> > +writer - same as producer
> > +
> > +consumer - the task that reads from the buffer (same as reader)
> > +
> > +reader - same as consumer.
> > +
> > +reader_page - A page outside the ring buffer used solely (for the most part)
> > + by the reader.
> > +
> > +head_page - a pointer to the page that the reader will use next
> > +
> > +tail_page - a pointer to the page that will be written to next
> > +
> > +commit_page - a pointer to the page with the last finished non nested write.
> > +
> > +cmpxchg - hardware assisted atomic transaction that performs the following:
> > +
> > + A = B iff previous A == C
> > +
> > + R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
> > + current A is equal to C, and we put the old (current) A into R
> > +
> > + R gets the previous A regardless if A is updated with B or not.
> > +
> > + To see if the update was successful a compare of R == C may be used.
> > +
> > +The Generic Ring Buffer
> > +-----------------------
> > +
> > +The ring buffer can be used in either an overwrite mode or in
> > +producer/consumer mode.
> > +
> > +Producer/consumer mode is where the producer were to fill up the
> > +buffer before the consumer could free up anything, the producer
> > +will stop writing to the buffer. This will lose most recent events.
> > +
> > +Overwrite mode is where the produce were to fill up the buffer
> > +before the consumer could free up anything, the producer will
> > +overwrite the older data. This will lose the oldest events.
> > +
> > +No two writers can write at the same time (on the same per cpu buffer),
> > +but a writer may preempt another writer, but it must finish writing
> > +before the previous writer may continue. This is very important to the
> > +algorithm. The writers act like a "stack".
> > +
> > +
> > + writer1 start
> > + <preempted> writer2 start
> > + <preempted> writer3 start
> > + writer3 finishes
> > + writer2 finishes
> > + writer1 finishes
> > +
> > +This is very much like a writer being preempted by an interrupt and
> > +the interrupt doing a write as well.
> > +
> > +Readers can happen at any time. But no two readers may run at the
> > +same time, nor can a reader preempt another reader. A reader can not preempt
> > +a writer, but it may read/consume from the buffer at the same time as
> > +a writer is writing, but the reader must be on another processor.
> > +
> > +A writer can preempt a reader, but a reader can not preempt a writer.
> > +But a reader can read the buffer at the same time (on another processor)
> > +as a writer.
> > +
> > +The ring buffer is made up of a list of pages held together by a link list.
> > +
> > +At initialization a reader page is allocated for the reader that is not
> > +part of the ring buffer.
> > +
> > +The head_page, tail_page and commit_page are all initialized to point
> > +to the same page.
> > +
> > +The reader page is initialized to have its next pointer pointing to
> > +the head page, and its previous pointer pointing to a page before
> > +the head page.
> > +
> > +The reader has its own page to use. At start up time, this page is
> > +allocated but is not attached to the list. When the reader wants
> > +to read from the buffer, if its page is empty (like it is on start up)
> > +it will swap its page with the head_page. The old reader page will
> > +become part of the ring buffer and the head_page will be removed.
> > +A new head page goes to the page after the old head page (but not
> > +the page that was swapped in).
>
>
>
> I wonder if you could reformulate this last sentence. It took me
> some time to understand it.
Yuck, that last sentence is ugly.
>
>
> I first understood it as:
>
> """
> A new page which comes from nowhere is
> going to become a (and not "the") head page. Moreover, it will
> be pointed by old_head_page->next...(which is actually true btw),
> but this new head page will not be the next pointer on the page
> that has just been swapped in.
> """
>
> Well, actually may be it's because my english understanding is a bit....
No, I think I wrote that at 3am.
How about this:
"The page after the inserted page (old reader_page) will become the new
head page."
?
>
>
>
> > +
> > +Once the new page is given to the reader, the reader could do what
> > +it wants with it, as long as a writer has left that page.
> > +
> > +A sample of how the reader page is swapped: Note this does not
> > +show the head page in the buffer, it is for demonstrating a swap
> > +only.
Note above.
> > +
> > + +------+
> > + |reader| RING BUFFER
> > + |page |
> > + +------+
> > + +---+ +---+ +---+
> > + | |-->| |-->| |
> > + | |<--| |<--| |
> > + +---+ +---+ +---+
> > + ^ | ^ |
> > + | +-------------+ |
> > + +-----------------+
>
>
>
> But may be you could also show the head page at the same time,
> that would help the readers IMO (not those on the ring buffer,
> but at least those from real life who can preempt several things..)
I could add the H, but I just wanted to concentrate on the swap without
having too many details. But if you think the H would help, I'm fine with
it.
>
>
> +------+
> |reader| RING BUFFER
> |page |
> +------+
> +---+ +---+ +---+
> | |-->| |-->| |
> | H |<--| |<--| |
> +---+ +---+ +---+
> ^ | ^ |
> | +-------------+ |
> +-----------------+
>
>
>
>
> > +
> > + +------+
> > + |reader| RING BUFFER
> > + |page |-------------------+
> > + +------+ v
> > + | +---+ +---+ +---+
> > + | | |-->| |-->| |
> > + | | |<--| |<--| |<-+
> > + | +---+ +---+ +---+ |
> > + | ^ | ^ | |
> > + | | +-------------+ | |
> > + | +-----------------+ |
> > + +------------------------------------+
>
>
>
> +------+
> |reader| RING BUFFER
> |page |-------------------+
> +------+ v
> | +---+ +---+ +---+
> | | |-->| |-->| |
> | | H |<--| |<--| |<-+
> | +---+ +---+ +---+ |
> | ^ | ^ | |
> | | +-------------+ | |
> | +-----------------+ |
> +------------------------------------+
>
>
>
> > + +------+
> > + |reader| RING BUFFER
> > + |page |-------------------+
> > + +------+ <---------------+ v
> > + | ^ +---+ +---+ +---+
> > + | | | |-->| |-->| |
> > + | | | |<--| |<--| |<-+
> > + | | +---+ +---+ +---+ |
> > + | | | ^ | |
> > + | | +-------------+ | |
> > + | +-----------------------------+ |
> > + +------------------------------------+
>
>
>
> +------+
> |reader| RING BUFFER
> |page |-------------------+
> +------+ <---------------+ v
> | ^ +---+ +---+ +---+
> | | | |-->| |-->| |
> | | | H |<--| |<--| |<-+
> | | +---+ +---+ +---+ |
> | | | ^ | |
> | | +-------------+ | |
> | +-----------------------------+ |
> +------------------------------------+
Oooh, you cut and paste the error in the above. Do you see it?
>
>
>
>
> > + +------+
> > + |buffer| RING BUFFER
> > + |page |-------------------+
> > + +------+ <---------------+ v
> > + | ^ +---+ +---+ +---+
> > + | | | | | |-->| |
> > + | | New | | | |<--| |<-+
> > + | | Reader +---+ +---+ +---+ |
> > + | | page ----^ | |
> > + | | | |
> > + | +-----------------------------+ |
> > + +------------------------------------+
> > +
>
>
>
> +------+
> |buffer| RING BUFFER
> |page |-------------------+
> +------+ <---------------+ v
> | ^ +---+ +---+ +---+
> | | | | | |-->| |
> | | New | | | H |<--| |<-+
> | | Reader +---+ +---+ +---+ |
> | | page ----^ | |
> | | | |
> | +-----------------------------+ |
> +------------------------------------+
>
>
> Sorry it was too tempting to try out some ascii art too,
> but also it's an occasion to tell me if I misunderstood something
> about the head page.
Yeah, the above seems correct (except for the error you left in).
>
>
> > +
> > +It is possible that the page swapped is the commit page and the tail page,
> > +if what is in the ring buffer is less than what is held in a buffer page.
> > +
> > +
> > + reader page commit page tail page
> > + | | |
> > + v | |
> > + +---+ | |
> > + | |<----------+ |
> > + | |<------------------------+
> > + | |------+
> > + +---+ |
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |--->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +This case is still legal for this algorithm.
> > +When the writer leaves the page, it simply goes into the ring buffer
> > +since the reader page still points to the next location in the ring
> > +buffer.
> > +
> > +
> > +The main pointers:
> > +
> > + reader page - The page used solely by the reader and is not part
> > + of the ring buffer (may be swapped in)
> > +
> > + head page - the next page in the ring buffer that will be swapped
> > + with the reader page.
> > +
> > + tail page - the page where the next write will take place.
> > +
> > + commit page - the page that last finished a write.
> > +
> > +The commit page only is updated by the outer most writer in the
> > +writer stack. A writer that preempts another writer will not move the
> > +commit page.
>
>
>
> Btw, how do you check that? Is there a nesting counter or something?
Because only the writer that reserves the pointer after the commit, is the
committer.
static int
rb_is_commit(struct ring_buffer_per_cpu *cpu_buffer,
struct ring_buffer_event *event)
{
unsigned long addr = (unsigned long)event;
unsigned long index;
index = rb_event_index(event);
addr &= PAGE_MASK;
return cpu_buffer->commit_page->page == (void *)addr &&
rb_commit_index(cpu_buffer) == index;
}
Although, I'm thinking of replacing it with a counter. May eliminate some
of the tight races I need to prevent. And may even clean up the code, and
speed it up.
Hmm, I may implement that now.
>
>
>
> > +
> > +When data is written into the ring buffer, a position is reserved
> > +in the ring buffer and passed back to the writer. When the writer
> > +is finished writing data into that position, it commits the write.
> > +
> > +Another write (or a read) may take place at anytime during this
> > +transaction. If another write happens it must finish before continuing
> > +with the previous write.
>
>
> [...]
>
>
> > +Nested writes
> > +-------------
> > +
> > +In the pushing forward of the tail page we must first push forward
> > +the head page if the head page is the next page. If the head page
> > +is not the next page, the tail page is simply updated with a cmpxchg.
> > +
> > +Only writers move the tail page. This must be done atomically to protect
> > +against nested writers.
> > +
> > + temp_page = tail_page
> > + next_page = temp_page->next
> > + cmpxchg(tail_page, temp_page, next_page)
> > +
> > +The above will update the tail page if it is still pointing to the expected
> > +page. If this fails, a nested write pushed it forward, the the current write
> > +does not need to push it.
> > +
> > +
> > + temp page
> > + |
> > + v
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |--->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +Nested write comes in and moves the tail page forward:
> > +
> > + tail page (moved by nested writer)
> > + temp page |
> > + | |
> > + v v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |--->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +The above would fail the cmpxchg, but since the tail page has already
> > +been moved forward, the writer will just try again to reserve storage
> > +on the new tail page.
> > +
> > +But the moving of the head page is a bit more complex.
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-H->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +The write converts the head page pointer to UPDATE.
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +But if a nested writer preempts here. It will see that the next
> > +page is a head page, but it is also nested. It will detect that
> > +it is nested and will save that information. The detection is the
> > +fact that it sees the UPDATE flag instead of a HEADER or NORMAL
> > +pointer.
> > +
> > +The nested writer will set the new head page pointer.
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-H->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +But it will not reset the update back to normal. Only the writer
> > +that converted a pointer from HEAD to UPDATE will convert it back
> > +to NORMAL.
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-H->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +After the nested writer finishes, the outer most writer will convert
> > +the UPDATE pointer to NORMAL.
> > +
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |--->| |-H->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +
> > +It can be even more complex if several nested writes came in and moved
> > +the tail page ahead several pages:
> > +
> > +
> > +(first writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-H->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +The write converts the head page pointer to UPDATE.
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |--->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +Next writer comes in, and sees the update and sets up the new
> > +head page.
> > +
> > +(second writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-H->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +The nested writer moves the tail page forward. But does not set the old
> > +update page to NORMAL because it is not the outer most writer.
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-H->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +Another writer preempts and sees the page after the tail page is a head page.
> > +It changes it from HEAD to UPDATE.
> > +
> > +(third writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-U->| |--->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +The writer will move the head page forward:
> > +
> > +
> > +(third writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-U->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +But now that the third writer did change the HEAD flag to UPDATE it
> > +will convert it to normal:
> > +
> > +
> > +(third writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |--->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +
> > +Then it will move the tail page, and return back to the second writer.
> > +
> > +
> > +(second writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |--->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +
> > +The second writer will fail to move the tail page because it was already
> > +moved, so it will try again and add its data to the new tail page.
> > +It will return to the first writer.
> > +
> > +
> > +(first writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |--->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +The first writer can not know atomically test if the tail page moved
> > +while it updates the HEAD page. It will then update the head page to
> > +what it thinks is the new head page.
> > +
> > +
> > +(first writer)
> > +
> > + tail page
> > + |
> > + v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-H->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +Since the cmpxchg returns the old value of the pointer the first writer
> > +will see it succeeded in updating the pointer from NORMAL to HEAD.
> > +But as we can see, this is not good enough. It must also check to see
> > +if the tail page is either where it use to be or on the next page:
> > +
> > +
> > +(first writer)
> > +
> > + A B tail page
> > + | | |
> > + v v v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |-H->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +If tail page != A and tail page does not equal B, then it must reset the
> > +pointer back to NORMAL. The fact that it only needs to worry about
> > +nested writers, it only needs to check this after setting the HEAD page.
> > +
> > +
> > +(first writer)
> > +
> > + A B tail page
> > + | | |
> > + v v v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |-U->| |--->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
> > +
> > +Now the writer can update the head page. This is also why the head page must
> > +remain in UPDATE and only reset by the outer most writer. This prevents
> > +the reader from seeing the incorrect head page.
> > +
> > +
> > +(first writer)
> > +
> > + A B tail page
> > + | | |
> > + v v v
> > + +---+ +---+ +---+ +---+
> > +<---| |--->| |--->| |--->| |-H->
> > +--->| |<---| |<---| |<---| |<---
> > + +---+ +---+ +---+ +---+
>
>
> Even more tricky!
>
> I just have a stupid question: why can't this be done
> only through HEAD and NORMAL flags?
>
> There is something certainly very obvious that I'm missing
> with the point of the UPDATE flag.
If you can demonstrate how to do the above lockless with just HEAD and
NORMAL, then sure, I'm all ears ;-)
When we switch the HEAD to UPDATE, we stop the reader from moving forward
and being another thing to handle while we move the HEAD forward. A reader
does a cmpxchg to move the head too, and that cmpxchg will always fail if
the pointer is has UPDATE set. The reader will just spin until it
succeeds.
Then, the rest of the moving of the header is just races with other
writers that are always on the same CPU, and it becomes a recursive
problem and not a parallel one.
-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists