lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 14 Jun 2009 00:36:38 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Theodore Tso <tytso@....edu>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Mathieu Desnoyers <compudj@...stal.dyndns.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	"Martin J. Bligh" <mbligh@...igh.org>,
	Christoph Hellwig <hch@...radead.org>,
	Li Zefan <lizf@...fujitsu.com>,
	Huang Ying <ying.huang@...el.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
	Masami Hiramatsu <mhiramat@...hat.com>
Subject: Re: [PATCH 3/3] ring-buffer: add design document

On Fri, Jun 12, 2009 at 10:16:45PM -0400, Steven Rostedt wrote:
> 
> On Sat, 13 Jun 2009, Frederic Weisbecker wrote:
> 
> > On Wed, Jun 10, 2009 at 03:53:14PM -0400, Steven Rostedt wrote:
> > > From: Steven Rostedt <srostedt@...hat.com>
> > > 
> > > This adds the design document for the ring buffer and also
> > > explains how it is designed to have lockless writes.
> > > 
> > > Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> > > ---
> > >  Documentation/trace/ring-buffer-design.txt |  949 ++++++++++++++++++++++++++++
> > >  1 files changed, 949 insertions(+), 0 deletions(-)
> > >  create mode 100644 Documentation/trace/ring-buffer-design.txt
> > > 
> > > diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
> > > new file mode 100644
> > > index 0000000..cca290b
> > > --- /dev/null
> > > +++ b/Documentation/trace/ring-buffer-design.txt
> > > @@ -0,0 +1,949 @@
> > > +		Lockless Ring Buffer Design
> > > +		===========================
> > > +
> > > +Copyright 2009 Red Hat Inc.
> > > +   Author:   Steven Rostedt <srostedt@...hat.com>
> > > +  License:   The GNU Free Documentation License, Version 1.2
> > > +               (dual licensed under the GPL v2)
> > > +
> > > +Written for: 2.6.31
> > > +
> > > +Terminology used in this Document
> > > +---------------------------------
> > > +
> > > +tail - where new writes happen in the ring buffer.
> > > +
> > > +head - where new reads happen in the ring buffer.
> > > +
> > > +producer - the task that writes into the ring buffer (same as writer)
> > > +
> > > +writer - same as producer
> > > +
> > > +consumer - the task that reads from the buffer (same as reader)
> > > +
> > > +reader - same as consumer.
> > > +
> > > +reader_page - A page outside the ring buffer used solely (for the most part)
> > > +    by the reader.
> > > +
> > > +head_page - a pointer to the page that the reader will use next
> > > +
> > > +tail_page - a pointer to the page that will be written to next
> > > +
> > > +commit_page - a pointer to the page with the last finished non nested write.
> > > +
> > > +cmpxchg - hardware assisted atomic transaction that performs the following:
> > > +
> > > +   A = B iff previous A == C
> > > +
> > > +   R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
> > > +      current A is equal to C, and we put the old (current) A into R
> > > +
> > > +   R gets the previous A regardless if A is updated with B or not.
> > > +
> > > +   To see if the update was successful a compare of R == C may be used.
> > > +
> > > +The Generic Ring Buffer
> > > +-----------------------
> > > +
> > > +The ring buffer can be used in either an overwrite mode or in
> > > +producer/consumer mode.
> > > +
> > > +Producer/consumer mode is where the producer were to fill up the
> > > +buffer before the consumer could free up anything, the producer
> > > +will stop writing to the buffer. This will lose most recent events.
> > > +
> > > +Overwrite mode is where the produce were to fill up the buffer
> > > +before the consumer could free up anything, the producer will
> > > +overwrite the older data. This will lose the oldest events.
> > > +
> > > +No two writers can write at the same time (on the same per cpu buffer),
> > > +but a writer may preempt another writer, but it must finish writing
> > > +before the previous writer may continue. This is very important to the
> > > +algorithm. The writers act like a "stack".
> > > +
> > > +
> > > +  writer1 start
> > > +     <preempted> writer2 start
> > > +         <preempted> writer3 start
> > > +                     writer3 finishes
> > > +                 writer2 finishes
> > > +  writer1 finishes
> > > +
> > > +This is very much like a writer being preempted by an interrupt and
> > > +the interrupt doing a write as well.
> > > +
> > > +Readers can happen at any time. But no two readers may run at the
> > > +same time, nor can a reader preempt another reader. A reader can not preempt
> > > +a writer, but it may read/consume from the buffer at the same time as
> > > +a writer is writing, but the reader must be on another processor.
> > > +
> > > +A writer can preempt a reader, but a reader can not preempt a writer.
> > > +But a reader can read the buffer at the same time (on another processor)
> > > +as a writer.
> > > +
> > > +The ring buffer is made up of a list of pages held together by a link list.
> > > +
> > > +At initialization a reader page is allocated for the reader that is not
> > > +part of the ring buffer.
> > > +
> > > +The head_page, tail_page and commit_page are all initialized to point
> > > +to the same page.
> > > +
> > > +The reader page is initialized to have its next pointer pointing to
> > > +the head page, and its previous pointer pointing to a page before
> > > +the head page.
> > > +
> > > +The reader has its own page to use. At start up time, this page is
> > > +allocated but is not attached to the list. When the reader wants
> > > +to read from the buffer, if its page is empty (like it is on start up)
> > > +it will swap its page with the head_page. The old reader page will
> > > +become part of the ring buffer and the head_page will be removed.
> > > +A new head page goes to the page after the old head page (but not
> > > +the page that was swapped in).
> > 
> > 
> > 
> > I wonder if you could reformulate this last sentence. It took me
> > some time to understand it.
> 
> Yuck, that last sentence is ugly.
> 
> > 
> > 
> > I first understood it as:
> > 
> > """
> > A new page which comes from nowhere is
> > going to become a (and not "the") head page. Moreover, it will
> > be pointed by old_head_page->next...(which is actually true btw),
> > but this new head page will not be the next pointer on the page
> > that has just been swapped in.
> > """
> > 
> > Well, actually may be it's because my english understanding is a bit....
> 
> No, I think I wrote that at 3am.
> 
> How about this:
> 
> "The page after the inserted page (old reader_page) will become the new 
> head page."
> 
> ?


Perfect!

 
> > 
> > 
> > 
> > > +
> > > +Once the new page is given to the reader, the reader could do what
> > > +it wants with it, as long as a writer has left that page.
> > > +
> 
> 
> 
> > > +A sample of how the reader page is swapped: Note this does not
> > > +show the head page in the buffer, it is for demonstrating a swap
> > > +only.
> 
> Note above.
> 
> 
> > > +
> > > +  +------+
> > > +  |reader|          RING BUFFER
> > > +  |page  |
> > > +  +------+
> > > +                  +---+   +---+   +---+
> > > +                  |   |-->|   |-->|   |
> > > +                  |   |<--|   |<--|   |
> > > +                  +---+   +---+   +---+
> > > +                   ^ |             ^ |
> > > +                   | +-------------+ |
> > > +                   +-----------------+
> > 
> > 
> > 
> > But may be you could also show the head page at the same time,
> > that would help the readers IMO (not those on the ring buffer,
> > but at least those from real life who can preempt several things..)
> 
> I could add the H, but I just wanted to concentrate on the swap without 
> having too many details. But if you think the H would help, I'm fine with 
> it.
> 


You're right. It's better to only keep the page swapping picture. The
header page is explained just after anyway.


> > 
> > 
> >   +------+
> >   |reader|          RING BUFFER
> >   |page  |
> >   +------+
> >                   +---+   +---+   +---+
> >                   |   |-->|   |-->|   |
> >                   | H |<--|   |<--|   |
> >                   +---+   +---+   +---+
> >                    ^ |             ^ |
> >                    | +-------------+ |
> >                    +-----------------+
> > 
> > 
> > 
> > 
> > > +
> > > +  +------+
> > > +  |reader|          RING BUFFER
> > > +  |page  |-------------------+
> > > +  +------+                   v
> > > +    |             +---+   +---+   +---+
> > > +    |             |   |-->|   |-->|   |
> > > +    |             |   |<--|   |<--|   |<-+
> > > +    |             +---+   +---+   +---+  |
> > > +    |              ^ |             ^ |   |
> > > +    |              | +-------------+ |   |
> > > +    |              +-----------------+   |
> > > +    +------------------------------------+
> > 
> > 
> > 
> >   +------+
> >   |reader|          RING BUFFER
> >   |page  |-------------------+
> >   +------+                   v
> >     |             +---+   +---+   +---+
> >     |             |   |-->|   |-->|   |
> >     |             | H |<--|   |<--|   |<-+
> >     |             +---+   +---+   +---+  |
> >     |              ^ |             ^ |   |
> >     |              | +-------------+ |   |
> >     |              +-----------------+   |
> >     +------------------------------------+
> > 
> > 
> > 
> > > +  +------+
> > > +  |reader|          RING BUFFER
> > > +  |page  |-------------------+
> > > +  +------+ <---------------+ v
> > > +    |  ^          +---+   +---+   +---+
> > > +    |  |          |   |-->|   |-->|   |
> > > +    |  |          |   |<--|   |<--|   |<-+
> > > +    |  |          +---+   +---+   +---+  |
> > > +    |  |             |             ^ |   |
> > > +    |  |             +-------------+ |   |
> > > +    |  +-----------------------------+   |
> > > +    +------------------------------------+
> > 
> > 
> > 
> >   +------+
> >   |reader|          RING BUFFER
> >   |page  |-------------------+
> >   +------+ <---------------+ v
> >     |  ^          +---+   +---+   +---+
> >     |  |          |   |-->|   |-->|   |
> >     |  |          | H |<--|   |<--|   |<-+
> >     |  |          +---+   +---+   +---+  |
> >     |  |             |             ^ |   |
> >     |  |             +-------------+ |   |
> >     |  +-----------------------------+   |
> >     +------------------------------------+
> 
> Oooh, you cut and paste the error in the above. Do you see it?



Yeah but I was too lazy to fix it, and the mistake has already
been reported :)



> > 
> > 
> > 
> > 
> > > +  +------+
> > > +  |buffer|          RING BUFFER
> > > +  |page  |-------------------+
> > > +  +------+ <---------------+ v
> > > +    |  ^          +---+   +---+   +---+
> > > +    |  |          |   |   |   |-->|   |
> > > +    |  |  New     |   |   |   |<--|   |<-+
> > > +    |  | Reader   +---+   +---+   +---+  |
> > > +    |  |  page ----^                 |   |
> > > +    |  |                             |   |
> > > +    |  +-----------------------------+   |
> > > +    +------------------------------------+
> > > +
> > 
> > 
> > 
> >   +------+
> >   |buffer|          RING BUFFER
> >   |page  |-------------------+
> >   +------+ <---------------+ v
> >     |  ^          +---+   +---+   +---+
> >     |  |          |   |   |   |-->|   |
> >     |  |  New     |   |   | H |<--|   |<-+
> >     |  | Reader   +---+   +---+   +---+  |
> >     |  |  page ----^                 |   |
> >     |  |                             |   |
> >     |  +-----------------------------+   |
> >     +------------------------------------+
> > 
> > 
> > Sorry it was too tempting to try out some ascii art too,
> > but also it's an occasion to tell me if I misunderstood something
> > about the head page.
> 
> Yeah, the above seems correct (except for the error you left in).
> 
> > 
> > 
> > > +
> > > +It is possible that the page swapped is the commit page and the tail page,
> > > +if what is in the ring buffer is less than what is held in a buffer page.
> > > +
> > > +
> > > +          reader page    commit page   tail page
> > > +              |              |             |
> > > +              v              |             |
> > > +             +---+           |             |
> > > +             |   |<----------+             |
> > > +             |   |<------------------------+
> > > +             |   |------+
> > > +             +---+      |
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |--->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +This case is still legal for this algorithm.
> > > +When the writer leaves the page, it simply goes into the ring buffer
> > > +since the reader page still points to the next location in the ring
> > > +buffer.
> > > +
> > > +
> > > +The main pointers:
> > > +
> > > +  reader page - The page used solely by the reader and is not part
> > > +                of the ring buffer (may be swapped in)
> > > +
> > > +  head page - the next page in the ring buffer that will be swapped
> > > +              with the reader page.
> > > +
> > > +  tail page - the page where the next write will take place.
> > > +
> > > +  commit page - the page that last finished a write.
> > > +
> > > +The commit page only is updated by the outer most writer in the
> > > +writer stack. A writer that preempts another writer will not move the
> > > +commit page.
> > 
> > 
> > 
> > Btw, how do you check that? Is there a nesting counter or something?
> 
> Because only the writer that reserves the pointer after the commit, is the 
> committer.
> 
> static int
> rb_is_commit(struct ring_buffer_per_cpu *cpu_buffer,
>              struct ring_buffer_event *event)
> {
>         unsigned long addr = (unsigned long)event;
>         unsigned long index;
> 
>         index = rb_event_index(event);
>         addr &= PAGE_MASK;
> 
>         return cpu_buffer->commit_page->page == (void *)addr &&
>                 rb_commit_index(cpu_buffer) == index;
> }
> 
> 
> Although, I'm thinking of replacing it with a counter. May eliminate some 
> of the tight races I need to prevent. And may even clean up the code, and 
> speed it up.
> 
> Hmm, I may implement that now.
> 


Yeah it should be sufficient I guess.



> > 
> > 
> > 
> > > +
> > > +When data is written into the ring buffer, a position is reserved
> > > +in the ring buffer and passed back to the writer. When the writer
> > > +is finished writing data into that position, it commits the write.
> > > +
> > > +Another write (or a read) may take place at anytime during this
> > > +transaction. If another write happens it must finish before continuing
> > > +with the previous write.
> > 
> > 
> > [...]
> > 
> > 
> > > +Nested writes
> > > +-------------
> > > +
> > > +In the pushing forward of the tail page we must first push forward
> > > +the head page if the head page is the next page. If the head page
> > > +is not the next page, the tail page is simply updated with a cmpxchg.
> > > +
> > > +Only writers move the tail page. This must be done atomically to protect
> > > +against nested writers.
> > > +
> > > +  temp_page = tail_page
> > > +  next_page = temp_page->next
> > > +  cmpxchg(tail_page, temp_page, next_page)
> > > +
> > > +The above will update the tail page if it is still pointing to the expected
> > > +page. If this fails, a nested write pushed it forward, the the current write
> > > +does not need to push it.
> > > +
> > > +
> > > +           temp page
> > > +               |
> > > +               v
> > > +            tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |--->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +Nested write comes in and moves the tail page forward:
> > > +
> > > +                    tail page (moved by nested writer)
> > > +            temp page   |
> > > +               |        |
> > > +               v        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |--->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +The above would fail the cmpxchg, but since the tail page has already
> > > +been moved forward, the writer will just try again to reserve storage
> > > +on the new tail page.
> > > +
> > > +But the moving of the head page is a bit more complex.
> > > +
> > > +            tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-H->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +The write converts the head page pointer to UPDATE.
> > > +
> > > +            tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +But if a nested writer preempts here. It will see that the next
> > > +page is a head page, but it is also nested. It will detect that
> > > +it is nested and will save that information. The detection is the
> > > +fact that it sees the UPDATE flag instead of a HEADER or NORMAL
> > > +pointer.
> > > +
> > > +The nested writer will set the new head page pointer.
> > > +
> > > +           tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-H->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +But it will not reset the update back to normal. Only the writer
> > > +that converted a pointer from HEAD to UPDATE will convert it back
> > > +to NORMAL.
> > > +
> > > +                    tail page
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-H->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +After the nested writer finishes, the outer most writer will convert
> > > +the UPDATE pointer to NORMAL.
> > > +
> > > +
> > > +                    tail page
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |--->|   |-H->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +
> > > +It can be even more complex if several nested writes came in and moved
> > > +the tail page ahead several pages:
> > > +
> > > +
> > > +(first writer)
> > > +
> > > +            tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-H->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +The write converts the head page pointer to UPDATE.
> > > +
> > > +            tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |--->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +Next writer comes in, and sees the update and sets up the new
> > > +head page.
> > > +
> > > +(second writer)
> > > +
> > > +           tail page
> > > +               |
> > > +               v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-H->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +The nested writer moves the tail page forward. But does not set the old
> > > +update page to NORMAL because it is not the outer most writer.
> > > +
> > > +                    tail page
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-H->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +Another writer preempts and sees the page after the tail page is a head page.
> > > +It changes it from HEAD to UPDATE.
> > > +
> > > +(third writer)
> > > +
> > > +                    tail page
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-U->|   |--->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +The writer will move the head page forward:
> > > +
> > > +
> > > +(third writer)
> > > +
> > > +                    tail page
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-U->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +But now that the third writer did change the HEAD flag to UPDATE it
> > > +will convert it to normal:
> > > +
> > > +
> > > +(third writer)
> > > +
> > > +                    tail page
> > > +                        |
> > > +                        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |--->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +
> > > +Then it will move the tail page, and return back to the second writer.
> > > +
> > > +
> > > +(second writer)
> > > +
> > > +                             tail page
> > > +                                 |
> > > +                                 v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |--->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +
> > > +The second writer will fail to move the tail page because it was already
> > > +moved, so it will try again and add its data to the new tail page.
> > > +It will return to the first writer.
> > > +
> > > +
> > > +(first writer)
> > > +
> > > +                             tail page
> > > +                                 |
> > > +                                 v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |--->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +The first writer can not know atomically test if the tail page moved
> > > +while it updates the HEAD page. It will then update the head page to
> > > +what it thinks is the new head page.
> > > +
> > > +
> > > +(first writer)
> > > +
> > > +                             tail page
> > > +                                 |
> > > +                                 v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-H->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +Since the cmpxchg returns the old value of the pointer the first writer
> > > +will see it succeeded in updating the pointer from NORMAL to HEAD.
> > > +But as we can see, this is not good enough. It must also check to see
> > > +if the tail page is either where it use to be or on the next page:
> > > +
> > > +
> > > +(first writer)
> > > +
> > > +               A        B    tail page
> > > +               |        |        |
> > > +               v        v        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |-H->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +If tail page != A and tail page does not equal B, then it must reset the
> > > +pointer back to NORMAL. The fact that it only needs to worry about
> > > +nested writers, it only needs to check this after setting the HEAD page.
> > > +
> > > +
> > > +(first writer)
> > > +
> > > +               A        B    tail page
> > > +               |        |        |
> > > +               v        v        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |-U->|   |--->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > > +
> > > +Now the writer can update the head page. This is also why the head page must
> > > +remain in UPDATE and only reset by the outer most writer. This prevents
> > > +the reader from seeing the incorrect head page.
> > > +
> > > +
> > > +(first writer)
> > > +
> > > +               A        B    tail page
> > > +               |        |        |
> > > +               v        v        v
> > > +    +---+    +---+    +---+    +---+
> > > +<---|   |--->|   |--->|   |--->|   |-H->
> > > +--->|   |<---|   |<---|   |<---|   |<---
> > > +    +---+    +---+    +---+    +---+
> > 
> > 
> > Even more tricky!
> > 
> > I just have a stupid question: why can't this be done
> > only through HEAD and NORMAL flags?
> > 
> > There is something certainly very obvious that I'm missing
> > with the point of the UPDATE flag.
> 
> If you can demonstrate how to do the above lockless with just HEAD and 
> NORMAL, then sure, I'm all ears ;-)
> 
> When we switch the HEAD to UPDATE, we stop the reader from moving forward 
> and being another thing to handle while we move the HEAD forward. A reader 
> does a cmpxchg to move the head too, and that cmpxchg will always fail if 
> the pointer is has UPDATE set. The reader will just spin until it 
> succeeds.


Aah, so it's here to protect against paralell readers from another cpu
reading the current cpu buffer, right?


> Then, the rest of the moving of the header is just races with other 
> writers that are always on the same CPU, and it becomes a recursive 
> problem and not a parallel one.
> 
> -- Steve
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ