linux-kernel - Re: [PATCH 3/3] ring-buffer: add design document

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090611035928.GA15035@Krystal>
Date:	Wed, 10 Jun 2009 23:59:28 -0400
From:	Mathieu Desnoyers <compudj@...stal.dyndns.org>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Theodore Tso <tytso@....edu>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	"Martin J. Bligh" <mbligh@...igh.org>,
	Christoph Hellwig <hch@...radead.org>,
	Li Zefan <lizf@...fujitsu.com>,
	Huang Ying <ying.huang@...el.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
	Masami Hiramatsu <mhiramat@...hat.com>,
	ltt-dev@...ts.casi.polymtl.ca,
	Robert Wisniewski <bob@...son.ibm.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: Re: [PATCH 3/3] ring-buffer: add design document

* Mathieu Desnoyers (compudj@...stal.dyndns.org) wrote:
> * Steven Rostedt (rostedt@...dmis.org) wrote:
> > 
> > On Wed, 10 Jun 2009, Mathieu Desnoyers wrote:
> > > * Steven Rostedt (rostedt@...dmis.org) wrote:
> > > > +The Generic Ring Buffer
> > > > +-----------------------
> > > > +
> > > > +The ring buffer can be used in either an overwrite mode or in
> > > > +producer/consumer mode.
> > > > +
> > > > +Producer/consumer mode is where the producer were to fill up the
> > > > +buffer before the consumer could free up anything, the producer
> > > > +will stop writing to the buffer. This will lose most recent events.
> > > > +
> > > > +Overwrite mode is where the produce were to fill up the buffer
> > > > +before the consumer could free up anything, the producer will
> > > > +overwrite the older data. This will lose the oldest events.
> > > > +
> > > > +No two writers can write at the same time (on the same per cpu buffer),
> > > > +but a writer may preempt another writer, but it must finish writing
> > > 
> > > Hi Steven,
> > > 
> > > I would use "interrupt" instead of "preempt" here, given that preemption
> > > implies scheduler activity which is specifically forbidden here.
> > 
> > Good point, I'll update it.
> > 
> 
> Please also look thorough the code... at some places you seem to imply
> that the reader "must" be on a remote CPU, when you actually mean
> "could" be on a remote or local cpu.
> 
> > > 
> > > > +before the previous writer may continue. This is very important to the
> > > > +algorithm. The writers act like a "stack".
> > > > +
> > > > +
> > > > +  writer1 start
> > > > +     <preempted> writer2 start
> > > > +         <preempted> writer3 start
> > > > +                     writer3 finishes
> > > > +                 writer2 finishes
> > > > +  writer1 finishes
> > > > +
> > > > +This is very much like a writer being preempted by an interrupt and
> > > > +the interrupt doing a write as well.
> > > > +
> > > > +Readers can happen at any time. But no two readers may run at the
> > > > +same time, nor can a reader preempt another reader. A reader can not preempt
> > > > +a writer, but it may read/consume from the buffer at the same time as
> > > > +a writer is writing, but the reader must be on another processor.
> > > > +
> > > > +A writer can preempt a reader, but a reader can not preempt a writer.
> > > > +But a reader can read the buffer at the same time (on another processor)
> > > > +as a writer.
> > > > +
> > > 
> > > This comment is inconsistent with the following code comment :
> > > 
> > > "* Reads can happen on any CPU."
> > > 
> > > Readers should be allowed to read from their own cpu's buffers too, and
> > > support being interrupted by an incoming interrupt writer, but this
> > > design document does not discuss this case. Is it at all supported ? If
> > > not, then this algorithm would not work on uniprocessor.
> > 
> > Yes it is supported. That's what I mean by "A writer can preempt a 
> > reader". I'll change it to "A writer can interrupt a reader". Would that 
> > sound better?
> > 
> > But a reader can not interrupt a writer. I hope you don't plan on doing 
> > reads of the ring buffer from an interrupt.
> > 
> 
> Yep.
> 
> > 
> > > 
> > > > +The ring buffer is made up of a list of pages held together by a link list.
> > > > +
> > > > +At initialization a reader page is allocated for the reader that is not
> > > > +part of the ring buffer.
> > > > +
> > > > +The head_page, tail_page and commit_page are all initialized to point
> > > > +to the same page.
> > > > +
> > > > +The reader page is initialized to have its next pointer pointing to
> > > > +the head page, and its previous pointer pointing to a page before
> > > > +the head page.
> > > > +
> > > > +The reader has its own page to use. At start up time, this page is
> > > > +allocated but is not attached to the list. When the reader wants
> > > > +to read from the buffer, if its page is empty (like it is on start up)
> > > > +it will swap its page with the head_page. The old reader page will
> > > > +become part of the ring buffer and the head_page will be removed.
> > > > +A new head page goes to the page after the old head page (but not
> > > > +the page that was swapped in).
> > > > +
> > > > +Once the new page is given to the reader, the reader could do what
> > > > +it wants with it, as long as a writer has left that page.
> > > > +
> > > > +A sample of how the reader page is swapped: Note this does not
> > > > +show the head page in the buffer, it is for demonstrating a swap
> > > > +only.
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |
> > > > +  +------+
> > > > +                  +---+   +---+   +---+
> > > > +                  |   |-->|   |-->|   |
> > > > +                  |   |<--|   |<--|   |
> > > > +                  +---+   +---+   +---+
> > > > +                   ^ |             ^ |
> > > > +                   | +-------------+ |
> > > > +                   +-----------------+
> > > > +
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |-------------------+
> > > > +  +------+                   v
> > > > +    |             +---+   +---+   +---+
> > > > +    |             |   |-->|   |-->|   |
> > > > +    |             |   |<--|   |<--|   |<-+
> > > > +    |             +---+   +---+   +---+  |
> > > > +    |              ^ |             ^ |   |
> > > > +    |              | +-------------+ |   |
> > > > +    |              +-----------------+   |
> > > > +    +------------------------------------+
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |-------------------+
> > > > +  +------+ <---------------+ v
> > > > +    |  ^          +---+   +---+   +---+
> > > > +    |  |          |   |-->|   |-->|   |
> > > > +    |  |          |   |<--|   |<--|   |<-+
> > > > +    |  |          +---+   +---+   +---+  |
> > > > +    |  |             |             ^ |   |
> > > > +    |  |             +-------------+ |   |
> > > > +    |  +-----------------------------+   |
> > > > +    +------------------------------------+
> > > > +
> > > > +  +------+
> > > > +  |buffer|          RING BUFFER
> > > > +  |page  |-------------------+
> > > > +  +------+ <---------------+ v
> > > > +    |  ^          +---+   +---+   +---+
> > > > +    |  |          |   |   |   |-->|   |
> > > > +    |  |  New     |   |   |   |<--|   |<-+
> > > > +    |  | Reader   +---+   +---+   +---+  |
> > > > +    |  |  page ----^                 |   |
> > > > +    |  |                             |   |
> > > > +    |  +-----------------------------+   |
> > > > +    +------------------------------------+
> > > > +
> > > 
> > > Nice ascii art ;)
> > > 
> > > Some important comments below,
> > 
> > I draw best with plus's and minus's ;-)
> > 
> > > 
> > > > +
> > > > +
> > > > +It is possible that the page swapped is the commit page and the tail page,
> > > > +if what is in the ring buffer is less than what is held in a buffer page.
> > > > +
> > > > +
> > > > +          reader page    commit page   tail page
> > > > +              |              |             |
> > > > +              v              |             |
> > > > +             +---+           |             |
> > > > +             |   |<----------+             |
> > > > +             |   |<------------------------+
> > > > +             |   |------+
> > > > +             +---+      |
> > > > +                        |
> > > > +                        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +This case is still legal for this algorithm.
> > > > +When the writer leaves the page, it simply goes into the ring buffer
> > > > +since the reader page still points to the next location in the ring
> > > > +buffer.
> > > > +
> > > > +
> > > > +The main pointers:
> > > > +
> > > > +  reader page - The page used solely by the reader and is not part
> > > > +                of the ring buffer (may be swapped in)
> > > > +
> > > > +  head page - the next page in the ring buffer that will be swapped
> > > > +              with the reader page.
> > > > +
> > > > +  tail page - the page where the next write will take place.
> > > > +
> > > > +  commit page - the page that last finished a write.
> > > > +
> > > > +The commit page only is updated by the outer most writer in the
> > > > +writer stack. A writer that preempts another writer will not move the
> > > > +commit page.
> > > > +
> > > > +When data is written into the ring buffer, a position is reserved
> > > > +in the ring buffer and passed back to the writer. When the writer
> > > > +is finished writing data into that position, it commits the write.
> > > > +
> > > > +Another write (or a read) may take place at anytime during this
> > > > +transaction. If another write happens it must finish before continuing
> > > > +with the previous write.
> > > > +
> > > > +
> > > > +   Write reserve:
> > > > +
> > > > +       Buffer page
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+  <--- given back to writer (current commit)
> > > > +      |reserved |
> > > > +      +---------+ <--- tail pointer
> > > > +      | empty   |
> > > > +      +---------+
> > > > +
> > > > +   Write commit:
> > > > +
> > > > +       Buffer page
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+  <--- next positon for write (current commit)
> > > > +      | empty   |
> > > > +      +---------+
> > > > +
> > > > +
> > > > + If a write happens after the first reserve:
> > > > +
> > > > +       Buffer page
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+  <-- current commit
> > > > +      |reserved |
> > > > +      +---------+  <--- given back to second writer
> > > > +      |reserved |
> > > > +      +---------+ <--- tail pointer
> > > > +
> > > > +  After second writer commits:
> > > > +
> > > > +
> > > > +       Buffer page
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+  <--(last full commit)
> > > > +      |reserved |
> > > > +      +---------+
> > > > +      |pending  |
> > > > +      |commit   |
> > > > +      +---------+ <--- tail pointer
> > > > +
> > > > +  When the first writer commits:
> > > > +
> > > > +       Buffer page
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+
> > > > +      |written  |
> > > > +      +---------+  <--(last full commit and tail pointer)
> > > > +
> > > > +
> > > > +The commit pointer points to the last write location that was
> > > > +committed without preempting another write. When a write that
> > > > +preempted another write is committed, it only becomes a pending commit
> > > > +and will not be a full commit till all writes have been committed.
> > > > +
> > > > +The commit page points to the page that has the last full commit.
> > > > +The tail page points to the page with the last write (before
> > > > +committing).
> > > > +
> > > > +The tail page is always equal to or after the commit page. It may
> > > > +be several pages ahead. If the tail page catches up to the commit
> > > > +page then no more writes may take place (regardless of the mode
> > > > +of the ring buffer: overwrite and produce/consumer).
> > > > +
> > > > +The order of pages are:
> > > > +
> > > > + head page
> > > > + commit page
> > > > + tail page
> > > > +
> > > > +Possible scenario:
> > > > +                             tail page
> > > > +  head page         commit page  |
> > > > +      |                 |        |
> > > > +      v                 v        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +There is a special case that the head page is after either the commit page
> > > > +and possibly the tail page. That is when the commit (and tail) page has been
> > > > +swapped with the reader page. This is because the head page is always
> > > > +part of the ring buffer, but the reader page is not. When ever there
> > > > +has been less than a full page that has been committed inside the ring buffer,
> > > > +and a reader swaps out a page, it will be swapping out the commit page.
> > > > +
> > > > +
> > > > +          reader page    commit page   tail page
> > > > +              |              |             |
> > > > +              v              |             |
> > > > +             +---+           |             |
> > > > +             |   |<----------+             |
> > > > +             |   |<------------------------+
> > > > +             |   |------+
> > > > +             +---+      |
> > > > +                        |
> > > > +                        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +                        ^
> > > > +                        |
> > > > +                    head page
> > > > +
> > > > +
> > > > +In this case, the head page will not move when the tail and commit
> > > > +move back into the ring buffer.
> > > > +
> > > > +The reader can not swap a page into the ring buffer if the commit page
> > > > +is still on that page. If the read meets the last commit (real commit
> > > > +not pending or reserved), then there is nothing more to read.
> > > > +The buffer is considered empty until another full commit finishes.
> > > > +
> > > > +When the tail meets the head page, if the buffer is in overwrite mode,
> > > > +the head page will be pushed ahead one. If the buffer is in producer/consumer
> > > > +mode, the write will fail.
> > > > +
> > > > +Overwrite mode:
> > > > +
> > > > +            tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +                        ^
> > > > +                        |
> > > > +                    head page
> > > > +
> > > > +
> > > > +            tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +                                 ^
> > > > +                                 |
> > > > +                             head page
> > > > +
> > > > +
> > > > +                    tail page
> > > > +                        |
> > > > +                        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +                                 ^
> > > > +                                 |
> > > > +                             head page
> > > > +
> > > > +Note, the reader page will still point to the previous head page.
> > > > +But when a swap takes place, it will use the most recent head page.
> > > > +
> > > > +
> > > > +Making the Ring Buffer Lockless:
> > > > +--------------------------------
> > > > +
> > > > +The main idea behind the lockless algorithm is to combine the moving
> > > > +of the head_page pointer with the swapping of pages with the reader.
> > > > +State flags are placed inside the pointer to the page. To do this,
> > > > +each page must be aligned in memory by 4 bytes. This will allow the 2
> > > > +least significant bits of the address to be used as flags. Since
> > > > +they will always be zero for the address. To get the address,
> > > > +simply mask out the flags.
> > > > +
> > > > +  MASK = ~3
> > > > +
> > > > +  address & MASK
> > > > +
> > > > +Two flags will be kept by these two bits:
> > > > +
> > > > +   HEADER - the page being pointed to is a head page
> > > > +
> > > > +   UPDATE - the page being pointed to is being updated by a writer
> > > > +          and was or is about to be a head page.
> > > > +
> > > > +
> > > > +          reader page
> > > > +              |
> > > > +              v
> > > > +             +---+
> > > > +             |   |------+
> > > > +             +---+      |
> > > > +                        |
> > > > +                        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-H->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +
> > > > +The above pointer "-H->" would have the HEADER flag set. That is
> > > > +the next page is the next page to be swapped out by the reader.
> > > > +This pointer means the next page is the head page.
> > > > +
> > > > +When the tail page meets the head pointer, it will use cmpxchg to
> > > > +change the pointer to the UPDATE state:
> > > > +
> > > > +
> > > > +            tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-H->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +            tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-U->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +"-U->" represents a pointer in the UPDATE state.
> > > > +
> > > > +Any access to the reader will need to take some sort of lock to serialize
> > > > +the readers. But the writers will never take a lock to write to the
> > > > +ring buffer. This means we only need to worry about a single reader,
> > > > +and writes only preempt in "stack" formation.
> > > > +
> > > > +When the reader tries to swap the page with the ring buffer, it
> > > > +will also use cmpxchg. If the flag bit in the pointer to the
> > > > +head page does not have the HEADER flag set, the compare will fail
> > > > +and the reader will need to look for the new head page and try again.
> > > > +Note, the flag UPDATE and HEADER are never set at the same time.
> > > > +
> > > > +The reader swaps the reader page as follows:
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |
> > > > +  +------+
> > > > +                  +---+    +---+    +---+
> > > > +                  |   |--->|   |--->|   |
> > > > +                  |   |<---|   |<---|   |
> > > > +                  +---+    +---+    +---+
> > > > +                   ^ |               ^ |
> > > > +                   | +---------------+ |
> > > > +                   +-----H-------------+
> > > > +
> > > > +The reader sets the reader page next pointer as HEADER to the page after
> > > > +the head page.
> > > > +
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |-------H-----------+
> > > > +  +------+                   v
> > > > +    |             +---+    +---+    +---+
> > > > +    |             |   |--->|   |--->|   |
> > > > +    |             |   |<---|   |<---|   |<-+
> > > > +    |             +---+    +---+    +---+  |
> > > > +    |              ^ |               ^ |   |
> > > > +    |              | +---------------+ |   |
> > > > +    |              +-----H-------------+   |
> > > > +    +--------------------------------------+
> > > > +
> > > > +It does a cmpxchg with the pointer to the previous head page to make it
> > > > +point to the reader page. Note that the new pointer does not have the HEADER
> > > > +flag set.  This action atomically moves the head page forward.
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |-------H-----------+
> > > > +  +------+ <---------------+ v
> > > > +    |  ^          +---+   +---+   +---+
> > > > +    |  |          |   |-->|   |-->|   |
> > > > +    |  |          |   |<--|   |<--|   |<-+
> > > > +    |  |          +---+   +---+   +---+  |
> > > > +    |  |             |             ^ |   |
> > > > +    |  |             +-------------+ |   |
> > > > +    |  +-----------------------------+   |
> > > > +    +------------------------------------+
> > > > +
> > > > +After the new head page is set, the previous pointer of the head page is
> > > > +updated to the reader page.
> > > > +
> > > > +  +------+
> > > > +  |reader|          RING BUFFER
> > > > +  |page  |-------H-----------+
> > > > +  +------+                   v
> > > > +    |  ^          +---+   +---+   +---+
> > > > +    |  |          |   |-->|   |-->|   |
> > > > +    |  |          |   |<--|   |<--|   |<-+
> > > > +    |  |          +---+   +---+   +---+  |
> > > > +    |  |             |             ^ |   |
> > > > +    |  |             +-------------+ |   |
> > > > +    |  +-----------------------------+   |
> > > > +    +------------------------------------+
> > > > +
> > > > +  +------+
> > > > +  |buffer|          RING BUFFER
> > > > +  |page  |-------H-----------+  <--- New head page
> > > > +  +------+ <---------------+ v
> > > > +    |  ^          +---+   +---+   +---+
> > > > +    |  |          |   |   |   |-->|   |
> > > > +    |  |  New     |   |   |   |<--|   |<-+
> > > > +    |  | Reader   +---+   +---+   +---+  |
> > > > +    |  |  page ----^                 |   |
> > > > +    |  |                             |   |
> > > > +    |  +-----------------------------+   |
> > > > +    +------------------------------------+
> > > > +
> > > > +Another important point. The page that the reader page points back to
> > > > +by its previous pointer (the one that now points to the new head page)
> > > > +never points back to the reader page. That is because the reader page is
> > > > +not part of the ring buffer. Traversing the ring buffer via the next pointers
> > > > +will always stay in the ring buffer. Traversing the ring buffer via the
> > > > +prev pointers may not.
> > > > +
> > > > +Note, the way to determine a reader page is simply by examining the previous
> > > > +pointer of the page. If the next pointer of the previous page does not
> > > > +point back to the original page, then the original page is a reader page:
> > > > +
> > > > +                                  
> > > > +             +--------+
> > > > +             | reader |  next   +----+
> > > > +             |  page  |-------->|    |<====== (buffer page)
> > > > +             +--------+         +----+
> > > > +                 |                | ^
> > > > +                 |                v | next
> > > > +            prev |              +----+
> > > > +                 +------------->|    |
> > > > +                                +----+
> > > > +
> > > > +The way the head page moves forward:
> > > > +
> > > > +When the tail page meets the head page and the buffer is in overwrite mode
> > > > +and more writes take place, the head page must be moved forward before the
> > > > +writer may move the tail page. The way this is done is that the writer
> > > > +performs a cmpxchg to convert the pointer to the head page from the HEADER
> > > > +flag to have the UPDATE flag set. Once this is done, the reader will
> > > > +not be able to swap the head page from the buffer, nor will it be able to
> > > > +move the head page, until the writer is finished with the move.
> > > > +
> > > > +This eliminates any races that the reader can have on the writer. The reader
> > > > +must spin, and this is why the reader can not preempt the writer.
> > > > +
> > > > +            tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-H->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +            tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-U->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +The following page will be made into the new head page.
> > > > +
> > > > +           tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-U->|   |-H->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +After the new head page has been set, we can set the old head page
> > > > +pointer back to NORMAL.
> > > > +
> > > > +           tail page
> > > > +               |
> > > > +               v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |-H->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +After the head page has been moved, the tail page may now move forward.
> > > > +
> > > > +                    tail page
> > > > +                        |
> > > > +                        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |--->|   |-H->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +
> > > > +
> > > > +The above are the trivial updates. Now for the more complex scenarios.
> > > > +
> > > > +
> > > > +As stated before, if enough writes preempt the first write, the
> > > > +tail page may make it all the way around the buffer and meet the commit
> > > > +page. At this time, we must start dropping writes (usually with some kind
> > > > +of warning to the user). But what happens if the commit was still on the
> > > > +reader page? The commit page is not part of the ring buffer. The tail page
> > > > +must account for this.
> > > > +
> > > > +
> > > > +          reader page    commit page
> > > > +              |              |
> > > > +              v              |
> > > > +             +---+           |
> > > > +             |   |<----------+
> > > > +             |   |
> > > > +             |   |------+
> > > > +             +---+      |
> > > > +                        |
> > > > +                        v
> > > > +    +---+    +---+    +---+    +---+
> > > > +<---|   |--->|   |-H->|   |--->|   |--->
> > > > +--->|   |<---|   |<---|   |<---|   |<---
> > > > +    +---+    +---+    +---+    +---+
> > > > +               ^
> > > > +               |
> > > > +           tail page
> > > > +
> > > > +If the tail page were to simply push the head page forward, the commit when
> > > > +leaving the reader page would not be pointing to the correct page.
> > > > +
> > > > +The solution to this is to test if the commit page is on the reader page
> > > > +before pushing the head page. If it is, then it can be assumed that the
> > > > +tail page wrapped the buffer, and we must drop new writes.
> > > > +
> > > > +This is not a race condition, because the commit page can only be moved
> > > > +by the outter most writer (the writer that was preempted).
> > > > +This means that the commit will not move while a writer is moving the
> > > > +tail page. The reader can not swap the reader page if it is also being
> > > > +used as the commit page. The reader can simply check that the commit
> > > > +is off the reader page. Once the commit page leaves the reader page
> > > > +it will never go back on it unless a reader does another swap with the
> > > > +buffer page that is also the commit page.
> > > > +
> > > > +
> > > > +Nested writes
> > > > +-------------
> > > > +
> > > > +In the pushing forward of the tail page we must first push forward
> > > > +the head page if the head page is the next page. If the head page
> > > > +is not the next page, the tail page is simply updated with a cmpxchg.
> > > > +
> > > > +Only writers move the tail page. This must be done atomically to protect
> > > > +against nested writers.
> > > > +
> > > > +  temp_page = tail_page
> > > > +  next_page = temp_page->next
> > > > +  cmpxchg(tail_page, temp_page, next_page)
> > > > +
> > > 
> > > OK, I'll bite :
> > > 
> > > What happens if :
> > > 
> > > - a writer is at the end of a page, 
> > > - needs to push the tail_page pointer
> > > - reads tail_page
> > >   - interrupted.
> > >   - nested writers come, successfully updates the tail_page, write
> > >     enough data to fill the whole buffer
> > >   - concurrently (on another CPU), a reader is consuming all the data
> > >   - This brings the tail_page pointer back to its original value
> > >   - iret
> > > - here, the writer will successfully perform the tail_page cmpxchg,
> > >   because the value match. However, the page currently being written to
> > >   could be only partially reserved; the writer will not re-check if the
> > >   page is really full.
> > > 
> > > That's actually one of my main concerns with an approach where two
> > > separate "pointers" are used to keep track of reserved space within a
> > > buffer.
> > 
> > Actually, the two pointers is exactly what prevents the above scenario.
> > 
> > We have a commit page pointer and a commit index pointer. The commit page
> > points to the page that holds the last true commit. A read can never go 
> > pass the commit. That is, it can not read reserved but uncommited data.
> > 
> 
> OK, I see how the commit counter can ensure the reader will not read
> reserved-by-yet-uncommitted data.
> 
> 
> > Another key point is that if the tail page meets the commit page, it will 
> > not move it and drop the data. If your buffer is not big enough to hold 
> > all data in a interrupt, then your buffer is too small. We count these in 
> > the "commit_overrun" counter as well as return NULL on the reserve so the 
> > tracer will know that it had its data dropped.
> > 
> 
> Ah ok, so the following writer-only scenario :
> 
> - a writer is at the end of a page, 
> - needs to push the tail_page pointer
> - reads tail_page
>   - interrupted.
>   - nested writers come, successfully updates the tail_page, write
>     enough data to fill the whole buffer
>   - Brings the tail_page pointer back to its original value <----------
> 
> Cannot happen, because it would meet the commit page.
> 
> That's a bit weird to drop events in overwrite mode. Normally, one would
> expect that mode to just permit to write any amount of events, from any
> execution context, be it nested or not, overwriting the oldest events.
> 

Hrm, about this specific point, now that I think about it again, LTTng
does something similar if nested writes happen to cause a buffer wrap
around and meet a non fully committed subbuffer. It will also drop
events in overwrite mode.

Mathieu


> > 
> > > 
> > > The same concern applies to moving the head page when concurrent writers
> > > are nesting.
> > > 
> > > More generally, I'm also concerned about the lack of memory barriers
> > > around some non-cmpxchg tail/head page set operations in your code, and
> > > the lack of proper rcu_dereference-style primitive for list iteration.
> > 
> > The cmpxchg is a memory barrier. Writes only contend with other writes on 
> > the same CPU. When we update the pointer from HEAD to UPDATE a reader will
> > not pass that point. The update is done via cmpxchg and thus is a memory 
> > barrier. Now if cmpxchg is not a memory barrier, then I need to add 
> > smp_mb() by all of them.
> > 
> 
> Documentation/memory-barriers.txt states that cmpxchg must behave as if
> it had smp_mb() before and after.
> 
> > > 
> > > For those I've added to CC, I'm referring to the patch at :
> > > 
> > > http://patchwork.kernel.org/patch/29395/
> > > 
> > > The great news to me is that no one can say LTTng's lockless buffering
> > > algorithm is complex compared to this. ;)
> > 
> > But can you read without worries about writers? The nice thing about this 
> > approach, which a lot of ftrace depends on, is that I don't need call 
> > backs or copies to check if what I read from the buffer was not stomped 
> > on by a writer. There is a zero copy overhead for readers (when the buffer 
> > is more than a page filled) and when a reader has its data, it belongs to 
> > that reader.
> > 
> 
> Hrm, now that you bring this question on the table again (I remember we
> discussed about it at the collaboration summit), and now that there is
> no alcohol involved, I see a way to do it. Here is the idea :
> 
> I assume you remember a bit how the global per-buffer write_offset and
> read_offset counters work in LTTng : writer head and reader head are
> positions in what we can think of as a virtually contiguous buffer
> address space for the whole buffer.
> 
> First, let's talk in terms of get_subbuf() (reader get a subbuffer
> exclusive access for reading) and put_subbuf() (reader releases its
> exclusive subbuffer access). get_subbuf() returns the current
> read_offset, and put_subbuf() takes this read_offset (the one returned
> by get_subbuf()), as parameter.
> 
> Let's say that we let get_subbuf() use cmpxchg to write a flag in the
> read_offset counter to tell the writer it's actively reading, e.g.
> OFFSET_READING_FLAG = 0x1. It's reading the read_offset at the same
> time because the update is done with a cmpxchg.
> 
> Now for the writer : in overwrite mode, if the write_offset comes to a
> point where it would have to push the reader position (read offset) in
> order to be able to reserve space *and* the reader is actively reading
> data from the buffer (we know it because OFFSET_READING_FLAG is set),
> the writer could set a flag in the read_offset LSBs
> (OFFSET_PUSH_FLAG = 0x2). The writer would simply skip over this
> specific subbuffer, and that's it : it can continue to write in the
> buffer after the subbuffer owned by the reader without problem. If it
> loops a few times over the buffer while the reader is still stucked
> there (think of a slow serial port), it would simply skip over the
> subbuffer owned by the reader.
> 
> Now, when the reader eventually releases its current subbuffer
> (put_subbuf()), it would detect that the read_offset is different than
> the one returned by get_subbuf() because the OFFSET_PUSH_FLAG would have
> been set. This will inform the reader that it must push its own
> read_offset position to the subbuffer following the current
> write_offset position. That's it, we're back on track : the next reader
> will read the oldest available subbuffer.
> 
> I took special care in the design above to make sure the case where
> tracing starts still works. In this case, where the buffer is only
> partially filled, the reader head is not in the subbuffer following the
> writer head, because it points to uninitialized data. But the
> OFFSET_PUSH_FLAG can only ever be set when a writer has at least
> completely filled the buffer once (and meets the read_offset), so we can
> consider that it's safe for put_subbuf() to move right after the write
> offset subbuffer.
> 
> I must admit that the flag idea is a bit inspired from your lockless
> algo I am currently reviewing. :)
> 
> Does it make sense ?
> 
> Mathieu
> 
> Note : I won't be available for implementation work before July 6th...
> got a thesis to write...
> 
> 
> > -- Steve
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/