lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 11 Jun 2009 12:15:21 +0900
From:	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
To:	Steven Rostedt <rostedt@...dmis.org>
CC:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Theodore Tso <tytso@....edu>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Mathieu Desnoyers <compudj@...stal.dyndns.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	"Martin J. Bligh" <mbligh@...igh.org>,
	Christoph Hellwig <hch@...radead.org>,
	Li Zefan <lizf@...fujitsu.com>,
	Huang Ying <ying.huang@...el.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Masami Hiramatsu <mhiramat@...hat.com>
Subject: Re: [PATCH 3/3] ring-buffer: add design document

Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@...hat.com>
> 
> This adds the design document for the ring buffer and also
> explains how it is designed to have lockless writes.
> 
> Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> ---
>  Documentation/trace/ring-buffer-design.txt |  949 ++++++++++++++++++++++++++++
>  1 files changed, 949 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/trace/ring-buffer-design.txt
> 
> diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
> new file mode 100644
> index 0000000..cca290b
> --- /dev/null
> +++ b/Documentation/trace/ring-buffer-design.txt
> @@ -0,0 +1,949 @@
> +		Lockless Ring Buffer Design
> +		===========================
> +
> +Copyright 2009 Red Hat Inc.
> +   Author:   Steven Rostedt <srostedt@...hat.com>
> +  License:   The GNU Free Documentation License, Version 1.2
> +               (dual licensed under the GPL v2)
> +
> +Written for: 2.6.31
> +
> +Terminology used in this Document
> +---------------------------------
> +
> +tail - where new writes happen in the ring buffer.
> +
> +head - where new reads happen in the ring buffer.
> +
> +producer - the task that writes into the ring buffer (same as writer)
> +
> +writer - same as producer
> +
> +consumer - the task that reads from the buffer (same as reader)
> +
> +reader - same as consumer.
> +
> +reader_page - A page outside the ring buffer used solely (for the most part)
> +    by the reader.
> +
> +head_page - a pointer to the page that the reader will use next
> +
> +tail_page - a pointer to the page that will be written to next
> +
> +commit_page - a pointer to the page with the last finished non nested write.
> +
> +cmpxchg - hardware assisted atomic transaction that performs the following:
> +
> +   A = B iff previous A == C
> +
> +   R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
> +      current A is equal to C, and we put the old (current) A into R
> +
> +   R gets the previous A regardless if A is updated with B or not.
> +
> +   To see if the update was successful a compare of R == C may be used.
> +
> +The Generic Ring Buffer
> +-----------------------
> +
> +The ring buffer can be used in either an overwrite mode or in
> +producer/consumer mode.
> +
> +Producer/consumer mode is where the producer were to fill up the
> +buffer before the consumer could free up anything, the producer
> +will stop writing to the buffer. This will lose most recent events.
> +
> +Overwrite mode is where the produce were to fill up the buffer
> +before the consumer could free up anything, the producer will
> +overwrite the older data. This will lose the oldest events.
> +
> +No two writers can write at the same time (on the same per cpu buffer),
> +but a writer may preempt another writer, but it must finish writing
> +before the previous writer may continue. This is very important to the
> +algorithm. The writers act like a "stack".
> +
> +
> +  writer1 start
> +     <preempted> writer2 start
> +         <preempted> writer3 start
> +                     writer3 finishes
> +                 writer2 finishes
> +  writer1 finishes
> +
> +This is very much like a writer being preempted by an interrupt and
> +the interrupt doing a write as well.
> +
> +Readers can happen at any time. But no two readers may run at the
> +same time, nor can a reader preempt another reader. A reader can not preempt
> +a writer, but it may read/consume from the buffer at the same time as
> +a writer is writing, but the reader must be on another processor.
> +
> +A writer can preempt a reader, but a reader can not preempt a writer.
> +But a reader can read the buffer at the same time (on another processor)
> +as a writer.
> +
> +The ring buffer is made up of a list of pages held together by a link list.
> +
> +At initialization a reader page is allocated for the reader that is not
> +part of the ring buffer.
> +
> +The head_page, tail_page and commit_page are all initialized to point
> +to the same page.
> +
> +The reader page is initialized to have its next pointer pointing to
> +the head page, and its previous pointer pointing to a page before
> +the head page.
> +
> +The reader has its own page to use. At start up time, this page is
> +allocated but is not attached to the list. When the reader wants
> +to read from the buffer, if its page is empty (like it is on start up)
> +it will swap its page with the head_page. The old reader page will
> +become part of the ring buffer and the head_page will be removed.
> +A new head page goes to the page after the old head page (but not
> +the page that was swapped in).
> +
> +Once the new page is given to the reader, the reader could do what
> +it wants with it, as long as a writer has left that page.
> +
> +A sample of how the reader page is swapped: Note this does not
> +show the head page in the buffer, it is for demonstrating a swap
> +only.
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |
> +  +------+
> +                  +---+   +---+   +---+
> +                  |   |-->|   |-->|   |
> +                  |   |<--|   |<--|   |
> +                  +---+   +---+   +---+
> +                   ^ |             ^ |
> +                   | +-------------+ |
> +                   +-----------------+
> +
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |-------------------+
> +  +------+                   v
> +    |             +---+   +---+   +---+
> +    |             |   |-->|   |-->|   |
> +    |             |   |<--|   |<--|   |<-+
> +    |             +---+   +---+   +---+  |
> +    |              ^ |             ^ |   |
> +    |              | +-------------+ |   |
> +    |              +-----------------+   |
> +    +------------------------------------+
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |-------------------+
> +  +------+ <---------------+ v
> +    |  ^          +---+   +---+   +---+
> +    |  |          |   |-->|   |-->|   |
> +    |  |          |   |<--|   |<--|   |<-+
> +    |  |          +---+   +---+   +---+  |
> +    |  |             |             ^ |   |
> +    |  |             +-------------+ |   |
> +    |  +-----------------------------+   |
> +    +------------------------------------+

It seems the middle of three pages have 2 prev arrows... ?


> +
> +  +------+
> +  |buffer|          RING BUFFER
> +  |page  |-------------------+
> +  +------+ <---------------+ v
> +    |  ^          +---+   +---+   +---+
> +    |  |          |   |   |   |-->|   |
> +    |  |  New     |   |   |   |<--|   |<-+
> +    |  | Reader   +---+   +---+   +---+  |
> +    |  |  page ----^                 |   |
> +    |  |                             |   |
> +    |  +-----------------------------+   |
> +    +------------------------------------+
> +
> +
> +
> +It is possible that the page swapped is the commit page and the tail page,
> +if what is in the ring buffer is less than what is held in a buffer page.
> +
> +
> +          reader page    commit page   tail page
> +              |              |             |
> +              v              |             |
> +             +---+           |             |
> +             |   |<----------+             |
> +             |   |<------------------------+
> +             |   |------+
> +             +---+      |
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +This case is still legal for this algorithm.
> +When the writer leaves the page, it simply goes into the ring buffer
> +since the reader page still points to the next location in the ring
> +buffer.
> +
> +
> +The main pointers:
> +
> +  reader page - The page used solely by the reader and is not part
> +                of the ring buffer (may be swapped in)
> +
> +  head page - the next page in the ring buffer that will be swapped
> +              with the reader page.
> +
> +  tail page - the page where the next write will take place.
> +
> +  commit page - the page that last finished a write.
> +
> +The commit page only is updated by the outer most writer in the
> +writer stack. A writer that preempts another writer will not move the
> +commit page.
> +
> +When data is written into the ring buffer, a position is reserved
> +in the ring buffer and passed back to the writer. When the writer
> +is finished writing data into that position, it commits the write.
> +
> +Another write (or a read) may take place at anytime during this
> +transaction. If another write happens it must finish before continuing
> +with the previous write.
> +
> +
> +   Write reserve:
> +
> +       Buffer page
> +      +---------+
> +      |written  |
> +      +---------+  <--- given back to writer (current commit)
> +      |reserved |
> +      +---------+ <--- tail pointer
> +      | empty   |
> +      +---------+
> +
> +   Write commit:
> +
> +       Buffer page
> +      +---------+
> +      |written  |
> +      +---------+
> +      |written  |
> +      +---------+  <--- next positon for write (current commit)
> +      | empty   |
> +      +---------+
> +
> +
> + If a write happens after the first reserve:
> +
> +       Buffer page
> +      +---------+
> +      |written  |
> +      +---------+  <-- current commit
> +      |reserved |
> +      +---------+  <--- given back to second writer
> +      |reserved |
> +      +---------+ <--- tail pointer
> +
> +  After second writer commits:
> +
> +
> +       Buffer page
> +      +---------+
> +      |written  |
> +      +---------+  <--(last full commit)
> +      |reserved |
> +      +---------+
> +      |pending  |
> +      |commit   |
> +      +---------+ <--- tail pointer
> +
> +  When the first writer commits:
> +
> +       Buffer page
> +      +---------+
> +      |written  |
> +      +---------+
> +      |written  |
> +      +---------+
> +      |written  |
> +      +---------+  <--(last full commit and tail pointer)
> +
> +
> +The commit pointer points to the last write location that was
> +committed without preempting another write. When a write that
> +preempted another write is committed, it only becomes a pending commit
> +and will not be a full commit till all writes have been committed.
> +
> +The commit page points to the page that has the last full commit.
> +The tail page points to the page with the last write (before
> +committing).
> +
> +The tail page is always equal to or after the commit page. It may
> +be several pages ahead. If the tail page catches up to the commit
> +page then no more writes may take place (regardless of the mode
> +of the ring buffer: overwrite and produce/consumer).
> +
> +The order of pages are:
> +
> + head page
> + commit page
> + tail page
> +
> +Possible scenario:
> +                             tail page
> +  head page         commit page  |
> +      |                 |        |
> +      v                 v        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +There is a special case that the head page is after either the commit page
> +and possibly the tail page. That is when the commit (and tail) page has been
> +swapped with the reader page. This is because the head page is always
> +part of the ring buffer, but the reader page is not. When ever there
> +has been less than a full page that has been committed inside the ring buffer,
> +and a reader swaps out a page, it will be swapping out the commit page.
> +
> +
> +          reader page    commit page   tail page
> +              |              |             |
> +              v              |             |
> +             +---+           |             |
> +             |   |<----------+             |
> +             |   |<------------------------+
> +             |   |------+
> +             +---+      |
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +                        ^
> +                        |
> +                    head page
> +
> +
> +In this case, the head page will not move when the tail and commit
> +move back into the ring buffer.
> +
> +The reader can not swap a page into the ring buffer if the commit page
> +is still on that page. If the read meets the last commit (real commit
> +not pending or reserved), then there is nothing more to read.
> +The buffer is considered empty until another full commit finishes.
> +
> +When the tail meets the head page, if the buffer is in overwrite mode,
> +the head page will be pushed ahead one. If the buffer is in producer/consumer
> +mode, the write will fail.
> +
> +Overwrite mode:
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +                        ^
> +                        |
> +                    head page
> +
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +                                 ^
> +                                 |
> +                             head page
> +
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +                                 ^
> +                                 |
> +                             head page
> +
> +Note, the reader page will still point to the previous head page.
> +But when a swap takes place, it will use the most recent head page.
> +
> +
> +Making the Ring Buffer Lockless:
> +--------------------------------
> +
> +The main idea behind the lockless algorithm is to combine the moving
> +of the head_page pointer with the swapping of pages with the reader.
> +State flags are placed inside the pointer to the page. To do this,
> +each page must be aligned in memory by 4 bytes. This will allow the 2
> +least significant bits of the address to be used as flags. Since
> +they will always be zero for the address. To get the address,
> +simply mask out the flags.
> +
> +  MASK = ~3
> +
> +  address & MASK
> +
> +Two flags will be kept by these two bits:
> +
> +   HEADER - the page being pointed to is a head page
> +
> +   UPDATE - the page being pointed to is being updated by a writer
> +          and was or is about to be a head page.
> +
> +
> +          reader page
> +              |
> +              v
> +             +---+
> +             |   |------+
> +             +---+      |
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-H->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +
> +The above pointer "-H->" would have the HEADER flag set. That is
> +the next page is the next page to be swapped out by the reader.
> +This pointer means the next page is the head page.
> +
> +When the tail page meets the head pointer, it will use cmpxchg to
> +change the pointer to the UPDATE state:
> +
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-H->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +"-U->" represents a pointer in the UPDATE state.
> +
> +Any access to the reader will need to take some sort of lock to serialize
> +the readers. But the writers will never take a lock to write to the
> +ring buffer. This means we only need to worry about a single reader,
> +and writes only preempt in "stack" formation.
> +
> +When the reader tries to swap the page with the ring buffer, it
> +will also use cmpxchg. If the flag bit in the pointer to the
> +head page does not have the HEADER flag set, the compare will fail
> +and the reader will need to look for the new head page and try again.
> +Note, the flag UPDATE and HEADER are never set at the same time.
> +
> +The reader swaps the reader page as follows:
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |
> +  +------+
> +                  +---+    +---+    +---+
> +                  |   |--->|   |--->|   |
> +                  |   |<---|   |<---|   |
> +                  +---+    +---+    +---+
> +                   ^ |               ^ |
> +                   | +---------------+ |
> +                   +-----H-------------+
> +
> +The reader sets the reader page next pointer as HEADER to the page after
> +the head page.
> +
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |-------H-----------+
> +  +------+                   v
> +    |             +---+    +---+    +---+
> +    |             |   |--->|   |--->|   |
> +    |             |   |<---|   |<---|   |<-+
> +    |             +---+    +---+    +---+  |
> +    |              ^ |               ^ |   |
> +    |              | +---------------+ |   |
> +    |              +-----H-------------+   |
> +    +--------------------------------------+
> +
> +It does a cmpxchg with the pointer to the previous head page to make it
> +point to the reader page. Note that the new pointer does not have the HEADER
> +flag set.  This action atomically moves the head page forward.
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |-------H-----------+
> +  +------+ <---------------+ v
> +    |  ^          +---+   +---+   +---+
> +    |  |          |   |-->|   |-->|   |
> +    |  |          |   |<--|   |<--|   |<-+
> +    |  |          +---+   +---+   +---+  |
> +    |  |             |             ^ |   |
> +    |  |             +-------------+ |   |
> +    |  +-----------------------------+   |
> +    +------------------------------------+
> +

Ditto.

Thanks,
H.Seto


> +After the new head page is set, the previous pointer of the head page is
> +updated to the reader page.
> +
> +  +------+
> +  |reader|          RING BUFFER
> +  |page  |-------H-----------+
> +  +------+                   v
> +    |  ^          +---+   +---+   +---+
> +    |  |          |   |-->|   |-->|   |
> +    |  |          |   |<--|   |<--|   |<-+
> +    |  |          +---+   +---+   +---+  |
> +    |  |             |             ^ |   |
> +    |  |             +-------------+ |   |
> +    |  +-----------------------------+   |
> +    +------------------------------------+
> +
> +  +------+
> +  |buffer|          RING BUFFER
> +  |page  |-------H-----------+  <--- New head page
> +  +------+ <---------------+ v
> +    |  ^          +---+   +---+   +---+
> +    |  |          |   |   |   |-->|   |
> +    |  |  New     |   |   |   |<--|   |<-+
> +    |  | Reader   +---+   +---+   +---+  |
> +    |  |  page ----^                 |   |
> +    |  |                             |   |
> +    |  +-----------------------------+   |
> +    +------------------------------------+
> +
> +Another important point. The page that the reader page points back to
> +by its previous pointer (the one that now points to the new head page)
> +never points back to the reader page. That is because the reader page is
> +not part of the ring buffer. Traversing the ring buffer via the next pointers
> +will always stay in the ring buffer. Traversing the ring buffer via the
> +prev pointers may not.
> +
> +Note, the way to determine a reader page is simply by examining the previous
> +pointer of the page. If the next pointer of the previous page does not
> +point back to the original page, then the original page is a reader page:
> +
> +                                  
> +             +--------+
> +             | reader |  next   +----+
> +             |  page  |-------->|    |<====== (buffer page)
> +             +--------+         +----+
> +                 |                | ^
> +                 |                v | next
> +            prev |              +----+
> +                 +------------->|    |
> +                                +----+
> +
> +The way the head page moves forward:
> +
> +When the tail page meets the head page and the buffer is in overwrite mode
> +and more writes take place, the head page must be moved forward before the
> +writer may move the tail page. The way this is done is that the writer
> +performs a cmpxchg to convert the pointer to the head page from the HEADER
> +flag to have the UPDATE flag set. Once this is done, the reader will
> +not be able to swap the head page from the buffer, nor will it be able to
> +move the head page, until the writer is finished with the move.
> +
> +This eliminates any races that the reader can have on the writer. The reader
> +must spin, and this is why the reader can not preempt the writer.
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-H->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The following page will be made into the new head page.
> +
> +           tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +After the new head page has been set, we can set the old head page
> +pointer back to NORMAL.
> +
> +           tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +After the head page has been moved, the tail page may now move forward.
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +
> +The above are the trivial updates. Now for the more complex scenarios.
> +
> +
> +As stated before, if enough writes preempt the first write, the
> +tail page may make it all the way around the buffer and meet the commit
> +page. At this time, we must start dropping writes (usually with some kind
> +of warning to the user). But what happens if the commit was still on the
> +reader page? The commit page is not part of the ring buffer. The tail page
> +must account for this.
> +
> +
> +          reader page    commit page
> +              |              |
> +              v              |
> +             +---+           |
> +             |   |<----------+
> +             |   |
> +             |   |------+
> +             +---+      |
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-H->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +               ^
> +               |
> +           tail page
> +
> +If the tail page were to simply push the head page forward, the commit when
> +leaving the reader page would not be pointing to the correct page.
> +
> +The solution to this is to test if the commit page is on the reader page
> +before pushing the head page. If it is, then it can be assumed that the
> +tail page wrapped the buffer, and we must drop new writes.
> +
> +This is not a race condition, because the commit page can only be moved
> +by the outter most writer (the writer that was preempted).
> +This means that the commit will not move while a writer is moving the
> +tail page. The reader can not swap the reader page if it is also being
> +used as the commit page. The reader can simply check that the commit
> +is off the reader page. Once the commit page leaves the reader page
> +it will never go back on it unless a reader does another swap with the
> +buffer page that is also the commit page.
> +
> +
> +Nested writes
> +-------------
> +
> +In the pushing forward of the tail page we must first push forward
> +the head page if the head page is the next page. If the head page
> +is not the next page, the tail page is simply updated with a cmpxchg.
> +
> +Only writers move the tail page. This must be done atomically to protect
> +against nested writers.
> +
> +  temp_page = tail_page
> +  next_page = temp_page->next
> +  cmpxchg(tail_page, temp_page, next_page)
> +
> +The above will update the tail page if it is still pointing to the expected
> +page. If this fails, a nested write pushed it forward, the the current write
> +does not need to push it.
> +
> +
> +           temp page
> +               |
> +               v
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +Nested write comes in and moves the tail page forward:
> +
> +                    tail page (moved by nested writer)
> +            temp page   |
> +               |        |
> +               v        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The above would fail the cmpxchg, but since the tail page has already
> +been moved forward, the writer will just try again to reserve storage
> +on the new tail page.
> +
> +But the moving of the head page is a bit more complex.
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-H->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The write converts the head page pointer to UPDATE.
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +But if a nested writer preempts here. It will see that the next
> +page is a head page, but it is also nested. It will detect that
> +it is nested and will save that information. The detection is the
> +fact that it sees the UPDATE flag instead of a HEADER or NORMAL
> +pointer.
> +
> +The nested writer will set the new head page pointer.
> +
> +           tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +But it will not reset the update back to normal. Only the writer
> +that converted a pointer from HEAD to UPDATE will convert it back
> +to NORMAL.
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +After the nested writer finishes, the outer most writer will convert
> +the UPDATE pointer to NORMAL.
> +
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +
> +It can be even more complex if several nested writes came in and moved
> +the tail page ahead several pages:
> +
> +
> +(first writer)
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-H->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The write converts the head page pointer to UPDATE.
> +
> +            tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +Next writer comes in, and sees the update and sets up the new
> +head page.
> +
> +(second writer)
> +
> +           tail page
> +               |
> +               v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The nested writer moves the tail page forward. But does not set the old
> +update page to NORMAL because it is not the outer most writer.
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +Another writer preempts and sees the page after the tail page is a head page.
> +It changes it from HEAD to UPDATE.
> +
> +(third writer)
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-U->|   |--->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The writer will move the head page forward:
> +
> +
> +(third writer)
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-U->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +But now that the third writer did change the HEAD flag to UPDATE it
> +will convert it to normal:
> +
> +
> +(third writer)
> +
> +                    tail page
> +                        |
> +                        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +
> +Then it will move the tail page, and return back to the second writer.
> +
> +
> +(second writer)
> +
> +                             tail page
> +                                 |
> +                                 v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +
> +The second writer will fail to move the tail page because it was already
> +moved, so it will try again and add its data to the new tail page.
> +It will return to the first writer.
> +
> +
> +(first writer)
> +
> +                             tail page
> +                                 |
> +                                 v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +The first writer can not know atomically test if the tail page moved
> +while it updates the HEAD page. It will then update the head page to
> +what it thinks is the new head page.
> +
> +
> +(first writer)
> +
> +                             tail page
> +                                 |
> +                                 v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +Since the cmpxchg returns the old value of the pointer the first writer
> +will see it succeeded in updating the pointer from NORMAL to HEAD.
> +But as we can see, this is not good enough. It must also check to see
> +if the tail page is either where it use to be or on the next page:
> +
> +
> +(first writer)
> +
> +               A        B    tail page
> +               |        |        |
> +               v        v        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |-H->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +If tail page != A and tail page does not equal B, then it must reset the
> +pointer back to NORMAL. The fact that it only needs to worry about
> +nested writers, it only needs to check this after setting the HEAD page.
> +
> +
> +(first writer)
> +
> +               A        B    tail page
> +               |        |        |
> +               v        v        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |-U->|   |--->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +
> +Now the writer can update the head page. This is also why the head page must
> +remain in UPDATE and only reset by the outer most writer. This prevents
> +the reader from seeing the incorrect head page.
> +
> +
> +(first writer)
> +
> +               A        B    tail page
> +               |        |        |
> +               v        v        v
> +    +---+    +---+    +---+    +---+
> +<---|   |--->|   |--->|   |--->|   |-H->
> +--->|   |<---|   |<---|   |<---|   |<---
> +    +---+    +---+    +---+    +---+
> +

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ