[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4A3076C9.2030203@jp.fujitsu.com>
Date: Thu, 11 Jun 2009 12:15:21 +0900
From: Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
To: Steven Rostedt <rostedt@...dmis.org>
CC: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Theodore Tso <tytso@....edu>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Mathieu Desnoyers <compudj@...stal.dyndns.org>,
Lai Jiangshan <laijs@...fujitsu.com>,
"Martin J. Bligh" <mbligh@...igh.org>,
Christoph Hellwig <hch@...radead.org>,
Li Zefan <lizf@...fujitsu.com>,
Huang Ying <ying.huang@...el.com>,
"H. Peter Anvin" <hpa@...or.com>,
Masami Hiramatsu <mhiramat@...hat.com>
Subject: Re: [PATCH 3/3] ring-buffer: add design document
Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@...hat.com>
>
> This adds the design document for the ring buffer and also
> explains how it is designed to have lockless writes.
>
> Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> ---
> Documentation/trace/ring-buffer-design.txt | 949 ++++++++++++++++++++++++++++
> 1 files changed, 949 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/trace/ring-buffer-design.txt
>
> diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
> new file mode 100644
> index 0000000..cca290b
> --- /dev/null
> +++ b/Documentation/trace/ring-buffer-design.txt
> @@ -0,0 +1,949 @@
> + Lockless Ring Buffer Design
> + ===========================
> +
> +Copyright 2009 Red Hat Inc.
> + Author: Steven Rostedt <srostedt@...hat.com>
> + License: The GNU Free Documentation License, Version 1.2
> + (dual licensed under the GPL v2)
> +
> +Written for: 2.6.31
> +
> +Terminology used in this Document
> +---------------------------------
> +
> +tail - where new writes happen in the ring buffer.
> +
> +head - where new reads happen in the ring buffer.
> +
> +producer - the task that writes into the ring buffer (same as writer)
> +
> +writer - same as producer
> +
> +consumer - the task that reads from the buffer (same as reader)
> +
> +reader - same as consumer.
> +
> +reader_page - A page outside the ring buffer used solely (for the most part)
> + by the reader.
> +
> +head_page - a pointer to the page that the reader will use next
> +
> +tail_page - a pointer to the page that will be written to next
> +
> +commit_page - a pointer to the page with the last finished non nested write.
> +
> +cmpxchg - hardware assisted atomic transaction that performs the following:
> +
> + A = B iff previous A == C
> +
> + R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
> + current A is equal to C, and we put the old (current) A into R
> +
> + R gets the previous A regardless if A is updated with B or not.
> +
> + To see if the update was successful a compare of R == C may be used.
> +
> +The Generic Ring Buffer
> +-----------------------
> +
> +The ring buffer can be used in either an overwrite mode or in
> +producer/consumer mode.
> +
> +Producer/consumer mode is where the producer were to fill up the
> +buffer before the consumer could free up anything, the producer
> +will stop writing to the buffer. This will lose most recent events.
> +
> +Overwrite mode is where the produce were to fill up the buffer
> +before the consumer could free up anything, the producer will
> +overwrite the older data. This will lose the oldest events.
> +
> +No two writers can write at the same time (on the same per cpu buffer),
> +but a writer may preempt another writer, but it must finish writing
> +before the previous writer may continue. This is very important to the
> +algorithm. The writers act like a "stack".
> +
> +
> + writer1 start
> + <preempted> writer2 start
> + <preempted> writer3 start
> + writer3 finishes
> + writer2 finishes
> + writer1 finishes
> +
> +This is very much like a writer being preempted by an interrupt and
> +the interrupt doing a write as well.
> +
> +Readers can happen at any time. But no two readers may run at the
> +same time, nor can a reader preempt another reader. A reader can not preempt
> +a writer, but it may read/consume from the buffer at the same time as
> +a writer is writing, but the reader must be on another processor.
> +
> +A writer can preempt a reader, but a reader can not preempt a writer.
> +But a reader can read the buffer at the same time (on another processor)
> +as a writer.
> +
> +The ring buffer is made up of a list of pages held together by a link list.
> +
> +At initialization a reader page is allocated for the reader that is not
> +part of the ring buffer.
> +
> +The head_page, tail_page and commit_page are all initialized to point
> +to the same page.
> +
> +The reader page is initialized to have its next pointer pointing to
> +the head page, and its previous pointer pointing to a page before
> +the head page.
> +
> +The reader has its own page to use. At start up time, this page is
> +allocated but is not attached to the list. When the reader wants
> +to read from the buffer, if its page is empty (like it is on start up)
> +it will swap its page with the head_page. The old reader page will
> +become part of the ring buffer and the head_page will be removed.
> +A new head page goes to the page after the old head page (but not
> +the page that was swapped in).
> +
> +Once the new page is given to the reader, the reader could do what
> +it wants with it, as long as a writer has left that page.
> +
> +A sample of how the reader page is swapped: Note this does not
> +show the head page in the buffer, it is for demonstrating a swap
> +only.
> +
> + +------+
> + |reader| RING BUFFER
> + |page |
> + +------+
> + +---+ +---+ +---+
> + | |-->| |-->| |
> + | |<--| |<--| |
> + +---+ +---+ +---+
> + ^ | ^ |
> + | +-------------+ |
> + +-----------------+
> +
> +
> + +------+
> + |reader| RING BUFFER
> + |page |-------------------+
> + +------+ v
> + | +---+ +---+ +---+
> + | | |-->| |-->| |
> + | | |<--| |<--| |<-+
> + | +---+ +---+ +---+ |
> + | ^ | ^ | |
> + | | +-------------+ | |
> + | +-----------------+ |
> + +------------------------------------+
> +
> + +------+
> + |reader| RING BUFFER
> + |page |-------------------+
> + +------+ <---------------+ v
> + | ^ +---+ +---+ +---+
> + | | | |-->| |-->| |
> + | | | |<--| |<--| |<-+
> + | | +---+ +---+ +---+ |
> + | | | ^ | |
> + | | +-------------+ | |
> + | +-----------------------------+ |
> + +------------------------------------+
It seems the middle of three pages have 2 prev arrows... ?
> +
> + +------+
> + |buffer| RING BUFFER
> + |page |-------------------+
> + +------+ <---------------+ v
> + | ^ +---+ +---+ +---+
> + | | | | | |-->| |
> + | | New | | | |<--| |<-+
> + | | Reader +---+ +---+ +---+ |
> + | | page ----^ | |
> + | | | |
> + | +-----------------------------+ |
> + +------------------------------------+
> +
> +
> +
> +It is possible that the page swapped is the commit page and the tail page,
> +if what is in the ring buffer is less than what is held in a buffer page.
> +
> +
> + reader page commit page tail page
> + | | |
> + v | |
> + +---+ | |
> + | |<----------+ |
> + | |<------------------------+
> + | |------+
> + +---+ |
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +This case is still legal for this algorithm.
> +When the writer leaves the page, it simply goes into the ring buffer
> +since the reader page still points to the next location in the ring
> +buffer.
> +
> +
> +The main pointers:
> +
> + reader page - The page used solely by the reader and is not part
> + of the ring buffer (may be swapped in)
> +
> + head page - the next page in the ring buffer that will be swapped
> + with the reader page.
> +
> + tail page - the page where the next write will take place.
> +
> + commit page - the page that last finished a write.
> +
> +The commit page only is updated by the outer most writer in the
> +writer stack. A writer that preempts another writer will not move the
> +commit page.
> +
> +When data is written into the ring buffer, a position is reserved
> +in the ring buffer and passed back to the writer. When the writer
> +is finished writing data into that position, it commits the write.
> +
> +Another write (or a read) may take place at anytime during this
> +transaction. If another write happens it must finish before continuing
> +with the previous write.
> +
> +
> + Write reserve:
> +
> + Buffer page
> + +---------+
> + |written |
> + +---------+ <--- given back to writer (current commit)
> + |reserved |
> + +---------+ <--- tail pointer
> + | empty |
> + +---------+
> +
> + Write commit:
> +
> + Buffer page
> + +---------+
> + |written |
> + +---------+
> + |written |
> + +---------+ <--- next positon for write (current commit)
> + | empty |
> + +---------+
> +
> +
> + If a write happens after the first reserve:
> +
> + Buffer page
> + +---------+
> + |written |
> + +---------+ <-- current commit
> + |reserved |
> + +---------+ <--- given back to second writer
> + |reserved |
> + +---------+ <--- tail pointer
> +
> + After second writer commits:
> +
> +
> + Buffer page
> + +---------+
> + |written |
> + +---------+ <--(last full commit)
> + |reserved |
> + +---------+
> + |pending |
> + |commit |
> + +---------+ <--- tail pointer
> +
> + When the first writer commits:
> +
> + Buffer page
> + +---------+
> + |written |
> + +---------+
> + |written |
> + +---------+
> + |written |
> + +---------+ <--(last full commit and tail pointer)
> +
> +
> +The commit pointer points to the last write location that was
> +committed without preempting another write. When a write that
> +preempted another write is committed, it only becomes a pending commit
> +and will not be a full commit till all writes have been committed.
> +
> +The commit page points to the page that has the last full commit.
> +The tail page points to the page with the last write (before
> +committing).
> +
> +The tail page is always equal to or after the commit page. It may
> +be several pages ahead. If the tail page catches up to the commit
> +page then no more writes may take place (regardless of the mode
> +of the ring buffer: overwrite and produce/consumer).
> +
> +The order of pages are:
> +
> + head page
> + commit page
> + tail page
> +
> +Possible scenario:
> + tail page
> + head page commit page |
> + | | |
> + v v v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +There is a special case that the head page is after either the commit page
> +and possibly the tail page. That is when the commit (and tail) page has been
> +swapped with the reader page. This is because the head page is always
> +part of the ring buffer, but the reader page is not. When ever there
> +has been less than a full page that has been committed inside the ring buffer,
> +and a reader swaps out a page, it will be swapping out the commit page.
> +
> +
> + reader page commit page tail page
> + | | |
> + v | |
> + +---+ | |
> + | |<----------+ |
> + | |<------------------------+
> + | |------+
> + +---+ |
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> + ^
> + |
> + head page
> +
> +
> +In this case, the head page will not move when the tail and commit
> +move back into the ring buffer.
> +
> +The reader can not swap a page into the ring buffer if the commit page
> +is still on that page. If the read meets the last commit (real commit
> +not pending or reserved), then there is nothing more to read.
> +The buffer is considered empty until another full commit finishes.
> +
> +When the tail meets the head page, if the buffer is in overwrite mode,
> +the head page will be pushed ahead one. If the buffer is in producer/consumer
> +mode, the write will fail.
> +
> +Overwrite mode:
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> + ^
> + |
> + head page
> +
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> + ^
> + |
> + head page
> +
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> + ^
> + |
> + head page
> +
> +Note, the reader page will still point to the previous head page.
> +But when a swap takes place, it will use the most recent head page.
> +
> +
> +Making the Ring Buffer Lockless:
> +--------------------------------
> +
> +The main idea behind the lockless algorithm is to combine the moving
> +of the head_page pointer with the swapping of pages with the reader.
> +State flags are placed inside the pointer to the page. To do this,
> +each page must be aligned in memory by 4 bytes. This will allow the 2
> +least significant bits of the address to be used as flags. Since
> +they will always be zero for the address. To get the address,
> +simply mask out the flags.
> +
> + MASK = ~3
> +
> + address & MASK
> +
> +Two flags will be kept by these two bits:
> +
> + HEADER - the page being pointed to is a head page
> +
> + UPDATE - the page being pointed to is being updated by a writer
> + and was or is about to be a head page.
> +
> +
> + reader page
> + |
> + v
> + +---+
> + | |------+
> + +---+ |
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-H->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +
> +The above pointer "-H->" would have the HEADER flag set. That is
> +the next page is the next page to be swapped out by the reader.
> +This pointer means the next page is the head page.
> +
> +When the tail page meets the head pointer, it will use cmpxchg to
> +change the pointer to the UPDATE state:
> +
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-H->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +"-U->" represents a pointer in the UPDATE state.
> +
> +Any access to the reader will need to take some sort of lock to serialize
> +the readers. But the writers will never take a lock to write to the
> +ring buffer. This means we only need to worry about a single reader,
> +and writes only preempt in "stack" formation.
> +
> +When the reader tries to swap the page with the ring buffer, it
> +will also use cmpxchg. If the flag bit in the pointer to the
> +head page does not have the HEADER flag set, the compare will fail
> +and the reader will need to look for the new head page and try again.
> +Note, the flag UPDATE and HEADER are never set at the same time.
> +
> +The reader swaps the reader page as follows:
> +
> + +------+
> + |reader| RING BUFFER
> + |page |
> + +------+
> + +---+ +---+ +---+
> + | |--->| |--->| |
> + | |<---| |<---| |
> + +---+ +---+ +---+
> + ^ | ^ |
> + | +---------------+ |
> + +-----H-------------+
> +
> +The reader sets the reader page next pointer as HEADER to the page after
> +the head page.
> +
> +
> + +------+
> + |reader| RING BUFFER
> + |page |-------H-----------+
> + +------+ v
> + | +---+ +---+ +---+
> + | | |--->| |--->| |
> + | | |<---| |<---| |<-+
> + | +---+ +---+ +---+ |
> + | ^ | ^ | |
> + | | +---------------+ | |
> + | +-----H-------------+ |
> + +--------------------------------------+
> +
> +It does a cmpxchg with the pointer to the previous head page to make it
> +point to the reader page. Note that the new pointer does not have the HEADER
> +flag set. This action atomically moves the head page forward.
> +
> + +------+
> + |reader| RING BUFFER
> + |page |-------H-----------+
> + +------+ <---------------+ v
> + | ^ +---+ +---+ +---+
> + | | | |-->| |-->| |
> + | | | |<--| |<--| |<-+
> + | | +---+ +---+ +---+ |
> + | | | ^ | |
> + | | +-------------+ | |
> + | +-----------------------------+ |
> + +------------------------------------+
> +
Ditto.
Thanks,
H.Seto
> +After the new head page is set, the previous pointer of the head page is
> +updated to the reader page.
> +
> + +------+
> + |reader| RING BUFFER
> + |page |-------H-----------+
> + +------+ v
> + | ^ +---+ +---+ +---+
> + | | | |-->| |-->| |
> + | | | |<--| |<--| |<-+
> + | | +---+ +---+ +---+ |
> + | | | ^ | |
> + | | +-------------+ | |
> + | +-----------------------------+ |
> + +------------------------------------+
> +
> + +------+
> + |buffer| RING BUFFER
> + |page |-------H-----------+ <--- New head page
> + +------+ <---------------+ v
> + | ^ +---+ +---+ +---+
> + | | | | | |-->| |
> + | | New | | | |<--| |<-+
> + | | Reader +---+ +---+ +---+ |
> + | | page ----^ | |
> + | | | |
> + | +-----------------------------+ |
> + +------------------------------------+
> +
> +Another important point. The page that the reader page points back to
> +by its previous pointer (the one that now points to the new head page)
> +never points back to the reader page. That is because the reader page is
> +not part of the ring buffer. Traversing the ring buffer via the next pointers
> +will always stay in the ring buffer. Traversing the ring buffer via the
> +prev pointers may not.
> +
> +Note, the way to determine a reader page is simply by examining the previous
> +pointer of the page. If the next pointer of the previous page does not
> +point back to the original page, then the original page is a reader page:
> +
> +
> + +--------+
> + | reader | next +----+
> + | page |-------->| |<====== (buffer page)
> + +--------+ +----+
> + | | ^
> + | v | next
> + prev | +----+
> + +------------->| |
> + +----+
> +
> +The way the head page moves forward:
> +
> +When the tail page meets the head page and the buffer is in overwrite mode
> +and more writes take place, the head page must be moved forward before the
> +writer may move the tail page. The way this is done is that the writer
> +performs a cmpxchg to convert the pointer to the head page from the HEADER
> +flag to have the UPDATE flag set. Once this is done, the reader will
> +not be able to swap the head page from the buffer, nor will it be able to
> +move the head page, until the writer is finished with the move.
> +
> +This eliminates any races that the reader can have on the writer. The reader
> +must spin, and this is why the reader can not preempt the writer.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-H->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The following page will be made into the new head page.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +After the new head page has been set, we can set the old head page
> +pointer back to NORMAL.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +After the head page has been moved, the tail page may now move forward.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +
> +The above are the trivial updates. Now for the more complex scenarios.
> +
> +
> +As stated before, if enough writes preempt the first write, the
> +tail page may make it all the way around the buffer and meet the commit
> +page. At this time, we must start dropping writes (usually with some kind
> +of warning to the user). But what happens if the commit was still on the
> +reader page? The commit page is not part of the ring buffer. The tail page
> +must account for this.
> +
> +
> + reader page commit page
> + | |
> + v |
> + +---+ |
> + | |<----------+
> + | |
> + | |------+
> + +---+ |
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-H->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> + ^
> + |
> + tail page
> +
> +If the tail page were to simply push the head page forward, the commit when
> +leaving the reader page would not be pointing to the correct page.
> +
> +The solution to this is to test if the commit page is on the reader page
> +before pushing the head page. If it is, then it can be assumed that the
> +tail page wrapped the buffer, and we must drop new writes.
> +
> +This is not a race condition, because the commit page can only be moved
> +by the outter most writer (the writer that was preempted).
> +This means that the commit will not move while a writer is moving the
> +tail page. The reader can not swap the reader page if it is also being
> +used as the commit page. The reader can simply check that the commit
> +is off the reader page. Once the commit page leaves the reader page
> +it will never go back on it unless a reader does another swap with the
> +buffer page that is also the commit page.
> +
> +
> +Nested writes
> +-------------
> +
> +In the pushing forward of the tail page we must first push forward
> +the head page if the head page is the next page. If the head page
> +is not the next page, the tail page is simply updated with a cmpxchg.
> +
> +Only writers move the tail page. This must be done atomically to protect
> +against nested writers.
> +
> + temp_page = tail_page
> + next_page = temp_page->next
> + cmpxchg(tail_page, temp_page, next_page)
> +
> +The above will update the tail page if it is still pointing to the expected
> +page. If this fails, a nested write pushed it forward, the the current write
> +does not need to push it.
> +
> +
> + temp page
> + |
> + v
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +Nested write comes in and moves the tail page forward:
> +
> + tail page (moved by nested writer)
> + temp page |
> + | |
> + v v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The above would fail the cmpxchg, but since the tail page has already
> +been moved forward, the writer will just try again to reserve storage
> +on the new tail page.
> +
> +But the moving of the head page is a bit more complex.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-H->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The write converts the head page pointer to UPDATE.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +But if a nested writer preempts here. It will see that the next
> +page is a head page, but it is also nested. It will detect that
> +it is nested and will save that information. The detection is the
> +fact that it sees the UPDATE flag instead of a HEADER or NORMAL
> +pointer.
> +
> +The nested writer will set the new head page pointer.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +But it will not reset the update back to normal. Only the writer
> +that converted a pointer from HEAD to UPDATE will convert it back
> +to NORMAL.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +After the nested writer finishes, the outer most writer will convert
> +the UPDATE pointer to NORMAL.
> +
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +
> +It can be even more complex if several nested writes came in and moved
> +the tail page ahead several pages:
> +
> +
> +(first writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-H->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The write converts the head page pointer to UPDATE.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +Next writer comes in, and sees the update and sets up the new
> +head page.
> +
> +(second writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The nested writer moves the tail page forward. But does not set the old
> +update page to NORMAL because it is not the outer most writer.
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +Another writer preempts and sees the page after the tail page is a head page.
> +It changes it from HEAD to UPDATE.
> +
> +(third writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-U->| |--->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The writer will move the head page forward:
> +
> +
> +(third writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-U->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +But now that the third writer did change the HEAD flag to UPDATE it
> +will convert it to normal:
> +
> +
> +(third writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +
> +Then it will move the tail page, and return back to the second writer.
> +
> +
> +(second writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +
> +The second writer will fail to move the tail page because it was already
> +moved, so it will try again and add its data to the new tail page.
> +It will return to the first writer.
> +
> +
> +(first writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +The first writer can not know atomically test if the tail page moved
> +while it updates the HEAD page. It will then update the head page to
> +what it thinks is the new head page.
> +
> +
> +(first writer)
> +
> + tail page
> + |
> + v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +Since the cmpxchg returns the old value of the pointer the first writer
> +will see it succeeded in updating the pointer from NORMAL to HEAD.
> +But as we can see, this is not good enough. It must also check to see
> +if the tail page is either where it use to be or on the next page:
> +
> +
> +(first writer)
> +
> + A B tail page
> + | | |
> + v v v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |-H->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +If tail page != A and tail page does not equal B, then it must reset the
> +pointer back to NORMAL. The fact that it only needs to worry about
> +nested writers, it only needs to check this after setting the HEAD page.
> +
> +
> +(first writer)
> +
> + A B tail page
> + | | |
> + v v v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |-U->| |--->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
> +Now the writer can update the head page. This is also why the head page must
> +remain in UPDATE and only reset by the outer most writer. This prevents
> +the reader from seeing the incorrect head page.
> +
> +
> +(first writer)
> +
> + A B tail page
> + | | |
> + v v v
> + +---+ +---+ +---+ +---+
> +<---| |--->| |--->| |--->| |-H->
> +--->| |<---| |<---| |<---| |<---
> + +---+ +---+ +---+ +---+
> +
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists