Message-ID: <20090714225730.GA19199@Krystal>
Date:	Tue, 14 Jul 2009 18:57:30 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	ltt-dev@...ts.casi.polymtl.ca, linux-kernel@...r.kernel.org,
	Lai Jiangshan <laijs@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Robert Wisniewski <bob@...son.ibm.com>,
	Ingo Molnar <mingo@...e.hu>
Subject: Re: LTTng 0.146, adds extra read-side sub-buffer for flight
	recorder

* Steven Rostedt (rostedt@...dmis.org) wrote:
> 
> 
> On Mon, 13 Jul 2009, Mathieu Desnoyers wrote:
> 
> > Hi,
> > 
> > So, I needed a weekend break from writing my thesis (it's almost over!) ;)
> > and I had the great idea to try to come up with a way to ensure that
> > the read side of LTTng flight recorder mode never sees corrupted
> > data.
> > 
> > Basically, this is the main thing Steven has been asking me for a
> > while. And it looks like I just figured out a way to do it.
> > 
> > So for flight recorder tracing, this new LTTng version allocates an
> > extra subbuffer which gets exchanged by the reader with the writer
> > subbuffer before it gets read.
> > 
> > Normal tracing does not need this extra subbuffer, because the
> > write-side just drops events when the buffer is full. So we don't
> > allocate it and we don't perform any exchange. The space
> > reservation/commit code plays nicely with both flight recorder and
> > normal tracing schemes.
> > 
> > Here is how I did it:
> > 
> > No modification was required to the buffer space reservation/commit
> > algorithm. I just had to do the following at the backend level
> > (responsible for writing data to/reading data from the buffer):
> > 
> > I am using an array of pointers (one pointer for each subbuffer), plus a
> > pointer to the reader subbuffer. Each of these pointers points to
> > an array of pages, which are all the pages that constitute a subbuffer.
> > Reads/writes from/to the buffer are done by accessors which pick up the
> > right page location within this page table. By modifying the top-level
> > subbuffer pointer, we can swap a whole subbuffer in a single operation.
> > 
> > There is a trick to deal with concurrency between writer and reader.
> > When the top-level subbuffer pointers are not used (no writer is
> > currently writing into it, no reader is reading from its subbuffer), we
> > set a RCHAN_NOREF_FLAG (value: 0x1) which indicates that no reference is
> > currently taken to this subbuffer. As long as this flag is set in the
> > pointer, it is safe for the reader to exchange it. When the writer needs
> > to access this subbuffer for writing, it clears the flag, and sets it
> > back after committing the last piece of data to it.
> > 
> > When the reader figures out that the write-side subbuffer it is trying
> > to exchange has a reference, it fails with -EAGAIN.
> > 
> > Nice things about the way I do it here:
> > 
> > - I keep the separation between the space reservation layer and back-end
> >   buffer layer. The extra reader subbuffer exchange is done at the
> >   back-end layer. The reason why it took me so long to come up
> >   with something is that I tried to do it at the space reservation
> >   layer, which did not fit the space reservation semantics well.
> > 
> > - Keeping space reservation and physical buffer management separate
> >   helps split the complexity into sub-layers that are easier to verify.
> > 
> > - Given that the space reservation/commit is separate from the
> >   subbuffer exchange per se, I don't need any special cases for "if
> >   the tail pointer is in the reader page"... These things never happen
> >   because the reserve, commit and consumed counts are completely
> >   unrelated to the pointers to physical subbuffers.
> 
> I don't yet have time to read the patches (not this week, anyway), but I'm 
> assuming that you can only get the new page (swap) while a writer is not 
> writing to it. Thus if it is not a full page, then you must either copy 
> the data, or swap out a non full page. Not complaining here, just trying 
> to understand it :-)

Yep, it's a requirement: while a subbuffer is being written to, the
reader cannot exchange it, and therefore cannot read it. That
simplifies a lot of things.

> 
> Thus the trick is that you have a series of pointers to the data, and you 
> swap out the data and not the list?  Hmm, actually the ring buffer is 
> already like that and I probably could do the same.

There is no list involved per se. I exchange the pointers to the
sub-buffer structures, not the data itself.

Let's say I have 2 sub-buffers for the writer and one extra sub-buffer
for the reader. I would have:

- An array of 2 pointers to sub-buffer structures (owned by the writer).
- 1 pointer to a sub-buffer structure (owned by the reader).
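
As a rough illustration of that layout and of the no-reference flag, here
is a minimal user-space sketch in C. It is not the code from the patches:
every name in it (struct subbuf, struct chan_backend, NOREF_FLAG,
writer_get, writer_put, reader_exchange) is hypothetical, it uses C11
atomics instead of kernel primitives, and it assumes sub-buffer structures
are aligned so that the low bit of their address is free to carry the flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag kept in the low bit of a tagged pointer: set when
 * neither a writer nor a reader holds a reference to the sub-buffer. */
#define NOREF_FLAG   ((uintptr_t)0x1)
#define NR_WRITER_SB 2

struct subbuf {                 /* hypothetical sub-buffer structure */
	void **pages;           /* pages that make up this sub-buffer */
	unsigned int nr_pages;
};

struct chan_backend {
	_Atomic uintptr_t writer_sb[NR_WRITER_SB]; /* tagged, writer-owned */
	struct subbuf *reader_sb;                  /* reader-owned */
};

static uintptr_t sb_tag(struct subbuf *sb, uintptr_t flag)
{
	/* The low bit must be free, which holds for aligned structures. */
	assert(((uintptr_t)sb & NOREF_FLAG) == 0);
	return (uintptr_t)sb | flag;
}

static struct subbuf *sb_ptr(uintptr_t v)
{
	return (struct subbuf *)(v & ~NOREF_FLAG);
}

/* Writer takes a reference before writing into slot idx. */
static void writer_get(struct chan_backend *b, unsigned int idx)
{
	atomic_fetch_and(&b->writer_sb[idx], ~NOREF_FLAG);
}

/* Writer drops its reference after committing its last piece of data. */
static void writer_put(struct chan_backend *b, unsigned int idx)
{
	atomic_fetch_or(&b->writer_sb[idx], NOREF_FLAG);
}

/* Reader swaps its spare sub-buffer with writer slot idx, but only if
 * the slot is unreferenced; otherwise it fails with -EAGAIN. */
static int reader_exchange(struct chan_backend *b, unsigned int idx)
{
	uintptr_t old = atomic_load(&b->writer_sb[idx]);
	uintptr_t new = sb_tag(b->reader_sb, NOREF_FLAG);

	if (!(old & NOREF_FLAG))
		return -EAGAIN;
	/* Fails if the writer cleared the flag since the load above. */
	if (!atomic_compare_exchange_strong(&b->writer_sb[idx], &old, new))
		return -EAGAIN;
	b->reader_sb = sb_ptr(old);
	return 0;
}
```

In this sketch the compare-and-exchange is what makes the swap a single
operation: either the reader gets the whole sub-buffer, or it sees that a
reference is held and backs off.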

> 
> Another thing that the ring buffer does (and that makes things a little 
> complex too) is that it keeps track of the number of entries in the buffer 
> as well as the number of overruns. The number of entries in the page is 
> kept in the list data and not the data page itself.

I could add these counters to the "sub-buffer structure". They are not
part of the data itself. When I exchange the top-level pointers to these
structures, the counters, which are part of these structures, will
follow.
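
A small C sketch of why the counters follow, assuming a hypothetical
struct sb_desc that holds both the counters and the page array (these
field and function names are illustrative, not LTTng's or the ring
buffer's). The concurrency handling is left out on purpose, to show only
the ownership of the counters:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sub-buffer structure: the counters live next to the page
 * array, not inside the data pages. */
struct sb_desc {
	unsigned long entries;   /* records written to this sub-buffer */
	unsigned long overruns;  /* records overwritten or lost */
	void **pages;            /* data pages */
};

/* Exchange two top-level pointers. Nothing is copied: the counters and
 * the pages change owner together because they belong to the structure
 * being pointed to. */
static void sb_swap(struct sb_desc **writer_slot, struct sb_desc **reader_slot)
{
	struct sb_desc *tmp;

	assert(writer_slot != NULL && reader_slot != NULL);
	tmp = *writer_slot;
	*writer_slot = *reader_slot;
	*reader_slot = tmp;
}
```

After sb_swap(), reading the entries field through the reader's pointer
gives the count accumulated by the writer for that sub-buffer.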

> 
> Using a special flag to switch out the data instead of breaking the linked 
> list may make things much simpler.

Yeah :) That's why I chose to use such a flag. And a linked list seems
overly complex compared to the simple two-level page table I use here.
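
For illustration, an accessor over such a two-level table could look like
the C sketch below. The names (struct sb_pages, struct backend,
backend_addr, backend_write, SKETCH_PAGE_SIZE) are made up for this
example, and it assumes a fixed 4096-byte page and a sub-buffer size that
is a multiple of the page size. It ignores the flag bit and any
concurrency.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed page size for this sketch */

struct sb_pages {
	unsigned char **pages;    /* second level: pages of one sub-buffer */
};

struct backend {
	struct sb_pages **sb;     /* top level: one pointer per sub-buffer */
	size_t nr_sb;
	size_t sb_size;           /* bytes per sub-buffer */
};

/* Translate a buffer offset into an address by walking both levels. */
static unsigned char *backend_addr(const struct backend *b, size_t offset)
{
	size_t sb_idx, sb_off;

	assert(b->sb_size % SKETCH_PAGE_SIZE == 0);
	offset %= b->nr_sb * b->sb_size;      /* wrap around the buffer */
	sb_idx = offset / b->sb_size;         /* which sub-buffer */
	sb_off = offset % b->sb_size;         /* offset inside it */
	return b->sb[sb_idx]->pages[sb_off / SKETCH_PAGE_SIZE]
	       + sb_off % SKETCH_PAGE_SIZE;
}

/* Copy len bytes into the buffer, splitting at page boundaries. */
static void backend_write(const struct backend *b, size_t offset,
			  const void *src, size_t len)
{
	const unsigned char *p = src;

	while (len) {
		size_t room = SKETCH_PAGE_SIZE - offset % SKETCH_PAGE_SIZE;
		size_t n = len < room ? len : room;

		memcpy(backend_addr(b, offset), p, n);
		offset += n;
		p += n;
		len -= n;
	}
}
```

Because every access goes through the top-level pointer, replacing one
entry of sb[] redirects all later reads and writes for that sub-buffer
without touching the pages themselves.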

> 
> Hmm, I'll take a round to make the ring buffer closer to what you have 
> done. At this rate, we may finally merge the two to handle things that we 
> both need ;-)

Hopefully :) Please don't hesitate to ask for more info if you need any.

I'll have to go back to my thesis next week, but hopefully within 2 more
weeks I should be almost done and more available.

Mathieu

> 
> -- Steve
> 
> 
> > 
> > As always, the tree is available at:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git
> > git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git
> > 
> > The commits implementing the extra reader page for the lockless
> > scheme are:
> > 
> > lttng-relay-per-subbuffer-index.patch
> > lttng-relay-per-subbuffer-index-low-bit-noref.patch
> > lttng-relay-lockless-writer-use-noref-flag.patch
> > lttng-relay-default-sb-index-to-noref.patch
> > lttng-relay-lockless-exchange-reader-writer-pages.patch
> > 
> > Comments are welcome,
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68