Message-ID: <20100815133513.GA18175@Krystal>
Date:	Sun, 15 Aug 2010 09:35:13 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...e.hu>,
	LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Christoph Hellwig <hch@....de>, Li Zefan <lizf@...fujitsu.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Johannes Berg <johannes.berg@...el.com>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Tom Zanussi <tzanussi@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Andi Kleen <andi@...stfloor.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Jeremy Fitzhardinge <jeremy@...p.org>,
	"Frank Ch. Eigler" <fche@...hat.com>, Tejun Heo <htejun@...il.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

* Steven Rostedt (rostedt@...dmis.org) wrote:
> Egad! Go on vacation and the world falls apart.
> 
> On Wed, 2010-08-04 at 08:27 +0200, Peter Zijlstra wrote:
> > On Tue, 2010-08-03 at 11:56 -0700, Linus Torvalds wrote:
> > > On Tue, Aug 3, 2010 at 10:18 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> > > >
> > > > FWIW I really utterly detest the whole concept of sub-buffers.
> > > 
> > > I'm not quite sure why. Is it something fundamental, or just an
> > > implementation issue?
> > 
> > The sub-buffer thing that both ftrace and lttng have is creating a large
> > buffer from a lot of small buffers, I simply don't see the point of
> > doing that. It adds complexity and limitations for very little gain.
> 
> So, I want to allocate a 10Meg buffer. I need to make sure the kernel
> has 10megs of memory available. If the memory is quite fragmented, then
> too bad, I lose out.
> 
> Oh wait, I could also use vmalloc. But then again, now I'm blasting
> valuable TLB entries for a tracing utility, thus making the tracer have
> an even bigger impact on the entire system.
> 
> BAH!
> 
> I originally wanted to go with the continuous buffer, but I was
> convinced after trying to implement it, that it was a bad choice.
> Specifically, because of needing to 1) get large amounts of memory that
> is continuous, or 2) eating up TLB entries and causing the system to
> perform poorer.
> 
> I chose page size "sub-buffers" to solve the above. It also made
> implementing splice trivial. OK, I admit, I never thought about mmapping
> the buffers, just because I figured splice was faster. But I do have
> patches that allow a user to mmap the entire ring buffer, but only in a
> "producer/consumer" mode.

FYI: the generic ring buffer also implements the mmap() interface for the flight
recorder mode.

> 
> Note, I use page size sub-buffers, but the design could work with any
> size sub-buffers. I just never implemented that (even though, when I
> wrote the code it was secretly on my todo list).

The main difference between our designs is that Ftrace uses a linked list and the
generic ring buffer library uses a sub-buffer/page table. Consider the use-case
of reading the available flight recorder pages in reverse order, which I heard
about at LinuxCon from both Peter and Masami. It makes a whole lot of sense,
because the data we care about the most and want to read ASAP is in the last
sub-buffers. For that use-case, I think the page table is more appropriate (and
flexible) than a single-direction linked list, because it allows picking an
arbitrary page (or sub-buffer) in the buffer without walking over all pages.
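
To illustrate the difference, here is a minimal user-space sketch. It is not
the code of either implementation: the structure names, NR_SUBBUF and both
helpers are hypothetical, and the list variant assumes a circular,
forward-only list as described above.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBBUF 8	/* hypothetical number of sub-buffers */

/* Hypothetical sub-buffer descriptor, reachable through a table. */
struct subbuf {
	unsigned long seq;	/* sequence number of the data it holds */
};

struct table_buf {
	struct subbuf *table[NR_SUBBUF];
	size_t last_written;	/* index of the most recent sub-buffer */
};

/*
 * Return the sub-buffer "age" steps behind the most recent one
 * (age 0 is the newest). Constant time: a single index computation.
 */
static struct subbuf *table_buf_get_reverse(const struct table_buf *buf,
					    size_t age)
{
	assert(age < NR_SUBBUF);
	return buf->table[(buf->last_written + NR_SUBBUF - age) % NR_SUBBUF];
}

/* Hypothetical sub-buffer linked in a circular, forward-only list. */
struct list_subbuf {
	unsigned long seq;
	struct list_subbuf *next;
};

/*
 * Same lookup on the list: going back "age" steps means walking
 * forward nr - age steps around the ring, so the cost grows with
 * the number of sub-buffers.
 */
static struct list_subbuf *list_buf_get_reverse(struct list_subbuf *last,
						size_t nr, size_t age)
{
	size_t steps;

	assert(age < nr);
	for (steps = nr - age; steps > 0; steps--)
		last = last->next;
	return last;
}
```

Reading the newest data first is then a loop over age = 0, 1, 2, ... with one
table lookup per sub-buffer, whereas each lookup on the forward-only list
walks most of the ring.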

> 
> 
> > 
> > Their benefit is known synchronization points into the stream, you can
> > parse each sub-buffer independently, but you can always break up a
> > continuous stream into smaller parts or use a transport that includes
> > index points or whatever.
> > 
> > Their down side is that you can never have individual events larger than
> > the sub-buffer, you need to be aware of the sub-buffer when reserving
> > space etc..
> 
> The answer to that is to make a macro to do the assignment of the event,
> and add a new API.
> 
> 	event = ring_buffer_reserve_unlimited();
> 
> 	ring_buffer_assign(event, data1);
> 	ring_buffer_assign(event, data2);
> 
> 	ring_buffer_commit(event);
> 
> The ring_buffer_reserve_unlimited() could reserve a bunch of space
> beyond one ring buffer. It could reserve data in fragments. Then the
> ring_buffer_assign() could either copy directly to the event (if the
> event exists on one sub buffer) or do a copy if the space was fragmented.
> 
> Of course, userspace would need to know how to read it. And it can get
> complex due to interrupts coming in and also reserving between
> fragments, or what happens if a partial fragment is overwritten. But all
> these are not impossible to solve.

Dealing with fragmentation, sub-buffer loss, etc. is then pushed up to the trace
analyzer. While I agree that we have to keep the burden of complexity out of the
kernel as much as possible, I also think that an elegant design at the data
producer level which keeps the trace reader/analyzer simple and reliable should
be favored if it keeps a similar level of complexity in the kernel code.

A good argument supporting this is that some tracing users want to use a mmap()
scheme directly on the trace buffers to analyze the data directly in user-space
on the traced machine. In these cases, the complexity/overhead added to the
analyzer will impact the overall performance of the system being traced.
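
To make the cost of fragments more concrete, here is a rough user-space
sketch of what the write side of a fragmented assignment might look like. It
is only an illustration under my own assumptions: struct fragment,
struct frag_event, MAX_FRAGS and frag_event_assign() are hypothetical names,
not part of Ftrace or of the generic ring buffer, and concurrency with
interrupts is ignored entirely.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 4	/* hypothetical limit on fragments per event */

/* Hypothetical: one contiguous piece of reserved space in a sub-buffer. */
struct fragment {
	char *addr;
	size_t len;
};

/* Hypothetical reserved event that may span several fragments. */
struct frag_event {
	struct fragment frag[MAX_FRAGS];
	size_t nr_frags;
	size_t cur;	/* fragment currently being written */
	size_t off;	/* write offset within that fragment */
};

/*
 * Copy len bytes into the event, spilling into the next fragment when
 * the current one is full. Returns the number of bytes actually copied,
 * which is less than len if the reservation runs out of space.
 */
static size_t frag_event_assign(struct frag_event *ev, const void *data,
				size_t len)
{
	const char *src = data;
	size_t copied = 0;

	while (copied < len && ev->cur < ev->nr_frags) {
		struct fragment *f = &ev->frag[ev->cur];
		size_t room = f->len - ev->off;
		size_t chunk = len - copied < room ? len - copied : room;

		memcpy(f->addr + ev->off, src + copied, chunk);
		ev->off += chunk;
		copied += chunk;
		if (ev->off == f->len) {
			/* Fragment full: continue in the next one. */
			ev->cur++;
			ev->off = 0;
		}
	}
	return copied;
}
```

A single field can end up split across two fragments here, so the reader has
to perform the matching reassembly before it can interpret the event, and it
has to cope with one of the fragments having been overwritten.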

Thanks,

Mathieu

> 
> -- Steve
> 
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com