lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100513163150.GA13005@Krystal>
Date:	Thu, 13 May 2010 12:31:50 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Pierre Tardy <tardyp@...il.com>, Ingo Molnar <mingo@...e.hu>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Tom Zanussi <tzanussi@...il.com>,
	Paul Mackerras <paulus@...ba.org>,
	linux-kernel@...r.kernel.org, arjan@...radead.org,
	ziga.mahkovec@...il.com, davem <davem@...emloft.net>
Subject: Re: Perf and ftrace [was Re: PyTimechart]

* Steven Rostedt (rostedt@...dmis.org) wrote:
> On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
[...]
> > > > 
> > > > ...
> > > > 
> > > > 97 /**
> > > > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > > 99  */
> > > > 100 static __inline__
> > > > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > > 102                                   unsigned long idx)
> > > > 103 {
> > > > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > > 105
> > > > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > > > 107         for (;;) {
> > > > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > > 109                         return; /* Already writing to this buffer */
> > > > 110                 new_sb_pages = sb_pages;
> > > > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > > 113                         sb_pages, new_sb_pages);
> > > > 114                 if (likely(new_sb_pages == sb_pages))
> > > > 115                         break;
> > > > 116                 sb_pages = new_sb_pages;
> > > 
> > > The writer calls this??
> > 
> > Yes. But the common case (for each event) is simply a
> > "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> > performed at subbuffer boundary.
> 
> Is the cmpxchg only contending with other writers?

No. Would have this been the case, I would have used a cmpxchg_local(). This
cmpxchg used to deal with subbuffer swap is touching the subbuffer "pages"
pointer, which can be updated concurrently by other writers as well as readers.

The writer clears the noref flags when starting to write in a subbuffers, and
sets it when delivering the subbuffer (when it is fully committed).

The reader can only ever swap the subbuffer with the one it owns if the noref
flag is set. The reader uses a cmpxchg() too to perform the swap.

[...]

> > > 
> > > This looks just like the swap with reader_page that I do, except you use
> > > a table and I use the list.  How do you replenish the buf_rsb.pages if
> > > the splice keeps the page you just received active?
> > 
> > I don't allow other reads to proceed as long as splice is holding pages that
> > belong to the reader-owned subbuffer. The read semantic is basically:
> > 
> > ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
> > get_subbuf_size()
> > while (buffer is not finalized and empty) {
> >   poll()
> >   ret = ring_buffer_get_subbuf()
> >   if (!ret)
> >     continue;
> >   /* The splice ops below can be performed in multiple calls, e.g. first splice
> >    * only a portion of a subbuffer to a pipe, then splice to the disk/network,
> >    * and move to the next subbuffer portion until all the subbuffer is sent.
> >    */
> >   splice one subbuffer worth of data to a pipe
> >   splice the data from pipe to disk/network
> >   ring_buffer_put_subbuf()
> > }
> > ring_buffer_close_read()
> > 
> > The reader code above works both with flight recorder and non-overwrite mode.
> > 
> > The code above assumes that upon return from the splice() to disk/network,
> > splice() is not using the pages anymore (I assume that splice() performs the
> > transfer synchronously with the call).
> > 
> > The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
> > ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
> > splice() and mmap() types of backend access, as they only call into the
> > frontend.
> 
> Hmm, so basically you lose pages until they are returned. I guess I can
> trivially add the same thing now to the current ring buffer.

Yep. Having the ability to keep an array of pages (rather that just a single
page at a time) allows splice() to move many pages at once efficiently, while
permitting this "pages owned by the readers, lend to splice() until it returns"
simplification. I also never have to allocate pages while tracing: all the pages
I need are allocated when the buffer is created (and at the special case of cpu
hotplug, but this is expected for per-cpu buffers).

In addition, this would play well with mmap() too: we can simply add a
ring_buffer_get_mmap_offset() method to the backend (exported through another
ioctl) that would let user-space know the start of the mmap'd buffer range
currently owned by the reader. So we can inform user-space of the currently
owned page range without even changing the underlying memory map.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ