linux-kernel - Re: Perf and ftrace [was Re: PyTimechart]

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 13 May 2010 12:46:18 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Pierre Tardy <tardyp@...il.com>, Ingo Molnar <mingo@...e.hu>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Tom Zanussi <tzanussi@...il.com>,
	Paul Mackerras <paulus@...ba.org>,
	linux-kernel@...r.kernel.org, arjan@...radead.org,
	ziga.mahkovec@...il.com, davem <davem@...emloft.net>
Subject: Re: Perf and ftrace [was Re: PyTimechart]

On Thu, 2010-05-13 at 12:31 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@...dmis.org) wrote:
> > On Thu, 2010-05-13 at 09:20 -0400, Mathieu Desnoyers wrote:
> [...]
> > > > > 
> > > > > ...
> > > > > 
> > > > > 97 /**
> > > > > 98  * ring_buffer_clear_noref_flag - Clear the noref subbuffer flag, for writer.
> > > > > 99  */
> > > > > 100 static __inline__
> > > > > 101 void ring_buffer_clear_noref_flag(struct ring_buffer_backend *bufb,
> > > > > 102                                   unsigned long idx)
> > > > > 103 {
> > > > > 104         struct ring_buffer_backend_page *sb_pages, *new_sb_pages;
> > > > > 105
> > > > > 106         sb_pages = bufb->buf_wsb[idx].pages;
> > > > > 107         for (;;) {
> > > > > 108                 if (!RCHAN_SB_IS_NOREF(sb_pages))
> > > > > 109                         return; /* Already writing to this buffer */
> > > > > 110                 new_sb_pages = sb_pages;
> > > > > 111                 RCHAN_SB_CLEAR_NOREF(new_sb_pages);
> > > > > 112                 new_sb_pages = cmpxchg(&bufb->buf_wsb[idx].pages,
> > > > > 113                         sb_pages, new_sb_pages);
> > > > > 114                 if (likely(new_sb_pages == sb_pages))
> > > > > 115                         break;
> > > > > 116                 sb_pages = new_sb_pages;
> > > > 
> > > > The writer calls this??
> > > 
> > > Yes. But the common case (for each event) is simply a
> > > "if (!RCHAN_SB_IS_NOREF(sb_pages))" test that returns. The cmpxchg() is only
> > > performed at subbuffer boundary.
> > 
> > Is the cmpxchg only contending with other writers?
> 
> No. Would have this been the case, I would have used a cmpxchg_local(). This
> cmpxchg used to deal with subbuffer swap is touching the subbuffer "pages"
> pointer, which can be updated concurrently by other writers as well as readers.
> 
> The writer clears the noref flags when starting to write in a subbuffers, and
> sets it when delivering the subbuffer (when it is fully committed).
> 
> The reader can only ever swap the subbuffer with the one it owns if the noref
> flag is set. The reader uses a cmpxchg() too to perform the swap.
> 

So the writer can loop because of a reader? And theoretically, a
aggressive reader that is constantly swapping pages can halt a writer?

I thought you said you did not have loops in the writers?

> [...]
> 
> > > > 
> > > > This looks just like the swap with reader_page that I do, except you use
> > > > a table and I use the list.  How do you replenish the buf_rsb.pages if
> > > > the splice keeps the page you just received active?
> > > 
> > > I don't allow other reads to proceed as long as splice is holding pages that
> > > belong to the reader-owned subbuffer. The read semantic is basically:
> > > 
> > > ring_buffer_open_read() /* only one reader at a time can open a ring buffer */
> > > get_subbuf_size()
> > > while (buffer is not finalized and empty) {
> > >   poll()
> > >   ret = ring_buffer_get_subbuf()
> > >   if (!ret)
> > >     continue;
> > >   /* The splice ops below can be performed in multiple calls, e.g. first splice
> > >    * only a portion of a subbuffer to a pipe, then splice to the disk/network,
> > >    * and move to the next subbuffer portion until all the subbuffer is sent.
> > >    */
> > >   splice one subbuffer worth of data to a pipe
> > >   splice the data from pipe to disk/network
> > >   ring_buffer_put_subbuf()
> > > }
> > > ring_buffer_close_read()
> > > 
> > > The reader code above works both with flight recorder and non-overwrite mode.
> > > 
> > > The code above assumes that upon return from the splice() to disk/network,
> > > splice() is not using the pages anymore (I assume that splice() performs the
> > > transfer synchronously with the call).
> > > 
> > > The VFS interface I use for get_subbuf_size(), ring_buffer_get_subbuf() and
> > > ring_buffer_put_subbuf() are new ioctls. Note that these can be used for both
> > > splice() and mmap() types of backend access, as they only call into the
> > > frontend.
> > 
> > Hmm, so basically you lose pages until they are returned. I guess I can
> > trivially add the same thing now to the current ring buffer.
> 
> Yep. Having the ability to keep an array of pages (rather that just a single
> page at a time) allows splice() to move many pages at once efficiently, while
> permitting this "pages owned by the readers, lend to splice() until it returns"
> simplification. I also never have to allocate pages while tracing: all the pages
> I need are allocated when the buffer is created (and at the special case of cpu
> hotplug, but this is expected for per-cpu buffers).

Yep, I just wrote the changes for my buffer. It was rather trivial ;-)
Although, I just set a hard coded 16 pages max of cached pages.

-- Steve

> 
> In addition, this would play well with mmap() too: we can simply add a
> ring_buffer_get_mmap_offset() method to the backend (exported through another
> ioctl) that would let user-space know the start of the mmap'd buffer range
> currently owned by the reader. So we can inform user-space of the currently
> owned page range without even changing the underlying memory map.
> 
> Thanks,
> 
> Mathieu
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/