netdev - Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20181012083941.1904f59c@cakuba.netronome.com>
Date:   Fri, 12 Oct 2018 08:39:41 -0700
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Daniel Borkmann <daniel@...earbox.net>
Cc:     alexei.starovoitov@...il.com, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do
 incremental updates

On Fri, 12 Oct 2018 15:30:57 +0200, Daniel Borkmann wrote:
> On 10/12/2018 10:39 AM, Daniel Borkmann wrote:
> > On 10/12/2018 05:04 AM, Jakub Kicinski wrote:  
> >> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:  
> >>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
> >>> to use similar tail update scheme as in perf and bcc allowing the
> >>> kernel to make forward progress not only after full timely walk.  
> >>
> >> The extra memory barriers won't impact performance?  If I read the code
> >> correctly we now have:
> >>
> >> 	while (bla) {
> >> 		head = HEAD
> >> 		rmb()
> >>
> >> 		...
> >>
> >> 		mb()
> >> 		TAIL = tail
> >> 	}
> >>
> >> Would it make sense to try to piggy back on the mb() for head re-read
> >> at least?  Perhaps that's a non-issue, just wondering.  
> > 
> > From the scheme specified in the comment in prior patch my understanding
> > would be that they don't pair (see B and C) so there would be no guarantee
> > that load of head would happen before load of data. Fwiw, I've been using
> > the exact same semantics as user space perf tool walks the perf mmap'ed
> > ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop  
> 
> On that note, I'll also respin, after some clarification with PeterZ on
> why perf is using {rmb,mb}() barriers today as opposed to more lightweight
> smp_{rmb,mb}() ones it turns out there is no real reason other than
> historic one and perf can be changed and made more efficient as well. ;)

Cool, to be clear I meant something like:

	head = HEAD
	rmb()

	while (bla) {
		...

		head = HEAD
		mb()
		TAIL = tail
	}

> > pushing into ring buffer while user space walks it and indicates how far
> > it has consumed data via tail update, it would allow for making room
> > successively and not only after full run has complete, so we don't make
> > any assumptions in the generic libbpf library helper on how slow/quick
> > the callback would be processing resp. how full ring is, etc, and kernel
> > pushing new data can be processed in the same run if necessary. One thing
> > we could consider is to batch tail updates, say, every 8 elements and a
> > final update once we break out walking the ring; probably okay'ish as a
> > heuristic..

Makes sense, I would be tempted to do some batching, but since I'm
mostly working with I/O devices I may think that memory barriers cost
more than they do for main memory-only accesses :)

> >>> Also few other improvements to use realloc() instead of free() and
> >>> malloc() combination and for the callback use proper perf_event_header
> >>> instead of void pointer, so that real applications can use container_of()
> >>> macro with proper type checking.  
> >>
> >> FWIW the free() + malloc() was to avoid the the needless copy of the
> >> previous event realloc() may do.  It makes sense to use realloc()
> >> especially if you want to put extra info in front of the buffer, just
> >> sayin' it wasn't a complete braino ;)  
> > 
> > No strong preference from my side, I'd think that it might be sensible in
> > any case from applications to call the bpf_perf_event_read_simple() with a
> > already preallocated buffer, depending on the expected max element size from
> > BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
> > the BPF prog and MTU 1500 this would more than suffice to avoid new
> > allocations altogether. Anyway, given we only grow the new memory area I
> > did some testing on realloc() with an array of pointers to prior malloc()'ed
> > buffers, running randomly 10M realloc()s to increase size over them and
> > saw <1% where area had to be moved, so we're hitting corner case of a corner
> > case, I'm also ok to leave the combination, though. :)

Thanks for doing the experiment, not strong preference here either, so
realloc() seems good!