lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080715132543.GB20037@Krystal>
Date:	Tue, 15 Jul 2008 09:25:43 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	akpm@...ux-foundation.org, Ingo Molnar <mingo@...e.hu>,
	linux-kernel@...r.kernel.org,
	Masami Hiramatsu <mhiramat@...hat.com>,
	"Frank Ch. Eigler" <fche@...hat.com>,
	Hideo AOKI <haoki@...hat.com>,
	Takashi Nishiie <t-nishiie@...css.fujitsu.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Eduard - Gabriel Munteanu <eduard.munteanu@...ux360.ro>
Subject: Re: [patch 01/15] Kernel Tracepoints

* Peter Zijlstra (peterz@...radead.org) wrote:
> On Wed, 2008-07-09 at 10:59 -0400, Mathieu Desnoyers wrote:
> > plain text document attachment (tracepoints.patch)
> > Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers.
> > Allows complete typing verification. No format string required. See the
> > tracepoint Documentation and Samples patches for usage examples.
> 

Hi Peter,

Thanks for the review,

> I think the patch description (aka changelog, not to be confused with
> the below) could use a lot more attention.. There are a lot of things
> going on in this code non of which are mentioned.
> 
> I often read changelogs when I try to understand a piece of code, this
> one is utterly unfulfilling.
> 

Yes, given that I started from the marker code as a base, I did not
re-explain everything that wasn't changed from those. I guess it's good
to give more details about this though. I'll address this.

> Aside from that, I think the general picture looks good.
> 
> I sprinkled some comments in the code below...
> 

Thanks, let's looks at them,

> > Changelog :
> > - Use #name ":" #proto as string to identify the tracepoint in the
> >   tracepoint table. This will make sure not type mismatch happens due to
> >   connexion of a probe with the wrong type to a tracepoint declared with
> >   the same name in a different header.
> > - Add tracepoint_entry_free_old.
> > 
> > Masami Hiramatsu <mhiramat@...hat.com> :
> > Tested on x86-64.
> > 
> > Performance impact of a tracepoint : same as markers, except that it adds about
> > 70 bytes of instructions in an unlikely branch of each instrumented function
> > (the for loop, the stack setup and the function call). It currently adds a
> > memory read, a test and a conditional branch at the instrumentation site (in the
> > hot path). Immediate values will eventually change this into a load immediate,
> > test and branch, which removes the memory read which will make the i-cache
> > impact smaller (changing the memory read for a load immediate removes 3-4 bytes
> > per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also
> > saves the d-cache hit).
> > 
> > About the performance impact of tracepoints (which is comparable to markers),
> > even without immediate values optimizations, tests done by Hideo Aoki on ia64
> > show no regression. His test case was using hackbench on a kernel where
> > scheduler instrumentation (about 5 events in code scheduler code) was added.
> 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
> > Acked-by: Masami Hiramatsu <mhiramat@...hat.com>
> > CC: 'Peter Zijlstra' <peterz@...radead.org>
> > CC: "Frank Ch. Eigler" <fche@...hat.com>
> > CC: 'Ingo Molnar' <mingo@...e.hu>
> > CC: 'Hideo AOKI' <haoki@...hat.com>
> > CC: Takashi Nishiie <t-nishiie@...css.fujitsu.com>
> > CC: 'Steven Rostedt' <rostedt@...dmis.org>
> > CC: Alexander Viro <viro@...iv.linux.org.uk>
> > CC: Eduard - Gabriel Munteanu <eduard.munteanu@...ux360.ro>
> > ---
> >  include/asm-generic/vmlinux.lds.h |    6 
> >  include/linux/module.h            |   17 +
> >  include/linux/tracepoint.h        |  123 +++++++++
> >  init/Kconfig                      |    7 
> >  kernel/Makefile                   |    1 
> >  kernel/module.c                   |   66 +++++
> >  kernel/tracepoint.c               |  474 ++++++++++++++++++++++++++++++++++++++
> >  7 files changed, 692 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6-lttng/init/Kconfig
> > ===================================================================
> > --- linux-2.6-lttng.orig/init/Kconfig	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/init/Kconfig	2008-07-09 10:55:58.000000000 -0400
> > @@ -782,6 +782,13 @@ config PROFILING
> >  	  Say Y here to enable the extended profiling support mechanisms used
> >  	  by profilers such as OProfile.
> >  
> > +config TRACEPOINTS
> > +	bool "Activate tracepoints"
> > +	default y
> > +	help
> > +	  Place an empty function call at each tracepoint site. Can be
> > +	  dynamically changed for a probe function.
> > +
> >  config MARKERS
> >  	bool "Activate markers"
> >  	help
> > Index: linux-2.6-lttng/kernel/Makefile
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/Makefile	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/kernel/Makefile	2008-07-09 10:55:58.000000000 -0400
> > @@ -77,6 +77,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
> >  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> >  obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
> >  obj-$(CONFIG_MARKERS) += marker.o
> > +obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
> >  obj-$(CONFIG_LATENCYTOP) += latencytop.o
> >  obj-$(CONFIG_FTRACE) += trace/
> >  obj-$(CONFIG_TRACING) += trace/
> > Index: linux-2.6-lttng/include/linux/tracepoint.h
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/include/linux/tracepoint.h	2008-07-09 10:55:58.000000000 -0400
> > @@ -0,0 +1,123 @@
> > +#ifndef _LINUX_TRACEPOINT_H
> > +#define _LINUX_TRACEPOINT_H
> > +
> > +/*
> > + * Kernel Tracepoint API.
> > + *
> > + * See Documentation/tracepoint.txt.
> > + *
> > + * (C) Copyright 2008 Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
> > + *
> > + * Heavily inspired from the Linux Kernel Markers.
> > + *
> > + * This file is released under the GPLv2.
> > + * See the file COPYING for more details.
> > + */
> > +
> > +#include <linux/types.h>
> > +
> > +struct module;
> > +struct tracepoint;
> > +
> > +struct tracepoint {
> > +	const char *name;		/* Tracepoint name */
> > +	int state;			/* State. */
> > +	void **funcs;
> > +} __attribute__((aligned(8)));
> > +
> > +
> > +#define TPPROTO(args...)	args
> > +#define TPARGS(args...)		args
> > +
> > +#ifdef CONFIG_TRACEPOINTS
> > +
> > +#define __DO_TRACE(tp, proto, args)					\
> > +	do {								\
> > +		int i;							\
> > +		void **funcs;						\
> > +		preempt_disable();					\
> > +		funcs = (tp)->funcs;					\
> > +		smp_read_barrier_depends();				\
> > +		if (funcs) {						\
> > +			for (i = 0; funcs[i]; i++) {			\
> 
> can't you get rid of 'i' and write:
> 
>   void **func;
> 
>   preempt_disable();
>   func = (tp)->funcs;
>   smp_read_barrier_depends();
>   for (; func; func++)
>     ((void (*)(proto))func)(args);
>   preempt_enable();
> 

Yes, I though there would be an optimization to do here, I'll use your
proposal. This code snippet is especially important since it will
generate instructions near every tracepoint side. Saving a few bytes
becomes important.

Given that (tp)->funcs references an array of function pointers and that
it can be NULL, the if (funcs) test must still be there and we must use

#define __DO_TRACE(tp, proto, args)					\
	do {								\
		void *func;						\
									\
		preempt_disable();					\
		if ((tp)->funcs) {					\
			func = rcu_dereference((tp)->funcs);		\
			for (; func; func++) {				\
				((void(*)(proto))(func))(args);		\
			}						\
		}							\
		preempt_enable();					\
	} while (0)


The resulting assembly is a bit more dense than my previous
implementation, which is good :

On x86_64 :

 820:   bf 01 00 00 00          mov    $0x1,%edi
 825:   e8 00 00 00 00          callq  82a <thread_return+0x136>
 82a:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 831 <thread_retur
n+0x13d>
 831:   48 85 c0                test   %rax,%rax
 834:   74 22                   je     858 <thread_return+0x164>
 836:   48 8b 18                mov    (%rax),%rbx
 839:   48 85 db                test   %rbx,%rbx
 83c:   74 1a                   je     858 <thread_return+0x164>
 83e:   66 90                   xchg   %ax,%ax
 840:   48 8b 95 68 ff ff ff    mov    -0x98(%rbp),%rdx
 847:   48 8b b5 60 ff ff ff    mov    -0xa0(%rbp),%rsi
 84e:   4c 89 e7                mov    %r12,%rdi
 851:   ff d3                   callq  *%rbx
 853:   48 ff c3                inc    %rbx
 856:   75 e8                   jne    840 <thread_return+0x14c>
 858:   bf 01 00 00 00          mov    $0x1,%edi
 85d:   e8 00 00 00 00          callq  862 <thread_return+0x16e>
 862:

For 66 bytes (a 2 arguments tracepoint). Note that these bytes are
outside of the critical path in an unlikely branch. The branch test
bytes have been discussed thoroughly in the "Immediate Values" work,
which can be seen as an optimisation to be integrated later.

The current branch code added to the critical path is, on x86_64 :

 5ff:   b8 00 00 00 00          mov    $0x0,%eax
 604:   85 c0                   test   %eax,%eax
 606:   0f 85 14 02 00 00       jne    820 <thread_return+0x12c>

The immediate values can make this smaller by using a 2-bytes movb
instructions to populate eax, which would save 3 bytes of cache-hot
cachelines.

> Also, why is the preempt_disable needed?
> 

Addition and removal of tracepoints is synchronized by RCU using the
scheduler (and preempt_disable) as guarantees to find a quiescent state
(this is really RCU "classic"). The update side uses rcu_barrier_sched()
with call_rcu_sched() and the read/execute side uses
"preempt_disable()/preempt_enable()".


> > +				((void(*)(proto))(funcs[i]))(args);	\
> > +			}						\
> > +		}							\
> > +		preempt_enable();					\
> > +	} while (0)
> > +
> > +/*
> > + * Make sure the alignment of the structure in the __tracepoints section will
> > + * not add unwanted padding between the beginning of the section and the
> > + * structure. Force alignment to the same alignment as the section start.
> > + */
> > +#define DEFINE_TRACE(name, proto, args)					\
> > +	static inline void trace_##name(proto)				\
> > +	{								\
> > +		static const char __tpstrtab_##name[]			\
> > +		__attribute__((section("__tracepoints_strings")))	\
> > +		= #name ":" #proto;					\
> > +		static struct tracepoint __tracepoint_##name		\
> > +		__attribute__((section("__tracepoints"), aligned(8))) =	\
> > +		{ __tpstrtab_##name, 0, NULL };				\
> > +		if (unlikely(__tracepoint_##name.state))		\
> > +			__DO_TRACE(&__tracepoint_##name,		\
> > +				TPPROTO(proto), TPARGS(args));		\
> > +	}								\
> > +	static inline int register_trace_##name(void (*probe)(proto))	\
> > +	{								\
> > +		return tracepoint_probe_register(#name ":" #proto,	\
> > +			(void *)probe);					\
> > +	}								\
> > +	static inline void unregister_trace_##name(void (*probe)(proto))\
> > +	{								\
> > +		tracepoint_probe_unregister(#name ":" #proto,		\
> > +			(void *)probe);					\
> > +	}
> > +
> > +extern void tracepoint_update_probe_range(struct tracepoint *begin,
> > +	struct tracepoint *end);
> > +
> > +#else /* !CONFIG_TRACEPOINTS */
> > +#define DEFINE_TRACE(name, proto, args)			\
> > +	static inline void _do_trace_##name(struct tracepoint *tp, proto) \
> > +	{ }								\
> > +	static inline void trace_##name(proto)				\
> > +	{ }								\
> > +	static inline int register_trace_##name(void (*probe)(proto))	\
> > +	{								\
> > +		return -ENOSYS;						\
> > +	}								\
> > +	static inline void unregister_trace_##name(void (*probe)(proto))\
> > +	{ }
> > +
> > +static inline void tracepoint_update_probe_range(struct tracepoint *begin,
> > +	struct tracepoint *end)
> > +{ }
> > +#endif /* CONFIG_TRACEPOINTS */
> > +
> > +/*
> > + * Connect a probe to a tracepoint.
> > + * Internal API, should not be used directly.
> > + */
> > +extern int tracepoint_probe_register(const char *name, void *probe);
> > +
> > +/*
> > + * Disconnect a probe from a tracepoint.
> > + * Internal API, should not be used directly.
> > + */
> > +extern int tracepoint_probe_unregister(const char *name, void *probe);
> > +
> > +struct tracepoint_iter {
> > +	struct module *module;
> > +	struct tracepoint *tracepoint;
> > +};
> > +
> > +extern void tracepoint_iter_start(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_next(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
> > +extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
> > +extern int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> > +	struct tracepoint *begin, struct tracepoint *end);
> > +
> > +#endif
> > Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2008-07-09 10:55:58.000000000 -0400
> > @@ -52,7 +52,10 @@
> >  	. = ALIGN(8);							\
> >  	VMLINUX_SYMBOL(__start___markers) = .;				\
> >  	*(__markers)							\
> > -	VMLINUX_SYMBOL(__stop___markers) = .;
> > +	VMLINUX_SYMBOL(__stop___markers) = .;				\
> > +	VMLINUX_SYMBOL(__start___tracepoints) = .;			\
> > +	*(__tracepoints)						\
> > +	VMLINUX_SYMBOL(__stop___tracepoints) = .;
> >  
> >  #define RO_DATA(align)							\
> >  	. = ALIGN((align));						\
> > @@ -61,6 +64,7 @@
> >  		*(.rodata) *(.rodata.*)					\
> >  		*(__vermagic)		/* Kernel version magic */	\
> >  		*(__markers_strings)	/* Markers: strings */		\
> > +		*(__tracepoints_strings)/* Tracepoints: strings */	\
> >  	}								\
> >  									\
> >  	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\
> > Index: linux-2.6-lttng/kernel/tracepoint.c
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/kernel/tracepoint.c	2008-07-09 10:55:58.000000000 -0400
> > @@ -0,0 +1,474 @@
> > +/*
> > + * Copyright (C) 2008 Mathieu Desnoyers
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + */
> > +#include <linux/module.h>
> > +#include <linux/mutex.h>
> > +#include <linux/types.h>
> > +#include <linux/jhash.h>
> > +#include <linux/list.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/tracepoint.h>
> > +#include <linux/err.h>
> > +#include <linux/slab.h>
> > +
> > +extern struct tracepoint __start___tracepoints[];
> > +extern struct tracepoint __stop___tracepoints[];
> > +
> > +/* Set to 1 to enable tracepoint debug output */
> > +static const int tracepoint_debug;
> > +
> > +/*
> > + * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
> > + * builtin and module tracepoints and the hash table.
> > + */
> > +static DEFINE_MUTEX(tracepoints_mutex);
> > +
> > +/*
> > + * Tracepoint hash table, containing the active tracepoints.
> > + * Protected by tracepoints_mutex.
> > + */
> > +#define TRACEPOINT_HASH_BITS 6
> > +#define TRACEPOINT_TABLE_SIZE (1 << TRACEPOINT_HASH_BITS)
> > +
> > +/*
> > + * Note about RCU :
> > + * It is used to to delay the free of multiple probes array until a quiescent
> > + * state is reached.
> > + * Tracepoint entries modifications are protected by the tracepoints_mutex.
> > + */
> > +struct tracepoint_entry {
> > +	struct hlist_node hlist;
> > +	void **funcs;
> > +	int refcount;	/* Number of times armed. 0 if disarmed. */
> > +	struct rcu_head rcu;
> > +	void *oldptr;
> > +	unsigned char rcu_pending:1;
> > +	char name[0];
> > +};
> > +
> > +static struct hlist_head tracepoint_table[TRACEPOINT_TABLE_SIZE];
> > +
> > +static void free_old_closure(struct rcu_head *head)
> > +{
> > +	struct tracepoint_entry *entry = container_of(head,
> > +		struct tracepoint_entry, rcu);
> > +	kfree(entry->oldptr);
> > +	/* Make sure we free the data before setting the pending flag to 0 */
> > +	smp_wmb();
> > +	entry->rcu_pending = 0;
> > +}
> > +
> > +static void tracepoint_entry_free_old(struct tracepoint_entry *entry, void *old)
> > +{
> > +	if (!old)
> > +		return;
> > +	entry->oldptr = old;
> > +	entry->rcu_pending = 1;
> > +	/* write rcu_pending before calling the RCU callback */
> > +	smp_wmb();
> > +#ifdef CONFIG_PREEMPT_RCU
> > +	synchronize_sched();	/* Until we have the call_rcu_sched() */
> > +#endif
> 
> Does this have something to do with the preempt_disable above?
> 

Yes, it does. We make sure the previous array containing probes, which
has been scheduled for deletion by the rcu callback, is indeed freed
before we proceed to the next update. It therefore limits the rate of
modification of a single tracepoint to one update per RCU period. The
objective here is to permit fast batch add/removal of probes on
_different_ tracepoints.

This use of "synchronize_sched()" can be changed for call_rcu_sched() in
linux-next, I'll fix this.


> > +	call_rcu(&entry->rcu, free_old_closure);
> > +}
> > +
> > +static void debug_print_probes(struct tracepoint_entry *entry)
> > +{
> > +	int i;
> > +
> > +	if (!tracepoint_debug)
> > +		return;
> > +
> > +	for (i = 0; entry->funcs[i]; i++)
> > +		printk(KERN_DEBUG "Probe %d : %p\n", i, entry->funcs[i]);
> > +}
> > +
> > +static void *
> > +tracepoint_entry_add_probe(struct tracepoint_entry *entry, void *probe)
> > +{
> > +	int nr_probes = 0;
> > +	void **old, **new;
> > +
> > +	WARN_ON(!probe);
> > +
> > +	debug_print_probes(entry);
> > +	old = entry->funcs;
> > +	if (old) {
> > +		/* (N -> N+1), (N != 0, 1) probes */
> > +		for (nr_probes = 0; old[nr_probes]; nr_probes++)
> > +			if (old[nr_probes] == probe)
> > +				return ERR_PTR(-EBUSY);
> 
> -EEXIST ?
> 

good point.

> > +	}
> > +	/* + 2 : one for new probe, one for NULL func */
> > +	new = kzalloc((nr_probes + 2) * sizeof(void *), GFP_KERNEL);
> > +	if (new == NULL)
> > +		return ERR_PTR(-ENOMEM);
> > +	if (old)
> > +		memcpy(new, old, nr_probes * sizeof(void *));
> > +	new[nr_probes] = probe;
> > +	entry->refcount = nr_probes + 1;
> > +	entry->funcs = new;
> > +	debug_print_probes(entry);
> > +	return old;
> > +}
> > +
> > +static void *
> > +tracepoint_entry_remove_probe(struct tracepoint_entry *entry, void *probe)
> > +{
> > +	int nr_probes = 0, nr_del = 0, i;
> > +	void **old, **new;
> > +
> > +	old = entry->funcs;
> > +
> > +	debug_print_probes(entry);
> > +	/* (N -> M), (N > 1, M >= 0) probes */
> > +	for (nr_probes = 0; old[nr_probes]; nr_probes++) {
> > +		if ((!probe || old[nr_probes] == probe))
> > +			nr_del++;
> > +	}
> > +
> > +	if (nr_probes - nr_del == 0) {
> > +		/* N -> 0, (N > 1) */
> > +		entry->funcs = NULL;
> > +		entry->refcount = 0;
> > +		debug_print_probes(entry);
> > +		return old;
> > +	} else {
> > +		int j = 0;
> > +		/* N -> M, (N > 1, M > 0) */
> > +		/* + 1 for NULL */
> > +		new = kzalloc((nr_probes - nr_del + 1)
> > +			* sizeof(void *), GFP_KERNEL);
> > +		if (new == NULL)
> > +			return ERR_PTR(-ENOMEM);
> > +		for (i = 0; old[i]; i++)
> > +			if ((probe && old[i] != probe))
> > +				new[j++] = old[i];
> > +		entry->refcount = nr_probes - nr_del;
> > +		entry->funcs = new;
> > +	}
> > +	debug_print_probes(entry);
> > +	return old;
> > +}
> > +
> > +/*
> > + * Get tracepoint if the tracepoint is present in the tracepoint hash table.
> > + * Must be called with tracepoints_mutex held.
> > + * Returns NULL if not present.
> > + */
> > +static struct tracepoint_entry *get_tracepoint(const char *name)
> > +{
> > +	struct hlist_head *head;
> > +	struct hlist_node *node;
> > +	struct tracepoint_entry *e;
> > +	u32 hash = jhash(name, strlen(name), 0);
> > +
> > +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > +	hlist_for_each_entry(e, node, head, hlist) {
> > +		if (!strcmp(name, e->name))
> > +			return e;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/*
> > + * Add the tracepoint to the tracepoint hash table. Must be called with
> > + * tracepoints_mutex held.
> > + */
> > +static struct tracepoint_entry *add_tracepoint(const char *name)
> > +{
> > +	struct hlist_head *head;
> > +	struct hlist_node *node;
> > +	struct tracepoint_entry *e;
> > +	size_t name_len = strlen(name) + 1;
> > +	u32 hash = jhash(name, name_len-1, 0);
> > +
> > +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > +	hlist_for_each_entry(e, node, head, hlist) {
> > +		if (!strcmp(name, e->name)) {
> > +			printk(KERN_NOTICE
> > +				"tracepoint %s busy\n", name);
> > +			return ERR_PTR(-EBUSY);	/* Already there */
> 
> -EEXIST
> 

Yes.

> > +		}
> > +	}
> > +	/*
> > +	 * Using kmalloc here to allocate a variable length element. Could
> > +	 * cause some memory fragmentation if overused.
> > +	 */
> > +	e = kmalloc(sizeof(struct tracepoint_entry) + name_len, GFP_KERNEL);
> > +	if (!e)
> > +		return ERR_PTR(-ENOMEM);
> > +	memcpy(&e->name[0], name, name_len);
> > +	e->funcs = NULL;
> > +	e->refcount = 0;
> > +	e->rcu_pending = 0;
> > +	hlist_add_head(&e->hlist, head);
> > +	return e;
> > +}
> > +
> > +/*
> > + * Remove the tracepoint from the tracepoint hash table. Must be called with
> > + * mutex_lock held.
> > + */
> > +static int remove_tracepoint(const char *name)
> > +{
> > +	struct hlist_head *head;
> > +	struct hlist_node *node;
> > +	struct tracepoint_entry *e;
> > +	int found = 0;
> > +	size_t len = strlen(name) + 1;
> > +	u32 hash = jhash(name, len-1, 0);
> > +
> > +	head = &tracepoint_table[hash & ((1 << TRACEPOINT_HASH_BITS)-1)];
> > +	hlist_for_each_entry(e, node, head, hlist) {
> > +		if (!strcmp(name, e->name)) {
> > +			found = 1;
> > +			break;
> > +		}
> > +	}
> > +	if (!found)
> > +		return -ENOENT;
> > +	if (e->refcount)
> > +		return -EBUSY;
> 
> ok, this really is busy.
> 
> > +	hlist_del(&e->hlist);
> > +	/* Make sure the call_rcu has been executed */
> > +	if (e->rcu_pending)
> > +		rcu_barrier();
> > +	kfree(e);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Sets the probe callback corresponding to one tracepoint.
> > + */
> > +static void set_tracepoint(struct tracepoint_entry **entry,
> > +	struct tracepoint *elem, int active)
> > +{
> > +	WARN_ON(strcmp((*entry)->name, elem->name) != 0);
> > +
> > +	smp_wmb();
> > +	/*
> > +	 * We also make sure that the new probe callbacks array is consistent
> > +	 * before setting a pointer to it.
> > +	 */
> > +	rcu_assign_pointer(elem->funcs, (*entry)->funcs);
> 
> rcu_assign_pointer() already does that wmb !?
> Also, its polite to reference the pairing site in the barrier comment.
> 

Good point. I'll then remove the wmb, and change the
smp_read_barrier_depends for a rcu_dereference in __DO_TRACE. It will
make things clearer.

The comments becomes :

        /*
         * rcu_assign_pointer has a smp_wmb() which makes sure that the new
         * probe callbacks array is consistent before setting a pointer to it.
         * This array is referenced by __DO_TRACE from
         * include/linux/tracepoints.h. A matching rcu_dereference() is used.
         */

I'll release a new version including those changes shortly,

Thanks,

Mathieu

> > +	elem->state = active;
> > +}
> > +
> > +/*
> > + * Disable a tracepoint and its probe callback.
> > + * Note: only waiting an RCU period after setting elem->call to the empty
> > + * function insures that the original callback is not used anymore. This insured
> > + * by preempt_disable around the call site.
> > + */
> > +static void disable_tracepoint(struct tracepoint *elem)
> > +{
> > +	elem->state = 0;
> > +}
> > +
> > +/**
> > + * tracepoint_update_probe_range - Update a probe range
> > + * @begin: beginning of the range
> > + * @end: end of the range
> > + *
> > + * Updates the probe callback corresponding to a range of tracepoints.
> > + */
> > +void tracepoint_update_probe_range(struct tracepoint *begin,
> > +	struct tracepoint *end)
> > +{
> > +	struct tracepoint *iter;
> > +	struct tracepoint_entry *mark_entry;
> > +
> > +	mutex_lock(&tracepoints_mutex);
> > +	for (iter = begin; iter < end; iter++) {
> > +		mark_entry = get_tracepoint(iter->name);
> > +		if (mark_entry) {
> > +			set_tracepoint(&mark_entry, iter,
> > +					!!mark_entry->refcount);
> > +		} else {
> > +			disable_tracepoint(iter);
> > +		}
> > +	}
> > +	mutex_unlock(&tracepoints_mutex);
> > +}
> > +
> > +/*
> > + * Update probes, removing the faulty probes.
> > + */
> > +static void tracepoint_update_probes(void)
> > +{
> > +	/* Core kernel tracepoints */
> > +	tracepoint_update_probe_range(__start___tracepoints,
> > +		__stop___tracepoints);
> > +	/* tracepoints in modules. */
> > +	module_update_tracepoints();
> > +}
> > +
> > +/**
> > + * tracepoint_probe_register -  Connect a probe to a tracepoint
> > + * @name: tracepoint name
> > + * @probe: probe handler
> > + *
> > + * Returns 0 if ok, error value on error.
> > + * The probe address must at least be aligned on the architecture pointer size.
> > + */
> > +int tracepoint_probe_register(const char *name, void *probe)
> > +{
> > +	struct tracepoint_entry *entry;
> > +	int ret = 0;
> > +	void *old;
> > +
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	if (!entry) {
> > +		entry = add_tracepoint(name);
> > +		if (IS_ERR(entry)) {
> > +			ret = PTR_ERR(entry);
> > +			goto end;
> > +		}
> > +	}
> > +	/*
> > +	 * If we detect that a call_rcu is pending for this tracepoint,
> > +	 * make sure it's executed now.
> > +	 */
> > +	if (entry->rcu_pending)
> > +		rcu_barrier();
> > +	old = tracepoint_entry_add_probe(entry, probe);
> > +	if (IS_ERR(old)) {
> > +		ret = PTR_ERR(old);
> > +		goto end;
> > +	}
> > +	mutex_unlock(&tracepoints_mutex);
> > +	tracepoint_update_probes();		/* may update entry */
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	WARN_ON(!entry);
> > +	tracepoint_entry_free_old(entry, old);
> > +end:
> > +	mutex_unlock(&tracepoints_mutex);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_probe_register);
> > +
> > +/**
> > + * tracepoint_probe_unregister -  Disconnect a probe from a tracepoint
> > + * @name: tracepoint name
> > + * @probe: probe function pointer
> > + *
> > + * We do not need to call a synchronize_sched to make sure the probes have
> > + * finished running before doing a module unload, because the module unload
> > + * itself uses stop_machine(), which insures that every preempt disabled section
> > + * have finished.
> > + */
> > +int tracepoint_probe_unregister(const char *name, void *probe)
> > +{
> > +	struct tracepoint_entry *entry;
> > +	void *old;
> > +	int ret = -ENOENT;
> > +
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	if (!entry)
> > +		goto end;
> > +	if (entry->rcu_pending)
> > +		rcu_barrier();
> > +	old = tracepoint_entry_remove_probe(entry, probe);
> > +	mutex_unlock(&tracepoints_mutex);
> > +	tracepoint_update_probes();		/* may update entry */
> > +	mutex_lock(&tracepoints_mutex);
> > +	entry = get_tracepoint(name);
> > +	if (!entry)
> > +		goto end;
> > +	tracepoint_entry_free_old(entry, old);
> > +	remove_tracepoint(name);	/* Ignore busy error message */
> > +	ret = 0;
> > +end:
> > +	mutex_unlock(&tracepoints_mutex);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_probe_unregister);
> > +
> > +/**
> > + * tracepoint_get_iter_range - Get a next tracepoint iterator given a range.
> > + * @tracepoint: current tracepoints (in), next tracepoint (out)
> > + * @begin: beginning of the range
> > + * @end: end of the range
> > + *
> > + * Returns whether a next tracepoint has been found (1) or not (0).
> > + * Will return the first tracepoint in the range if the input tracepoint is
> > + * NULL.
> > + */
> > +int tracepoint_get_iter_range(struct tracepoint **tracepoint,
> > +	struct tracepoint *begin, struct tracepoint *end)
> > +{
> > +	if (!*tracepoint && begin != end) {
> > +		*tracepoint = begin;
> > +		return 1;
> > +	}
> > +	if (*tracepoint >= begin && *tracepoint < end)
> > +		return 1;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
> > +
> > +static void tracepoint_get_iter(struct tracepoint_iter *iter)
> > +{
> > +	int found = 0;
> > +
> > +	/* Core kernel tracepoints */
> > +	if (!iter->module) {
> > +		found = tracepoint_get_iter_range(&iter->tracepoint,
> > +				__start___tracepoints, __stop___tracepoints);
> > +		if (found)
> > +			goto end;
> > +	}
> > +	/* tracepoints in modules. */
> > +	found = module_get_iter_tracepoints(iter);
> > +end:
> > +	if (!found)
> > +		tracepoint_iter_reset(iter);
> > +}
> > +
> > +void tracepoint_iter_start(struct tracepoint_iter *iter)
> > +{
> > +	tracepoint_get_iter(iter);
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_start);
> > +
> > +void tracepoint_iter_next(struct tracepoint_iter *iter)
> > +{
> > +	iter->tracepoint++;
> > +	/*
> > +	 * iter->tracepoint may be invalid because we blindly incremented it.
> > +	 * Make sure it is valid by marshalling on the tracepoints, getting the
> > +	 * tracepoints from following modules if necessary.
> > +	 */
> > +	tracepoint_get_iter(iter);
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_next);
> > +
> > +void tracepoint_iter_stop(struct tracepoint_iter *iter)
> > +{
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
> > +
> > +void tracepoint_iter_reset(struct tracepoint_iter *iter)
> > +{
> > +	iter->module = NULL;
> > +	iter->tracepoint = NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
> > Index: linux-2.6-lttng/kernel/module.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/module.c	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/kernel/module.c	2008-07-09 10:55:58.000000000 -0400
> > @@ -47,6 +47,7 @@
> >  #include <asm/sections.h>
> >  #include <linux/license.h>
> >  #include <asm/sections.h>
> > +#include <linux/tracepoint.h>
> >  
> >  #if 0
> >  #define DEBUGP printk
> > @@ -1824,6 +1825,8 @@ static struct module *load_module(void _
> >  #endif
> >  	unsigned int markersindex;
> >  	unsigned int markersstringsindex;
> > +	unsigned int tracepointsindex;
> > +	unsigned int tracepointsstringsindex;
> >  	struct module *mod;
> >  	long err = 0;
> >  	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
> > @@ -2110,6 +2113,9 @@ static struct module *load_module(void _
> >  	markersindex = find_sec(hdr, sechdrs, secstrings, "__markers");
> >   	markersstringsindex = find_sec(hdr, sechdrs, secstrings,
> >  					"__markers_strings");
> > +	tracepointsindex = find_sec(hdr, sechdrs, secstrings, "__tracepoints");
> > +	tracepointsstringsindex = find_sec(hdr, sechdrs, secstrings,
> > +					"__tracepoints_strings");
> >  
> >  	/* Now do relocations. */
> >  	for (i = 1; i < hdr->e_shnum; i++) {
> > @@ -2137,6 +2143,12 @@ static struct module *load_module(void _
> >  	mod->num_markers =
> >  		sechdrs[markersindex].sh_size / sizeof(*mod->markers);
> >  #endif
> > +#ifdef CONFIG_TRACEPOINTS
> > +	mod->tracepoints = (void *)sechdrs[tracepointsindex].sh_addr;
> > +	mod->num_tracepoints =
> > +		sechdrs[tracepointsindex].sh_size / sizeof(*mod->tracepoints);
> > +#endif
> > +
> >  
> >          /* Find duplicate symbols */
> >  	err = verify_export_symbols(mod);
> > @@ -2155,11 +2167,16 @@ static struct module *load_module(void _
> >  
> >  	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
> >  
> > +	if (!mod->taints) {
> >  #ifdef CONFIG_MARKERS
> > -	if (!mod->taints)
> >  		marker_update_probe_range(mod->markers,
> >  			mod->markers + mod->num_markers);
> >  #endif
> > +#ifdef CONFIG_TRACEPOINTS
> > +		tracepoint_update_probe_range(mod->tracepoints,
> > +			mod->tracepoints + mod->num_tracepoints);
> > +#endif
> > +	}
> >  	err = module_finalize(hdr, sechdrs, mod);
> >  	if (err < 0)
> >  		goto cleanup;
> > @@ -2710,3 +2727,50 @@ void module_update_markers(void)
> >  	mutex_unlock(&module_mutex);
> >  }
> >  #endif
> > +
> > +#ifdef CONFIG_TRACEPOINTS
> > +void module_update_tracepoints(void)
> > +{
> > +	struct module *mod;
> > +
> > +	mutex_lock(&module_mutex);
> > +	list_for_each_entry(mod, &modules, list)
> > +		if (!mod->taints)
> > +			tracepoint_update_probe_range(mod->tracepoints,
> > +				mod->tracepoints + mod->num_tracepoints);
> > +	mutex_unlock(&module_mutex);
> > +}
> > +
> > +/*
> > + * Returns 0 if current not found.
> > + * Returns 1 if current found.
> > + */
> > +int module_get_iter_tracepoints(struct tracepoint_iter *iter)
> > +{
> > +	struct module *iter_mod;
> > +	int found = 0;
> > +
> > +	mutex_lock(&module_mutex);
> > +	list_for_each_entry(iter_mod, &modules, list) {
> > +		if (!iter_mod->taints) {
> > +			/*
> > +			 * Sorted module list
> > +			 */
> > +			if (iter_mod < iter->module)
> > +				continue;
> > +			else if (iter_mod > iter->module)
> > +				iter->tracepoint = NULL;
> > +			found = tracepoint_get_iter_range(&iter->tracepoint,
> > +				iter_mod->tracepoints,
> > +				iter_mod->tracepoints
> > +					+ iter_mod->num_tracepoints);
> > +			if (found) {
> > +				iter->module = iter_mod;
> > +				break;
> > +			}
> > +		}
> > +	}
> > +	mutex_unlock(&module_mutex);
> > +	return found;
> > +}
> > +#endif
> > Index: linux-2.6-lttng/include/linux/module.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/linux/module.h	2008-07-09 10:55:46.000000000 -0400
> > +++ linux-2.6-lttng/include/linux/module.h	2008-07-09 10:57:22.000000000 -0400
> > @@ -16,6 +16,7 @@
> >  #include <linux/kobject.h>
> >  #include <linux/moduleparam.h>
> >  #include <linux/marker.h>
> > +#include <linux/tracepoint.h>
> >  #include <asm/local.h>
> >  
> >  #include <asm/module.h>
> > @@ -331,6 +332,10 @@ struct module
> >  	struct marker *markers;
> >  	unsigned int num_markers;
> >  #endif
> > +#ifdef CONFIG_TRACEPOINTS
> > +	struct tracepoint *tracepoints;
> > +	unsigned int num_tracepoints;
> > +#endif
> >  
> >  #ifdef CONFIG_MODULE_UNLOAD
> >  	/* What modules depend on me? */
> > @@ -454,6 +459,9 @@ extern void print_modules(void);
> >  
> >  extern void module_update_markers(void);
> >  
> > +extern void module_update_tracepoints(void);
> > +extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
> > +
> >  #else /* !CONFIG_MODULES... */
> >  #define EXPORT_SYMBOL(sym)
> >  #define EXPORT_SYMBOL_GPL(sym)
> > @@ -558,6 +566,15 @@ static inline void module_update_markers
> >  {
> >  }
> >  
> > +static inline void module_update_tracepoints(void)
> > +{
> > +}
> > +
> > +static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
> > +{
> > +	return 0;
> > +}
> > +
> >  #endif /* CONFIG_MODULES */
> >  
> >  struct device_driver;
> > 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ