lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <996087648.185312.1424977593755.JavaMail.zimbra@efficios.com>
Date:	Thu, 26 Feb 2015 19:06:33 +0000 (UTC)
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	paulmck@...ux.vnet.ibm.com
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Andi Kleen <ak@...ux.intel.com>,
	Andi Kleen <andi@...stfloor.org>, x86@...nel.org,
	linux-kernel@...r.kernel.org, oleg@...hat.com,
	rusty@...tcorp.com.au, mingo@...nel.org
Subject: Re: [RFC][PATCH] module: Optimize __module_address() using a
 latched RB-tree

----- Original Message -----
> From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> To: "Peter Zijlstra" <peterz@...radead.org>
> Cc: "Mathieu Desnoyers" <mathieu.desnoyers@...icios.com>, "Andi Kleen" <ak@...ux.intel.com>, "Andi Kleen"
> <andi@...stfloor.org>, x86@...nel.org, linux-kernel@...r.kernel.org, oleg@...hat.com, rusty@...tcorp.com.au,
> mingo@...nel.org
> Sent: Thursday, February 26, 2015 1:28:17 PM
> Subject: Re: [RFC][PATCH] module: Optimize __module_address() using a latched RB-tree
> 
> On Thu, Feb 26, 2015 at 05:43:56PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 26, 2015 at 04:02:43PM +0000, Mathieu Desnoyers wrote:
> > > Perhaps you could use mod_value() below, and introduce a
> > > "mod_size()" too. This would keep the init vs core selection
> > > out of the traversal code.
> > 
> > Indeed!
> > 
> > > Is it customary to define static variables in the
> > > middle of a C file ?
> > 
> > Dunno, it seemed like a good a place as any.
> > 
> > > The rest looks good, especially for use of the latch.
> > > I'd be tempted to turn "0, 1, 2, 3" into an enum though,
> > > so we can follow in the code what each of those array
> > > entry really means.
> > 
> > Agreed.
> > 
> > ---
> > Subject: module: Optimize __module_address() using a latched RB-tree
> > From: Peter Zijlstra <peterz@...radead.org>
> > Date: Thu Feb 26 10:57:34 CET 2015
> > 
> > Currently __module_address() is using a linear search through all
> > modules in order to find the module corresponding to the provided
> > address. With a lot of modules this can take a lot of time.
> > 
> > One of the users of this is __kernel_text_address() which is employed
> > in many stack unwinders; which in turn are used by perf-callchain and
> > ftrace (possibly from NMI context).
> > 
> > So by optimizing __module_address() we optimize many stack unwinders
> > which are used by both perf and tracing in performance sensitive code.
> > 
> > Cc: Oleg Nesterov <oleg@...hat.com>
> > Cc: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> > Cc: Rusty Russell <rusty@...tcorp.com.au>
> > Cc: Steven Rostedt <rostedt@...dmis.org>
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> 
> Color me confused, both by the existing code and the modifications.
> 
> It appears that you are using seqlock to force readers to retry when
> a concurrent update occurs, but I don't see what is ensuring that the
> readers see good data when racing with an insertion or a deletion.  Yes,
> the reader will be retried, courtesy of the seqlock, but that won't help
> if the reader segfaults before it gets to the read_seqcount_retry().

Good point! This scheme works when readers have no "side effect" due
to reading a data structure while it is changing. Dereferencing a
pointer is indeed a side-effect which could lead to a segfault.

> 
> Questions below.
> 
> > ---
> >  include/linux/module.h |   19 +++-
> >  kernel/module.c        |  212
> >  +++++++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 224 insertions(+), 7 deletions(-)
> > 
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -17,6 +17,7 @@
> >  #include <linux/moduleparam.h>
> >  #include <linux/jump_label.h>
> >  #include <linux/export.h>
> > +#include <linux/rbtree.h>
> > 
> >  #include <linux/percpu.h>
> >  #include <asm/module.h>
> > @@ -210,6 +211,11 @@ enum module_state {
> >  	MODULE_STATE_UNFORMED,	/* Still setting it up. */
> >  };
> > 
> > +struct module_node {
> > +	struct module *mod;
> > +	struct rb_node node;
> > +};
> > +
> >  struct module {
> >  	enum module_state state;
> > 
> > @@ -269,8 +275,15 @@ struct module {
> >  	/* Startup function. */
> >  	int (*init)(void);
> > 
> > -	/* If this is non-NULL, vfree after init() returns */
> > -	void *module_init;
> > +	/*
> > +	 * If this is non-NULL, vfree after init() returns
> > +	 *
> > +	 * cacheline align here, such that:
> > +	 *   module_init, module_core, init_size, core_size and
> > +	 *   tree_node[0]
> > +	 * are on the same cacheline.
> > +	 */
> > +	void *module_init	____cacheline_aligned;
> > 
> >  	/* Here is the actual code + data, vfree'd on unload. */
> >  	void *module_core;
> > @@ -281,6 +294,8 @@ struct module {
> >  	/* The size of the executable code in each section.  */
> >  	unsigned int init_text_size, core_text_size;
> > 
> > +	struct module_node tree_node[4];
> > +
> >  	/* Size of RO sections of the module (text+rodata) */
> >  	unsigned int init_ro_size, core_ro_size;
> > 
> > --- a/kernel/module.c
> > +++ b/kernel/module.c
> > @@ -102,6 +102,204 @@
> >  DEFINE_MUTEX(module_mutex);
> >  EXPORT_SYMBOL_GPL(module_mutex);
> >  static LIST_HEAD(modules);
> > +
> > +
> > +/*
> > + * Use a latched RB-tree for __module_address()
> > + *
> > + * The latch concept is a multi-value concurrency data structure which
> > allows
> > + * unserialized access and guarantees at least one version is stable.
> > + *
> > + * It is employed here to optimize __module_address(), which needs to find
> > the
> > + * module (if any) associated with an address. Such questions are best
> > answered
> > + * using a binary search tree.
> > + *
> > + * Since modules use non-overlapping memory ranges we can use a regular
> > RB-tree
> > + * (as opposed to the interval tree). However, BSTs cannot be iterated
> > while
> > + * modified.
> > + *
> > + * To overcome this we use the latched RB-tree, it basically consists of
> > two
> > + * RB-trees which are modified in order, ensuring one is always
> > consistent.
> > + *
> > + * Things are somewhat more complicated because there are two ranges per
> > + * module, but other than that its pretty straight forward.
> > + *
> > + */
> > +
> > +enum {
> > +	latch0_core = 0,
> > +	latch1_core = 1,
> > +	latch0_init = 2,
> > +	latch1_init = 3,
> > +};
> > +
> > +enum {
> > +	latch_bit = 0x01,
> > +	init_bit  = 0x02,
> > +};
> > +
> > +struct latch_tree_root {
> > +	seqcount_t	seq;
> > +	struct rb_root	tree[2];
> > +};
> > +
> > +static unsigned long mod_value(struct module *mod, int idx)
> > +{
> > +	if (idx & init_bit)
> > +		return (unsigned long)mod->module_init;
> > +	else
> > +		return (unsigned long)mod->module_core;
> > +}
> > +
> > +static unsigned long mod_size(struct module *mod, int idx)
> > +{
> > +	if (idx & init_bit)
> > +		return mod->init_size;
> > +	else
> > +		return mod->core_size;
> > +}
> > +
> > +static struct module *mod_entry(struct rb_node *n)
> > +{
> > +	struct module_node *mn = container_of(n, struct module_node, node);
> > +	return mn->mod;
> > +}
> > +
> > +static int mod_node_idx(struct module *m, struct rb_node *n)
> > +{
> > +	struct module_node *mn = container_of(n, struct module_node, node);
> > +	int idx = mn - m->tree_node;
> > +
> > +	BUG_ON(mn->mod != m);
> > +	BUG_ON((unsigned)idx > 3);
> > +
> > +	return idx;
> > +}
> > +
> > +static void __tree_insert(struct latch_tree_root *mod_tree, struct module
> > *mod, int idx)
> > +{
> > +	struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > +	struct module_node *mn = &mod->tree_node[idx];
> > +	struct rb_node **link = &root->rb_node;
> > +	struct rb_node *parent = NULL;
> > +	unsigned long mod_val, m_val;
> > +	struct module *m;
> > +	int i;
> > +
> > +	mn->mod = mod;
> > +	mod_val = mod_value(mod, idx);
> > +
> > +	while (*link) {
> > +		parent = *link;
> > +		m = mod_entry(parent);
> > +		i = mod_node_idx(m, parent);
> > +		m_val = mod_value(m, i);
> > +
> > +		if (mod_val < m_val)
> > +			link = &parent->rb_left;
> > +		else
> > +			link = &parent->rb_right;
> > +	}
> > +
> > +	rb_link_node(&mn->node, parent, link);
> 
> This makes the new module visible to readers, if I understand correctly.
> There needs to be a memory barrier between initialization and this call
> to rb_link_node(), otherwise, both the CPU (well, for Alpha) and the
> compiler (for everyone) can reorder, which could result in some hapless
> reader seeing pre-initialized data.
> 
> Or did I miss the memory barrier?

Given that readers (which will retry eventually) can indeed
try to concurrently access the object, those would be needed.

> 
> > +	rb_insert_color(&mn->node, root);
> 
> This -might- be OK -- the rotations do not make new nodes visible,
> instead simply rearranging nodes that are already visible.
> 
> For both rb_link_node() and rb_insert_color(), given that we are updating
> pointers while readers are traversing them, READ_ONCE() and WRITE_ONCE()
> would be good.  Yes, the compiler -probably- doesn't mess you up, but it
> would be completely within its rights to do so.  :-/
> 
> Alpha would like either rcu_dereference() or lockless_dereference()
> instead of READ_ONCE(), of course!

Yes, probably better to err on the safe side here. We have to
take into account that a reader can observe the data structure
while being modified, but if this occurs, it will retry (but should
not segfault nor loop endlessly).

> 
> > +}
> > +
> > +static void __tree_remove(struct latch_tree_root *mod_tree, struct module
> > *mod, int idx)
> > +{
> > +	struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > +	struct module_node *mn = &mod->tree_node[idx];
> > +
> > +	rb_erase(&mn->node, root);
> 
> This is just removing and not freeing, so OK from a read-side viewpoint.
> WRITE_ONCE() would be good.

I'd be worried about the eventual free after this function.
There is a need to use RCU here to delay reclaim.

> 
> > +}
> > +
> > +/*
> > + * struct module is arranged such that:
> > + *
> > + *   module_init, module_core, init_size, core_size,
> > + *   init_text_size, core_text_size and tree_node[0]
> > + *
> > + * are on the same cacheline, therefore if the below iteration is
> > + * on latch0 and all module init has completed, we'll only hit
> > + * tree_node[0] and every intermediate level will hit only a single
> > + * cacheline.
> > + *
> > + * Furthermore, by ensuring init_text_size and core_text_size are
> > + * also in this same cacheline we've made sure is_module_text_address()
> > + * will also not require additional lines.
> > + */
> > +static struct module *__tree_find(struct rb_root *r, unsigned long addr)
> > +{
> > +	struct rb_node *n = r->rb_node;
> > +	unsigned long m_val, m_size;
> > +
> > +	while (n) {
> > +		struct module *m = mod_entry(n);
> > +		int idx = mod_node_idx(m, n);
> > +
> > +		m_val = mod_value(m, idx);
> > +		m_size = mod_size(m, idx);
> > +
> > +		if (addr < m_val)
> > +			n = n->rb_left;
> 
> We need a rcu_dereference() or lockless_dereference() here, I think.
> 
> > +		else if (addr >= m_val + m_size)
> > +			n = n->rb_right;
> 
> And here.
> 
> > +		else
> > +			return m;
> > +	}
> > +
> > +	return NULL;
> 
> I did finally find the RCU read-side critical section, supplied by
> is_module_text_address()'s preempt_disable() and preempt_enable().
> The matching synchronize_sched() is supplied by do_init_module() and
> load_module().  I expected a synchronize_sched() in free_module() as well,
> but I only see a synchronize_rcu().  Is the required synchronize_sched()
> on this code path hiding somewhere?
> 
> Or did I miss an rcu_read_lock()/rcu_read_unlock() pair somewhere?

I did not see any, nor any documentation that points to
those. We might want to add documentation to
raw_write_seqcount_latch() telling about the requirements
about making sure concurrent readers don't trigger side-effects.
It's not as simple as the timekeeper use-case, due to having
pointers here.

Good catch !

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > +}
> > +
> > +static struct latch_tree_root mod_tree;
> > +
> > +static void mod_tree_insert(struct module *mod)
> > +{
> > +	raw_write_seqcount_latch(&mod_tree.seq);
> > +	__tree_insert(&mod_tree, mod, latch0_core);
> > +	if (mod->init_size)
> > +		__tree_insert(&mod_tree, mod, latch0_init);
> > +	raw_write_seqcount_latch(&mod_tree.seq);
> > +	__tree_insert(&mod_tree, mod, latch1_core);
> > +	if (mod->init_size)
> > +		__tree_insert(&mod_tree, mod, latch1_init);
> > +}
> > +
> > +static void mod_tree_remove_init(struct module *mod)
> > +{
> > +	raw_write_seqcount_latch(&mod_tree.seq);
> > +	if (mod->init_size)
> > +		__tree_remove(&mod_tree, mod, latch0_init);
> > +	raw_write_seqcount_latch(&mod_tree.seq);
> > +	if (mod->init_size)
> > +		__tree_remove(&mod_tree, mod, latch1_init);
> > +}
> > +
> > +static void mod_tree_remove(struct module *mod)
> > +{
> > +	raw_write_seqcount_latch(&mod_tree.seq);
> > +	__tree_remove(&mod_tree, mod, latch0_core);
> > +	if (mod->init_size)
> > +		__tree_remove(&mod_tree, mod, latch0_init);
> > +	raw_write_seqcount_latch(&mod_tree.seq);
> > +	__tree_remove(&mod_tree, mod, latch1_core);
> > +	if (mod->init_size)
> > +		__tree_remove(&mod_tree, mod, latch1_init);
> > +}
> > +
> > +static struct module *mod_tree_find(unsigned long addr)
> > +{
> > +	struct module *m;
> > +	unsigned int seq;
> > +
> > +	do {
> > +		seq = raw_read_seqcount(&mod_tree.seq);
> > +		m = __tree_find(&mod_tree.tree[seq & latch_bit], addr);
> > +	} while (read_seqcount_retry(&mod_tree.seq, seq));
> > +
> > +	return m;
> > +}
> > +
> >  #ifdef CONFIG_KGDB_KDB
> >  struct list_head *kdb_modules = &modules; /* kdb needs the list of modules
> >  */
> >  #endif /* CONFIG_KGDB_KDB */
> > @@ -1854,6 +2052,7 @@ static void free_module(struct module *m
> >  	mutex_lock(&module_mutex);
> >  	/* Unlink carefully: kallsyms could be walking list. */
> >  	list_del_rcu(&mod->list);
> > +	mod_tree_remove(mod);
> >  	/* Remove this module from bug list, this uses list_del_rcu */
> >  	module_bug_cleanup(mod);
> >  	/* Wait for RCU synchronizing before releasing mod->list and buglist. */
> > @@ -3098,6 +3297,7 @@ static noinline int do_init_module(struc
> >  	mod->symtab = mod->core_symtab;
> >  	mod->strtab = mod->core_strtab;
> >  #endif
> > +	mod_tree_remove_init(mod);
> >  	unset_module_init_ro_nx(mod);
> >  	module_arch_freeing_init(mod);
> >  	mod->module_init = NULL;
> > @@ -3168,6 +3368,7 @@ static int add_unformed_module(struct mo
> >  		goto out;
> >  	}
> >  	list_add_rcu(&mod->list, &modules);
> > +	mod_tree_insert(mod);
> >  	err = 0;
> > 
> >  out:
> > @@ -3367,6 +3568,7 @@ static int load_module(struct load_info
> >  	mutex_lock(&module_mutex);
> >  	/* Unlink carefully: kallsyms could be walking list. */
> >  	list_del_rcu(&mod->list);
> > +	mod_tree_remove(mod);
> >  	wake_up_all(&module_wq);
> >  	/* Wait for RCU synchronizing before releasing mod->list. */
> >  	synchronize_rcu();
> > @@ -3810,13 +4012,13 @@ struct module *__module_address(unsigned
> >  	if (addr < module_addr_min || addr > module_addr_max)
> >  		return NULL;
> > 
> > -	list_for_each_entry_rcu(mod, &modules, list) {
> > +	mod = mod_tree_find(addr);
> > +	if (mod) {
> > +		BUG_ON(!within_module(addr, mod));
> >  		if (mod->state == MODULE_STATE_UNFORMED)
> > -			continue;
> > -		if (within_module(addr, mod))
> > -			return mod;
> > +			mod = NULL;
> >  	}
> > -	return NULL;
> > +	return mod;
> >  }
> >  EXPORT_SYMBOL_GPL(__module_address);
> > 
> > 
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ