Message-Id: <1243584388.23657.156.camel@twins>
Date:	Fri, 29 May 2009 10:06:28 +0200
From:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
To:	Paul Mackerras <paulus@...ba.org>
Cc:	Ingo Molnar <mingo@...e.hu>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] perf_counter: Don't swap contexts containing
 locked mutex

On Fri, 2009-05-29 at 16:06 +1000, Paul Mackerras wrote:
> Peter Zijlstra pointed out that under some circumstances, we can take
> the mutex in a context or a counter and then swap that context or
> counter to another task, potentially leading to lock order inversions
> or the mutexes not protecting what they are supposed to protect.
> 
> This fixes the problem by making sure that we never take a mutex in a
> context or counter which could get swapped to another task.  Most of
> the cases where we take a mutex is on a top-level counter or context,
> i.e. a counter which has an fd associated with it or a context that
> contains such a counter.  This adds WARN_ON_ONCE statements to verify
> that.
> 
> The two cases where we need to take the mutex on a context that is a
> clone of another are in perf_counter_exit_task and
> perf_counter_init_task.  The perf_counter_exit_task case is solved by
> uncloning the context before starting to remove the counters from it.
> The perf_counter_init_task is a little trickier; we temporarily
> disable context swapping for the parent (forking) task by setting its
> ctx->parent_gen to the all-1s value after locking the context, if it
> is a cloned context, and restore the ctx->parent_gen value at the end
> if the context didn't get uncloned in the meantime.
> 
> This also moves the increment of the context generation count to be
> within the same critical section, protected by the context mutex, that
> adds the new counter to the context.  That way, taking the mutex is
> sufficient to ensure that both the counter list and the generation
> count are stable.
> 
> Signed-off-by: Paul Mackerras <paulus@...ba.org>
> ---
>  kernel/perf_counter.c |   89 ++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 79 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
> index 52e5a15..db843f8 100644
> --- a/kernel/perf_counter.c
> +++ b/kernel/perf_counter.c
> @@ -919,7 +919,8 @@ static int context_equiv(struct perf_counter_context *ctx1,
>  			 struct perf_counter_context *ctx2)
>  {
>  	return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
> -		&& ctx1->parent_gen == ctx2->parent_gen;
> +		&& ctx1->parent_gen == ctx2->parent_gen
> +		&& ctx1->parent_gen != ~0ull;
>  }

There's a nasty surprise for people a few generations down the line. All
of a sudden performance drops for a while for some unknown reason, and
then it's good again... how odd ;-)

But yeah, seems fine, given that the alternative is yet another
variable.


> @@ -1414,6 +1414,7 @@ static int perf_release(struct inode *inode, struct file *file)
>  
>  	file->private_data = NULL;
>  
> +	WARN_ON_ONCE(ctx->parent_ctx);
>  	mutex_lock(&ctx->mutex);
>  	perf_counter_remove_from_context(counter);
>  	mutex_unlock(&ctx->mutex);

<snip all the other WARN_ON_ONCEs>

How about:

#define ASSERT_CTX_STABLE(ctx) \
  WARN_ON_ONCE((ctx)->parent_ctx && (ctx)->parent_gen != ~0ull)

which would deal with both a 'locked' context and an uncloned one?
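
I.e. the perf_release() hunk above would then read something like
(untested sketch, just substituting the macro for the open-coded check):

	file->private_data = NULL;

	ASSERT_CTX_STABLE(ctx);
	mutex_lock(&ctx->mutex);
	perf_counter_remove_from_context(counter);
	mutex_unlock(&ctx->mutex);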


> @@ -3571,9 +3596,11 @@ again:
>  int perf_counter_init_task(struct task_struct *child)
>  {
>  	struct perf_counter_context *child_ctx, *parent_ctx;
> +	struct perf_counter_context *cloned_ctx;
>  	struct perf_counter *counter;
>  	struct task_struct *parent = current;
>  	int inherited_all = 1;
> +	u64 cloned_gen;
>  	int ret = 0;
>  
>  	child->perf_counter_ctxp = NULL;
> @@ -3581,8 +3608,7 @@ int perf_counter_init_task(struct task_struct *child)
>  	mutex_init(&child->perf_counter_mutex);
>  	INIT_LIST_HEAD(&child->perf_counter_list);
>  
> -	parent_ctx = parent->perf_counter_ctxp;
> -	if (likely(!parent_ctx || !parent_ctx->nr_counters))
> +	if (likely(!parent->perf_counter_ctxp))
>  		return 0;
>  
>  	/*
> @@ -3600,6 +3626,34 @@ int perf_counter_init_task(struct task_struct *child)
>  	get_task_struct(child);
>  
>  	/*
> +	 * If the parent's context is a clone, temporarily set its
> +	 * parent_gen to an impossible value (all 1s) so it won't get
> +	 * swapped under us.  The rcu_read_lock makes sure that
> +	 * parent_ctx continues to exist even if it gets swapped to
> +	 * another process and then freed while we are trying to get
> +	 * its lock.
> +	 */
> +	rcu_read_lock();
> + retry:
> +	parent_ctx = rcu_dereference(parent->perf_counter_ctxp);
> +	/*
> +	 * No need to check if parent_ctx != NULL here; since we saw
> +	 * it non-NULL earlier, the only reason for it to become NULL
> +	 * is if we exit, and since we're currently in the middle of
> +	 * a fork we can't be exiting at the same time.
> +	 */
> +	spin_lock_irq(&parent_ctx->lock);
> +	if (parent_ctx != rcu_dereference(parent->perf_counter_ctxp)) {
> +		spin_unlock_irq(&parent_ctx->lock);
> +		goto retry;
> +	}
> +	cloned_gen = parent_ctx->parent_gen;
> +	if (parent_ctx->parent_ctx)
> +		parent_ctx->parent_gen = ~0ull;
> +	spin_unlock_irq(&parent_ctx->lock);
> +	rcu_read_unlock();
> +
> +	/*
>  	 * Lock the parent list. No need to lock the child - not PID
>  	 * hashed yet and not running, so nobody can access it.
>  	 */
> @@ -3630,10 +3684,15 @@ int perf_counter_init_task(struct task_struct *child)
>  		/*
>  		 * Mark the child context as a clone of the parent
>  		 * context, or of whatever the parent is a clone of.
> +		 * Note that if the parent is a clone, it could get
> +		 * uncloned at any point, but that doesn't matter
> +		 * because the list of counters and the generation
> +		 * count can't have changed since we took the mutex.
>  		 */
> -		if (parent_ctx->parent_ctx) {
> -			child_ctx->parent_ctx = parent_ctx->parent_ctx;
> -			child_ctx->parent_gen = parent_ctx->parent_gen;
> +		cloned_ctx = rcu_dereference(parent_ctx->parent_ctx);
> +		if (cloned_ctx) {
> +			child_ctx->parent_ctx = cloned_ctx;
> +			child_ctx->parent_gen = cloned_gen;
>  		} else {
>  			child_ctx->parent_ctx = parent_ctx;
>  			child_ctx->parent_gen = parent_ctx->generation;
> @@ -3643,6 +3702,16 @@ int perf_counter_init_task(struct task_struct *child)
>  
>  	mutex_unlock(&parent_ctx->mutex);
>  
> +	/*
> +	 * Restore the clone status of the parent.
> +	 */
> +	if (parent_ctx->parent_ctx) {
> +		spin_lock_irq(&parent_ctx->lock);
> +		if (parent_ctx->parent_ctx)
> +			parent_ctx->parent_gen = cloned_gen;
> +		spin_unlock_irq(&parent_ctx->lock);
> +	}
> +
>  	return ret;
>  }


Could we maybe write this as:

static struct perf_counter_context *pin_ctx(struct perf_counter *counter, u64 *old_gen)
{
	struct perf_counter_context *ctx;
	unsigned long flags;

	rcu_read_lock();
retry:
	ctx = rcu_dereference(counter->ctx);
	spin_lock_irqsave(&ctx->lock, flags);
	/*
	 * If the counter got moved to another context while we were
	 * waiting for the lock, drop it and start over.
	 */
	if (ctx != rcu_dereference(counter->ctx)) {
		spin_unlock_irqrestore(&ctx->lock, flags);
		goto retry;
	}

	/* Pin the context by giving it an impossible generation count. */
	*old_gen = ctx->generation;
	ctx->generation = ~0ULL;
	spin_unlock_irqrestore(&ctx->lock, flags);
	rcu_read_unlock();

	return ctx;
}

static void unpin_ctx(struct perf_counter_context *ctx, u64 old_gen)
{
	unsigned long flags;

	/* Restore the generation count saved by pin_ctx(). */
	spin_lock_irqsave(&ctx->lock, flags);
	ctx->generation = old_gen;
	spin_unlock_irqrestore(&ctx->lock, flags);
}

So that it's easily reusable?

Things like perf_counter_{enable,disable}() also need it, I think.
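
Usage would then be something like this (untested sketch; the body under
the mutex is only illustrative):

	struct perf_counter_context *ctx;
	u64 old_gen;

	ctx = pin_ctx(counter, &old_gen);
	/* ctx is pinned now, so it can't get swapped to another task */
	mutex_lock(&ctx->mutex);
	/* ... manipulate the counter/context under the mutex ... */
	mutex_unlock(&ctx->mutex);
	unpin_ctx(ctx, old_gen);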

> @@ -3527,12 +3538,17 @@ void perf_counter_exit_task(struct task_struct *child) 
> 	struct perf_counter_context *child_ctx;
>  	unsigned long flags;
>  
> -	child_ctx = child->perf_counter_ctxp;
> -
> -	if (likely(!child_ctx))
> +	if (likely(!child->perf_counter_ctxp))
>  		return;
>  
>  	local_irq_save(flags);
> +	/*
> +	 * We can't reschedule here because interrupts are disabled,
> +	 * and either child is current or it is a task that can't be
> +	 * scheduled, so we are now safe from rescheduling changing
> +	 * our context.
> +	 */
> +	child_ctx = child->perf_counter_ctxp;
>  	__perf_counter_task_sched_out(child_ctx);
>  
>  	/*
> @@ -3542,6 +3558,15 @@ void perf_counter_exit_task(struct task_struct *child)
>  	 */
>  	spin_lock(&child_ctx->lock);
>  	child->perf_counter_ctxp = NULL;
> +	if (child_ctx->parent_ctx) {
> +		/*
> +		 * This context is a clone; unclone it so it can't get
> +		 * swapped to another process while we're removing all
> +		 * the counters from it.
> +		 */
> +		put_ctx(child_ctx->parent_ctx);
> +		child_ctx->parent_ctx = NULL;
> +	}
>  	spin_unlock(&child_ctx->lock);
>  	local_irq_restore(flags);

And then we can also use pin_ctx() here, right?

