Message-ID: <20160830121937.GQ10138@twins.programming.kicks-ass.net>
Date:   Tue, 30 Aug 2016 14:19:37 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Balbir Singh <bsingharora@...il.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Nicholas Piggin <nicholas.piggin@...il.com>
Subject: Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
> 
> 
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

As Nick says (good to see you're back Nick!), this is unrelated to
rwsems.

This is true for pretty much every blocking wait loop out there; they
all do:

	for (;;) {
		current->state = TASK_UNINTERRUPTIBLE;
		smp_mb();	/* order the ->state store against the condition load */
		if (cond)
			break;
		schedule();
	}
	current->state = TASK_RUNNING;

Which, if the wakeup is spurious, is just the pattern you need.
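
(Purely as a minimal sketch, nothing specific to this patch: the same loop
is more commonly spelled with the set_current_state() helpers, where the
explicit smp_mb() disappears because the helper implies a full barrier.)

	for (;;) {
		/* store ->state and imply a full memory barrier */
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (cond)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);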

> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see ->on_rq and ->state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *    [S] ->on_rq = 1				[L] ->state
> +	 *    MB					 RMB

There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
to PPC, is _not_ MB. It is however sufficient for this case.

> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */


	/*
	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
	 * in smp_cond_load_acquire() below.
	 *
	 * sched_ttwu_pending()			try_to_wake_up()
	 *   [S] p->on_rq = 1;			[L] p->state
	 *       UNLOCK rq->lock
	 *
	 * schedule()				RMB
	 *       LOCK rq->lock
	 *       UNLOCK rq->lock
	 *
	 * [task p]
	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
	 *
	 * Pairs with the UNLOCK+LOCK on rq->lock from the
	 * last wakeup of our task and the schedule that got our task
	 * current.
	 */

> +	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto stat;
>  
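
(To make the "below" in that comment concrete, here is a rough, condensed
sketch of how the pieces sit in try_to_wake_up(); the
smp_cond_load_acquire() wait is the mainline spelling of the
while (p->on_cpu) spin from your changelog, not the exact context of
this hunk.)

	smp_rmb();	/* load ->on_rq after ->state */
	if (p->on_rq && ttwu_remote(p, wake_flags))
		goto stat;

	/*
	 * A falsely observed ->on_rq == 0 skips ttwu_remote() and gets us
	 * stuck here, waiting for ->on_cpu to clear.
	 */
	smp_cond_load_acquire(&p->on_cpu, !VAL);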


Now, this has been present for a fair while; I suspect ever since we
reworked the wakeup path to not use rq->lock twice. Curious that you're
only now hitting it.
