linux-kernel - Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1e2e1a49-db2f-bdae-53b2-0bda225be472@gmail.com>
Date:   Wed, 31 Aug 2016 13:41:33 +1000
From:   Balbir Singh <bsingharora@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Nicholas Piggin <nicholas.piggin@...il.com>,
        Alexey Kardashevskiy <aik@...abs.ru>
Subject: Re: [RFC][PATCH] Fix a race between rwsem and the scheduler



On 30/08/16 22:19, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>
>>
>> The origin of the issue I've seen seems to be related to
>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>> following state
> 
> As Nick says (good to see you're back Nick!), this is unrelated to
> rwsems.
> 
> This is true for pretty much every blocking wait loop out there, they
> all do:
> 
> 	for (;;) {
> 		current->state = UNINTERRUPTIBLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		schedule();
> 	}
> 	current->state = RUNNING;
> 
> Which, if the wakeup is spurious, is just the pattern you need.

Yes True! My bad Alexey had seen the same basic pattern, I should have been clearer
in my commit log. Should I resend the patch?

> 
>> +++ b/kernel/sched/core.c
>> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  	success = 1; /* we're going to change ->state */
>>  	cpu = task_cpu(p);
>>  
>> +	/*
>> +	 * Ensure we see on_rq and p_state consistently
>> +	 *
>> +	 * For example in __rwsem_down_write_failed(), we have
>> +	 *    [S] ->on_rq = 1				[L] ->state
>> +	 *    MB					 RMB
> 
> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
> to PPC, is _not_ MB. It is however sufficient for this case.
> 

The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
different thread.

>> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
>> +	 * In the absence of the RMB p->on_rq can be observed to be 0
>> +	 * and we end up spinning indefinitely in while (p->on_cpu)
>> +	 */
> 
> 
> 	/*
> 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
> 	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
> 	 * in smp_cond_load_acquire() below.
> 	 *
> 	 * sched_ttwu_pending()			try_to_wake_up()
> 	 *   [S] p->on_rq = 1;			[L] P->state
> 	 *       UNLOCK rq->lock
> 	 *
> 	 * schedule()				RMB
> 	 *       LOCK rq->lock
> 	 *       UNLOCK rq->lock
> 	 *
> 	 * [task p]
> 	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
> 	 *
> 	 * Pairs with the UNLOCK+LOCK on rq->lock from the
> 	 * last wakeup of our task and the schedule that got our task
> 	 * current.
> 	 */
> 
>> +	smp_rmb();
>>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>>  		goto stat;
>>  
> 
> 
> Now, this has been present for a fair while, I suspect ever since we
> reworked the wakeup path to not use rq->lock twice. Curious you only now
> hit it.
> 

Yes, I just hit it a a week or two back and I needed to collect data to
explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
I needed a system with large threads and less memory running stress-ng.
Reproducing the problem takes an unpredictable amount of time.

Balbir Singh.