Message-ID: <43886295-1c27-fdf7-859c-e604728881de@gmail.com>
Date:   Wed, 31 Aug 2016 13:25:28 +1000
From:   Balbir Singh <bsingharora@...il.com>
To:     Oleg Nesterov <oleg@...hat.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Nicholas Piggin <nicholas.piggin@...il.com>
Subject: Re: [RFC][PATCH] Fix a race between rwsem and the scheduler



On 30/08/16 22:58, Oleg Nesterov wrote:
> On 08/30, Balbir Singh wrote:
>>
>> The issue I've seen appears to originate in rwsem spin lock stealing.
>> Basically, I see the system deadlocked in the following state.
>>
>> I have a system running many threads, and most of them are stuck doing
>>
>> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
>> [67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
>> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
>> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
>> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
>> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
>> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
>> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
>> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
>> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
>> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
>> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
>> [67272.999016] Instruction dump:
>>
>> They are spinning on the sem->wait_lock; the holder of sem->wait_lock has
>> IRQs disabled and is doing
>>
>> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
>> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
>> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
>> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
>> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
>> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
>> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
>> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
>>
>> The offset shown for try_to_wake_up is misleading; the task is actually
>> stuck doing the following inside try_to_wake_up:
>>
>> while (p->on_cpu)
>> 	cpu_relax();
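>>
>> For context, a simplified paraphrase of the try_to_wake_up() path in
>> question (shape only, not the exact source; names as in
>> kernel/sched/core.c of that era):
>>
>> 	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> 	if (!(p->state & state))			/* [L] ->state */
>> 		goto out;
>> 	success = 1;
>> 	cpu = task_cpu(p);
>> 	if (p->on_rq && ttwu_remote(p, wake_flags))	/* [L] ->on_rq */
>> 		goto stat;
>> 	/* if the ->on_rq load was a stale 0, we fall through to: */
>> 	while (p->on_cpu)
>> 		cpu_relax();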
>>
>> Analysis
>>
>> The issue is triggered by the following race:
>>
>> CPU1					CPU2
>>
>> while () {
>>   if (cond)
>>     break;
>>   do {
>>     schedule();
>>     set_current_state(TASK_UN..)
>>   } while (!cond);
>> 					rwsem_wake()
>> 					  spin_lock_irqsave(wait_lock)
>>   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
>> }					  try_to_wake_up()
>> set_current_state(TASK_RUNNING);	  ..
>> list_del(&waiter.list);
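>>
>> (The CPU1 column above is the writer-wait loop; paraphrasing the
>> relevant part of __rwsem_down_write_failed() from
>> kernel/locking/rwsem-xadd.c, simplified:)
>>
>> 	set_current_state(TASK_UNINTERRUPTIBLE);
>> 	while (true) {
>> 		if (rwsem_try_write_lock(count, sem))
>> 			break;
>> 		raw_spin_unlock_irq(&sem->wait_lock);
>> 		/* block until there are no active lockers */
>> 		do {
>> 			schedule();
>> 			set_current_state(TASK_UNINTERRUPTIBLE);
>> 		} while ((count = sem->count) & RWSEM_ACTIVE_MASK);
>> 		raw_spin_lock_irq(&sem->wait_lock);
>> 	}
>> 	__set_current_state(TASK_RUNNING);
>> 	list_del(&waiter.list);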
>>
>> CPU2 wakes up the task on CPU1, but before that task can take the
>> wait_lock and set its state to TASK_RUNNING, the following occurs:
>>
>> CPU3
>> (stole the rwsem before waiter can be woken up from queue)
>> up_write()
>> rwsem_wake()
>> raw_spin_lock_irqsave(wait_lock)
>> if (!list_empty)
>>   wake_up_process()
>>   try_to_wake_up()
>>   raw_spin_lock_irqsave(p->pi_lock)
>>   ..
>>   if (p->on_rq && ttwu_wakeup())
>>   ..
>>   while (p->on_cpu)
>>     cpu_relax()
>>   ..
>>
>> CPU3 tries to wake up the task on CPU1 again, since it still finds it
>> on the wait queue. The task on CPU1 is spinning on the wait_lock, but
>> CPU3 acquired the lock immediately after CPU2 released it.
>>
>> CPU3 checks the state of the task on CPU1; it is TASK_UNINTERRUPTIBLE
>> and the task is spinning on the wait_lock. Interestingly, even though
>> p->on_rq is checked under pi_lock, I've noticed that try_to_wake_up()
>> finds p->on_rq to be 0. This was the most confusing bit of the
>> analysis: p->on_rq is changed under the runqueue lock, rq->lock, so the
>> p->on_rq check is not reliable without this fix, IMHO. The race is
>> visible (based on the analysis) only when ttwu_queue() does a remote
>> wakeup via ttwu_queue_remote(), in which case the p->on_rq change is
>> not done under the pi_lock.
>>
>> The result is that after a while the entire system locks up on
>> raw_spin_lock_irqsave(wait_lock), and the holder spins infinitely.
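>>
>> (Sketch of the remote-queueing path, paraphrased from
>> kernel/sched/core.c: the waker only queues the task and sends an IPI;
>> p->on_rq is set later by the target CPU in sched_ttwu_pending() under
>> its rq->lock, not under p->pi_lock:)
>>
>> 	static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> 	{
>> 		if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
>> 			smp_send_reschedule(cpu); /* target CPU does the enqueue */
>> 	}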
>>
>> Reproduction of the issue
>>
>> The issue can be reproduced after a long run on my system with 80
>> threads, with available memory tweaked down to a very low value, while
>> running the stress-ng mmapfork memory test. It usually takes a long
>> time to reproduce. I am working on a test case that reproduces the
>> issue faster, but that's work in progress. I am still testing the
>> changes on my system in a loop, and the tests seem OK thus far.
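>>
>> (Roughly, and treating the exact flags as an assumption on my part
>> rather than the precise invocation used, something along the lines of
>>
>> 	stress-ng --mmapfork 80 --timeout 86400s
>>
>> on a box with available memory constrained very low.)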
>>
>> Big thanks to Benjamin and Nick for helping debug this as well.
>> Ben helped catch the missing barrier, and Nick caught every missing
>> bit in my theory.
>>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Nicholas Piggin <npiggin@...il.com>
>> Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>
>>
>> Signed-off-by: Balbir Singh <bsingharora@...il.com>
>> ---
>>  kernel/sched/core.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2a906f2..582c684 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  	success = 1; /* we're going to change ->state */
>>  	cpu = task_cpu(p);
>>  
>> +	/*
>> +	 * Ensure we see on_rq and p->state consistently
>> +	 *
>> +	 * For example in __rwsem_down_write_failed(), we have
>> +	 *    [S] ->on_rq = 1				[L] ->state
>> +	 *    MB					 RMB
>> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
>> +	 * In the absence of the RMB p->on_rq can be observed to be 0
>> +	 * and we end up spinning indefinitely in while (p->on_cpu)
>> +	 */
>> +	smp_rmb();
> 
> I think the patch is fine... but unless I am totally confused this
> is not specific to __rwsem_down_write_failed(). ttwu() can hang the
> same way if the target simply does
> 
> 	schedule_timeout();
> 	current->state = TASK_INTERRUPTIBLE;

Yes

> 	current->state = TASK_RUNNING;
> 
> And. I am not sure I understand where this MB above comes from.

I think Ben pointed that out earlier, and you found it in the documentation
as well, at the end of __switch.
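
To make the generic case concrete (my paraphrase of your
schedule_timeout() example above, not actual kernel code):

	schedule_timeout(timeout);		/* woken: ->on_rq was set on the waker's CPU */
	current->state = TASK_INTERRUPTIBLE;	/* [S] ->state */
	/*
	 * A concurrent ttwu() can now load this new ->state together with
	 * a stale ->on_rq == 0 and end up spinning on ->on_cpu, exactly as
	 * in the rwsem case; the smp_rmb() addresses this pattern generally.
	 */
	current->state = TASK_RUNNING;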

Balbir Singh
