linux-kernel - Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160830125827.GA17653@redhat.com>
Date:   Tue, 30 Aug 2016 14:58:27 +0200
From:   Oleg Nesterov <oleg@...hat.com>
To:     Balbir Singh <bsingharora@...il.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Nicholas Piggin <nicholas.piggin@...il.com>
Subject: Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

On 08/30, Balbir Singh wrote:
>
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state
> 
> I have a system with multiple threads and
> 
> Most of the threads are stuck doing
> 
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
> 
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
> 
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
> 
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
> 
> while (p->on_cpu)
> 	cpu_relax();
> 
> Analysis
> 
> The issue is triggered, due to the following race
> 
> CPU1					CPU2
> 
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
> 					rwsem_wake()
> 					  spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
> }					  try_to_wake_up()
> set_current_state(TASK_RUNNING);	  ..
> list_del(&waiter.list);
> 
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
> 
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
>   wake_up_process()
>   try_to_wake_up()
>   raw_spin_lock_irqsave(p->pi_lock)
>   ..
>   if (p->on_rq && ttwu_wakeup())
>   ..
>   while (p->on_cpu)
>     cpu_relax()
>   ..
> 
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
> after CPU2, CPU3 got it.
> 
> CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis,
> but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote. In which case the p->on_rq change is not
> done uder the pi_lock.
> 
> The result is that after a while the entire system locks up on
> the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
> 
> Reproduction of the issue
> 
> The issue can be reproduced after a long run on my system with 80
> threads and having to tweak available memory to very low and running
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but thats work in progress. I am still testing the
> changes on my still in a loop and the tests seem OK thus far.
> 
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory
> 
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Nicholas Piggin <npiggin@...il.com>
> Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>
> 
> Signed-off-by: Balbir Singh <bsingharora@...il.com>
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p_state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *    [S] ->on_rq = 1				[L] ->state
> +	 *    MB					 RMB
> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */
> +	smp_rmb();

I think the patch is fine... but unless I am totally confused this
is not specific to __rwsem_down_write_failed(). ttwu() can hang the
same way if the target simply does

	schedule_timeout();
	current->state = TASK_INTERRUPTIBLE;
	current->state = TASK_RUNNING;

And. I am not sure I understand where this MB above comes from.

Oleg.