Message-Id: <1595487967.kclapwroks.astroid@bobo.none>
Date:   Thu, 23 Jul 2020 17:15:51 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     axboe@...nel.dk, hch@....de, jannh@...gle.com,
        keescook@...omium.org, linux-kernel@...r.kernel.org,
        luto@...capital.net, mathieu.desnoyers@...icios.com,
        torvalds@...ux-foundation.org, will@...nel.org
Subject: Re: [PATCH v3] mm: Fix kthread_use_mm() vs TLB invalidate

Excerpts from Peter Zijlstra's message of July 22, 2020 6:35 pm:
> On Tue, Jul 21, 2020 at 02:06:23PM -0700, Andrew Morton wrote:
>> On Tue, 21 Jul 2020 17:41:06 +0200 Peter Zijlstra <peterz@...radead.org> wrote:
>> 
>> > 
>> > For SMP systems using IPI based TLB invalidation, looking at
>> > current->active_mm is entirely reasonable. This then presents the
>> > following race condition:
>> > 
>> > 
>> >   CPU0			CPU1
>> > 
>> >   flush_tlb_mm(mm)	use_mm(mm)
>> >     <send-IPI>
>> > 			  tsk->active_mm = mm;
>> > 			  <IPI>
>> > 			    if (tsk->active_mm == mm)
>> > 			      // flush TLBs
>> > 			  </IPI>
>> > 			  switch_mm(old_mm,mm,tsk);
>> > 
>> > 
>> > Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
>> > because the IPI lands before we actually switched.
>> > 
>> > Avoid this by disabling IRQs across changing ->active_mm and
>> > switch_mm().
>> > 
>> > [ There are all sorts of reasons this might be harmless for various
>> > architecture specific reasons, but best not leave the door open at
>> > all. ]
>> 
>> Can we give the -stable maintainers (and others) more explanation of
>> why they might choose to merge this?
> 
> Like so then?
> 
> ---
> Subject: mm: Fix kthread_use_mm() vs TLB invalidate
> From: Peter Zijlstra <peterz@...radead.org>
> Date: Tue, 11 Feb 2020 10:25:19 +0100
> 
> For SMP systems using IPI based TLB invalidation, looking at
> current->active_mm is entirely reasonable. This then presents the
> following race condition:
> 
> 
>   CPU0			CPU1
> 
>   flush_tlb_mm(mm)	use_mm(mm)
>     <send-IPI>
> 			  tsk->active_mm = mm;
> 			  <IPI>
> 			    if (tsk->active_mm == mm)
> 			      // flush TLBs
> 			  </IPI>
> 			  switch_mm(old_mm,mm,tsk);
> 
> 
> Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
> because the IPI lands before we actually switched.
> 
> Avoid this by disabling IRQs across changing ->active_mm and
> switch_mm().
> 
> Of the (SMP) architectures that have IPI based TLB invalidate:
> 
>   Alpha    - checks active_mm
>   ARC      - ASID specific
>   IA64     - checks active_mm
>   MIPS     - ASID specific flush
>   OpenRISC - shoots down world
>   PARISC   - shoots down world
>   SH       - ASID specific
>   SPARC    - ASID specific
>   x86      - N/A
>   xtensa   - checks active_mm
> 
> So at the very least Alpha, IA64 and Xtensa are suspect.
> 
> On top of this, for scheduler consistency we need at least preemption
> disabled across changing tsk->mm and doing switch_mm(), which is
> currently provided by task_lock(), but that's not sufficient for
> PREEMPT_RT.
> 
> Reported-by: Andy Lutomirski <luto@...capital.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Cc: stable@...nel.org
> ---
>  kernel/kthread.c |   11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1241,13 +1241,20 @@ void kthread_use_mm(struct mm_struct *mm
>  	WARN_ON_ONCE(tsk->mm);
>  
>  	task_lock(tsk);
> +	/*
> +	 * Serialize the tsk->mm store and switch_mm() against TLB invalidation
> +	 * IPIs. Also make sure we're non-preemptible on PREEMPT_RT to not race
> +	 * against the scheduler writing to these variables.
> +	 */
> +	local_irq_disable();
>  	active_mm = tsk->active_mm;
>  	if (active_mm != mm) {
>  		mmgrab(mm);
>  		tsk->active_mm = mm;
>  	}
>  	tsk->mm = mm;
> -	switch_mm(active_mm, mm, tsk);
> +	switch_mm_irqs_off(active_mm, mm, tsk);
> +	local_irq_enable();
>  	task_unlock(tsk);
>  #ifdef finish_arch_post_lock_switch
>  	finish_arch_post_lock_switch();
> @@ -1276,9 +1283,11 @@ void kthread_unuse_mm(struct mm_struct *
>  
>  	task_lock(tsk);
>  	sync_mm_rss(mm);
> +	local_irq_disable();
>  	tsk->mm = NULL;
>  	/* active_mm is still 'mm' */
>  	enter_lazy_tlb(mm, tsk);
> +	local_irq_enable();
>  	task_unlock(tsk);
>  }
>  EXPORT_SYMBOL_GPL(kthread_unuse_mm);
> 

Oh good, this is also needed as part of my preferred fix for the
io_uring mmget_not_zero()->use_mm() vs mm_cpumask problem:

https://marc.info/?l=linux-mm&m=159520550112106&w=2

I'll try to do arch fixes on top of this (I have the same hunks
locally!). After that, as far as I can see, we should be able to
treat references taken with mmget_not_zero() as first-class
references to the mm.

Thanks,
Nick