Message-Id: <1595487967.kclapwroks.astroid@bobo.none>
Date: Thu, 23 Jul 2020 17:15:51 +1000
From: Nicholas Piggin <npiggin@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>
Cc: axboe@...nel.dk, hch@....de, jannh@...gle.com,
keescook@...omium.org, linux-kernel@...r.kernel.org,
luto@...capital.net, mathieu.desnoyers@...icios.com,
torvalds@...ux-foundation.org, will@...nel.org
Subject: Re: [PATCH v3] mm: Fix kthread_use_mm() vs TLB invalidate
Excerpts from Peter Zijlstra's message of July 22, 2020 6:35 pm:
> On Tue, Jul 21, 2020 at 02:06:23PM -0700, Andrew Morton wrote:
>> On Tue, 21 Jul 2020 17:41:06 +0200 Peter Zijlstra <peterz@...radead.org> wrote:
>>
>> >
>> > For SMP systems using IPI based TLB invalidation, looking at
>> > current->active_mm is entirely reasonable. This then presents the
>> > following race condition:
>> >
>> >
>> >   CPU0                        CPU1
>> >
>> >   flush_tlb_mm(mm)            use_mm(mm)
>> >     <send-IPI>
>> >                                 tsk->active_mm = mm;
>> >                               <IPI>
>> >                                 if (tsk->active_mm == mm)
>> >                                    // flush TLBs
>> >                               </IPI>
>> >                                 switch_mm(old_mm,mm,tsk);
>> >
>> >
>> > Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
>> > because the IPI lands before we actually switched.
>> >
>> > Avoid this by disabling IRQs across changing ->active_mm and
>> > switch_mm().
>> >
>> > [ There are all sorts of reasons this might be harmless for various
>> > architecture specific reasons, but best not leave the door open at
>> > all. ]
>>
>> Can we give the -stable maintainers (and others) more explanation of
>> why they might choose to merge this?
>
> Like so then?
>
> ---
> Subject: mm: Fix kthread_use_mm() vs TLB invalidate
> From: Peter Zijlstra <peterz@...radead.org>
> Date: Tue, 11 Feb 2020 10:25:19 +0100
>
> For SMP systems using IPI based TLB invalidation, looking at
> current->active_mm is entirely reasonable. This then presents the
> following race condition:
>
>
>   CPU0                        CPU1
>
>   flush_tlb_mm(mm)            use_mm(mm)
>     <send-IPI>
>                                 tsk->active_mm = mm;
>                               <IPI>
>                                 if (tsk->active_mm == mm)
>                                    // flush TLBs
>                               </IPI>
>                                 switch_mm(old_mm,mm,tsk);
>
>
> Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
> because the IPI lands before we actually switched.
>
> Avoid this by disabling IRQs across changing ->active_mm and
> switch_mm().
>
> Of the (SMP) architectures that have IPI based TLB invalidate:
>
> Alpha - checks active_mm
> ARC - ASID specific
> IA64 - checks active_mm
> MIPS - ASID specific flush
> OpenRISC - shoots down world
> PARISC - shoots down world
> SH - ASID specific
> SPARC - ASID specific
> x86 - N/A
> xtensa - checks active_mm
>
> So at the very least Alpha, IA64 and Xtensa are suspect.
>
> On top of this, for scheduler consistency we need at least preemption
> disabled across changing tsk->mm and doing switch_mm(), which is
> currently provided by task_lock(), but that's not sufficient for
> PREEMPT_RT.
>
> Reported-by: Andy Lutomirski <luto@...capital.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Cc: stable@...nel.org
> ---
> kernel/kthread.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1241,13 +1241,20 @@ void kthread_use_mm(struct mm_struct *mm
>  	WARN_ON_ONCE(tsk->mm);
>
>  	task_lock(tsk);
> +	/*
> +	 * Serialize the tsk->mm store and switch_mm() against TLB invalidation
> +	 * IPIs. Also make sure we're non-preemptible on PREEMPT_RT to not race
> +	 * against the scheduler writing to these variables.
> +	 */
> +	local_irq_disable();
>  	active_mm = tsk->active_mm;
>  	if (active_mm != mm) {
>  		mmgrab(mm);
>  		tsk->active_mm = mm;
>  	}
>  	tsk->mm = mm;
> -	switch_mm(active_mm, mm, tsk);
> +	switch_mm_irqs_off(active_mm, mm, tsk);
> +	local_irq_enable();
>  	task_unlock(tsk);
>  #ifdef finish_arch_post_lock_switch
>  	finish_arch_post_lock_switch();
> @@ -1276,9 +1283,11 @@ void kthread_unuse_mm(struct mm_struct *
>
>  	task_lock(tsk);
>  	sync_mm_rss(mm);
> +	local_irq_disable();
>  	tsk->mm = NULL;
>  	/* active_mm is still 'mm' */
>  	enter_lazy_tlb(mm, tsk);
> +	local_irq_enable();
>  	task_unlock(tsk);
>  }
>  EXPORT_SYMBOL_GPL(kthread_unuse_mm);
>
Oh good, this is also needed as part of my preferred fix for the
io_uring mmget_not_zero->use_mm() vs mm_cpumask problem:

https://marc.info/?l=linux-mm&m=159520550112106&w=2

I'll try to do arch fixes on top of this (I have the same hunks
locally!). After that, we should be able to treat mmget_not_zero()
references as first-class references to the mm, AFAICS.

Thanks,
Nick
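
The interleaving in the quoted changelog can be replayed deterministically in
userspace. What follows is only a minimal sketch, not kernel code: struct cpu,
send_ipi(), deliver_ipi() and stale_entry_survives() are hypothetical names
made up for this illustration. It also assumes that CPU1 can hold a stale
translation for the new mm while the old mm is still loaded, and that the
flush handler only reaches the currently loaded context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum { OLD_MM = 0, NEW_MM = 1, NR_MM = 2 };

struct cpu {
	int  active_mm;      /* models tsk->active_mm */
	int  hw_mm;          /* address space currently loaded in the MMU */
	bool stale[NR_MM];   /* assumption: stale TLB entries tracked per mm */
	bool irqs_off;       /* models local_irq_disable()/local_irq_enable() */
	bool ipi_pending;
	int  ipi_mm;         /* mm the pending flush request refers to */
};

/* Models an arch flush handler that checks active_mm before flushing. */
static void ipi_handler(struct cpu *c)
{
	if (c->active_mm == c->ipi_mm)
		c->stale[c->hw_mm] = false; /* only the loaded context is flushed */
	c->ipi_pending = false;
}

/* A point in the code where a pending interrupt may be taken. */
static void deliver_ipi(struct cpu *c)
{
	if (c->ipi_pending && !c->irqs_off)
		ipi_handler(c);
}

static void send_ipi(struct cpu *c, int mm)
{
	c->ipi_pending = true;
	c->ipi_mm = mm;
}

/* Returns true if CPU1 still has a stale entry for the new mm afterwards. */
bool stale_entry_survives(bool disable_irqs)
{
	struct cpu c = {
		.active_mm = OLD_MM,
		.hw_mm = OLD_MM,
		.stale = { false, true }, /* stale entry exists for NEW_MM */
	};

	send_ipi(&c, NEW_MM);       /* CPU0: flush_tlb_mm(mm), <send-IPI> */
	c.irqs_off = disable_irqs;  /* the patch: local_irq_disable() */
	c.active_mm = NEW_MM;       /* tsk->active_mm = mm */
	deliver_ipi(&c);            /* the window in which the IPI can land */
	c.hw_mm = NEW_MM;           /* switch_mm() / switch_mm_irqs_off() */
	c.irqs_off = false;         /* the patch: local_irq_enable() */
	deliver_ipi(&c);            /* a deferred IPI is taken here */

	assert(!c.ipi_pending);     /* every request was handled exactly once */
	return c.stale[NEW_MM];
}
```

Calling stale_entry_survives(false) replays the racy ordering: the handler sees
active_mm == mm but flushes the old context, so the stale entry remains. With
stale_entry_survives(true) the IPI is deferred until after the switch and the
flush reaches the new mm.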