linux-kernel - Re: [RFC PATCH 1/2] Fix: sched/membarrier: p->mm->membarrier

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1713923102.3325.1567731091553.JavaMail.zimbra@efficios.com>
Date:   Thu, 5 Sep 2019 20:51:31 -0400 (EDT)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        paulmck <paulmck@...ux.ibm.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Oleg Nesterov <oleg@...hat.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        "Russell King, ARM Linux" <linux@...linux.org.uk>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Chris Lameter <cl@...ux.com>, Kirill Tkhai <tkhai@...dex.ru>,
        Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>
Subject: Re: [RFC PATCH 1/2] Fix: sched/membarrier: p->mm->membarrier_state
 racy load

----- On Sep 4, 2019, at 2:26 PM, Peter Zijlstra peterz@...radead.org wrote:

> On Wed, Sep 04, 2019 at 01:12:53PM -0400, Mathieu Desnoyers wrote:
>> ----- On Sep 4, 2019, at 12:09 PM, Peter Zijlstra peterz@...radead.org wrote:
>> 
>> > On Wed, Sep 04, 2019 at 11:19:00AM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Sep 3, 2019, at 4:36 PM, Linus Torvalds torvalds@...ux-foundation.org
>> >> wrote:
>> > 
>> >> > I wonder if the easiest model might be to just use a percpu variable
>> >> > instead for the membarrier stuff? It's not like it has to be in
>> >> > 'struct task_struct' at all, I think. We only care about the current
>> >> > runqueues, and those are percpu anyway.
>> >> 
>> >> One issue here is that membarrier iterates over all runqueues without
>> >> grabbing any runqueue lock. If we copy that state from mm to rq on
>> >> sched switch prepare, we would need to ensure we have the proper
>> >> memory barriers between:
>> >> 
>> >> prior user-space memory accesses  /  setting the runqueue membarrier state
>> >> 
>> >> and
>> >> 
>> >> setting the runqueue membarrier state / following user-space memory accesses
>> >> 
>> >> Copying the membarrier state into the task struct leverages the fact that
>> >> we have documented and guaranteed those barriers around the rq->curr update
>> >> in the scheduler.
>> > 
>> > Should be the same as the barriers we already rely on for rq->curr, no?
>> > That is, if we put this before switch_mm() then we have
>> > smp_mb__after_spinlock() and switch_mm() itself.
>> 
>> Yes, I think we can piggy-back on the already documented barriers documented
>> around
>> rq->curr store.
>> 
>> > Also, if we place mm->membarrier_state in the same cacheline as mm->pgd
>> > (which switch_mm() is bound to load) then we should be fine, I think.
>> 
>> Yes, if we make sure membarrier_prepare_task_switch only updates the
>> rq->membarrier_state if prev->mm != next->mm, we should be able to avoid
>> loading next->mm->membarrier_state when switch_mm() is not invoked.
>> 
>> I'll prepare RFC patch implementing this approach.
> 
> Thinking about this a bit; switching it 'on' still requires some
> thinking. Consider register on an already threaded process of which
> multiple threads are 'current' on multiple CPUs. In that case none of
> the rq bits will be set.
> 
> Not even synchronize_rcu() is sufficient to force it either, since we
> only update on switch_mm() and nothing guarantees we pass that.
> 
> One possible approach would be to IPI broadcast (after setting the
> ->mm->membarrier_State) and having the IPI update the rq state from
> 'current->mm'.
> 
> But possible I'm just confusing evryone again. I'm not having a good day
> today.

Indeed, switching 'on' requires some care. I implemented the IPI-based
approach as per your suggestion,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com