Message-ID: <20190906082305.GU2349@hirez.programming.kicks-ass.net>
Date:   Fri, 6 Sep 2019 10:23:05 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     "Paul E. McKenney" <paulmck@...ux.ibm.com>,
        Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
        Oleg Nesterov <oleg@...hat.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Christoph Lameter <cl@...ux.com>,
        Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC PATCH 4/4] Fix: sched/membarrier: p->mm->membarrier_state
 racy load

On Thu, Sep 05, 2019 at 11:13:00PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..7020572eb605 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -382,6 +382,9 @@ struct mm_struct {
>  		unsigned long task_size;	/* size of task vm space */
>  		unsigned long highest_vm_end;	/* highest vma end address */
>  		pgd_t * pgd;
> +#ifdef CONFIG_MEMBARRIER

Stick in a comment, on why here. To be close to data already used by
switch_mm().

> +		atomic_t membarrier_state;
> +#endif
>  
>  		/**
>  		 * @mm_users: The number of users including userspace.
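
For the comment, something like this perhaps (wording is just a sketch):

		pgd_t * pgd;
#ifdef CONFIG_MEMBARRIER
		/*
		 * @membarrier_state: flags controlling membarrier behaviour.
		 *
		 * Placed next to @pgd to be close to the data switch_mm()
		 * already touches on every context switch.
		 */
		atomic_t membarrier_state;
#endif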

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 010d578118d6..1cffc1aa403c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3038,6 +3038,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
>  	fire_sched_out_preempt_notifiers(prev, next);
> +	membarrier_prepare_task_switch(rq, prev, next);

This had me confused for a while, because I initially thought we'd only
do this for switch_mm(), but you've made it aggressive and it tracks
kernel threads too.

I think we can do that slightly different. See below...

>  	prepare_task(next);
>  	prepare_arch_switch(next);
>  }

> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 7e0a0d6535f3..5744c300d29e 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -30,6 +30,28 @@ static void ipi_mb(void *info)

> +void membarrier_execve(struct task_struct *t)
> +{
> +	atomic_set(&t->mm->membarrier_state, 0);
> +	WRITE_ONCE(this_rq()->membarrier_state, 0);

It is the callsite of this one that had me puzzled and confused. I
think it works by accident more than anything else.

You see, I thought the rules were that we'd change it near/before
switch_mm(), and this is quite a way _after_.

I think it might be best to place the call in exec_mmap(), right before
activate_mm().
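
That is, something along these lines (rough sketch of the placement only,
not an actual patch):

static int exec_mmap(struct mm_struct *mm)
{
	...
	task_lock(tsk);
	active_mm = tsk->active_mm;
	tsk->mm = mm;
	tsk->active_mm = mm;
	/* reset membarrier state for the new mm before it goes live */
	membarrier_execve(tsk);
	activate_mm(active_mm, mm);
	task_unlock(tsk);
	...
}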

But that then had me wonder about the membarrier_prepare_task_switch()
thing...

> +/*
> + * The scheduler provides memory barriers required by membarrier between:
> + * - prior user-space memory accesses and store to rq->membarrier_state,
> + * - store to rq->membarrier_state and following user-space memory accesses.
> + * In the same way it provides those guarantees around store to rq->curr.
> + */
> +static inline void membarrier_prepare_task_switch(struct rq *rq,
> +						  struct task_struct *prev,
> +						  struct task_struct *next)
> +{
> +	int membarrier_state = 0;
> +	struct mm_struct *next_mm = next->mm;
> +
> +	if (prev->mm == next_mm)
> +		return;
> +	if (next_mm)
> +		membarrier_state = atomic_read(&next_mm->membarrier_state);
> +	if (READ_ONCE(rq->membarrier_state) != membarrier_state)
> +		WRITE_ONCE(rq->membarrier_state, membarrier_state);
> +}

So if you make the above something like:

static inline void
membarrier_switch_mm(struct rq *rq, struct mm_struct *prev_mm, struct mm_struct *next_mm)
{
	int membarrier_state;

	if (prev_mm == next_mm)
		return;

	membarrier_state = atomic_read(&next_mm->membarrier_state);
	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
		return;

	WRITE_ONCE(rq->membarrier_state, membarrier_state);
}

And put it right in front of switch_mm() in context_switch() (placement
sketched further down); then we'll deal with kernel threads on the other
side, like so:

> @@ -70,16 +90,13 @@ static int membarrier_global_expedited(void)
>  		if (cpu == raw_smp_processor_id())
>  			continue;
>  
> -		rcu_read_lock();
> -		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> -		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
> -				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> +		if (READ_ONCE(cpu_rq(cpu)->membarrier_state) &
> +		    MEMBARRIER_STATE_GLOBAL_EXPEDITED) {

		p = rcu_dereference(rq->curr);
		if ((READ_ONCE(cpu_rq(cpu)->membarrier_state) & MEMBARRIER_STATE_GLOBAL_EXPEDITED) &&
		    !(p->flags & PF_KTHREAD))

>  			if (!fallback)
>  				__cpumask_set_cpu(cpu, tmpmask);
>  			else
>  				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>  		}
> -		rcu_read_unlock();
>  	}
>  	if (!fallback) {
>  		preempt_disable();

does that make sense?
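
For the other half, the placement in context_switch() I'm thinking of
would be roughly this (untested sketch, I'm hand-waving the lazy-mm
details):

	if (!next->mm) {			/* to kernel thread */
		enter_lazy_tlb(prev->active_mm, next);
		next->active_mm = prev->active_mm;
		if (prev->mm)			/* from user */
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {				/* to user task */
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
		...
	}

Kernel threads then simply inherit whatever rq->membarrier_state the last
user task left behind, which is why the PF_KTHREAD check above is needed.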

(also, I hate how long all these membarrier names are)
