Date:   Fri, 28 Jul 2017 15:36:09 +0000 (UTC)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Andrew Hunter <ahh@...gle.com>,
        maged michael <maged.michael@...il.com>,
        gromer <gromer@...gle.com>, Avi Kivity <avi@...lladb.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Paul Mackerras <paulus@...ba.org>,
        Michael Ellerman <mpe@...erman.id.au>
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra peterz@...radead.org wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +		struct mm_struct *oldmm)
> 
> That is a bit of a mouthful...
> 
>> +{
>> +	if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +		return;
>> +	/*
>> +	 * __schedule()
>> +	 *   finish_task_switch()
>> +	 *    if (mm)
>> +	 *      mmdrop(mm)
>> +	 *        atomic_dec_and_test()
>> +	 *
>> +	 * takes care of issuing a memory barrier when oldmm is
>> +	 * non-NULL. We also don't need the barrier when switching to a
>> +	 * kernel thread, nor when we switch between threads belonging
>> +	 * to the same process.
>> +	 */
>> +	if (likely(oldmm || !mm || mm == oldmm))
>> +		return;
>> +	/*
>> +	 * When switching between processes, membarrier expedited
>> +	 * private requires a memory barrier after we set the current
>> +	 * task.
>> +	 */
>> +	smp_mb();
>> +}
> 
> And because of what it complements, I would have expected the callsite:
> 
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  
>>  	mm = next->mm;
>>  	oldmm = prev->active_mm;
>> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
>>  	/*
>>  	 * For paravirt, this is coupled with an exit in switch_to to
>>  	 * combine the page table reload and the switch backend into
> 
> to be in finish_task_switch(), something like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> 	finish_arch_post_lock_switch();
> 
> 	fire_sched_in_preempt_notifiers(current);
> +
> +	/*
> +	 * For CONFIG_MEMBARRIER we need a full memory barrier after the
> +	 * rq->curr assignment. Not all architectures have one in either
> +	 * switch_to() or switch_mm() so we use (and complement) the one
> +	 * implied by mmdrop()'s atomic_dec_and_test().
> +	 */
> 	if (mm)
> 		mmdrop(mm);
> +	else if (IS_ENABLED(CONFIG_MEMBARRIER))
> +		smp_mb();
> +
> 	if (unlikely(prev_state == TASK_DEAD)) {
> 		if (prev->sched_class->task_dead)
> 			prev->sched_class->task_dead(prev);
> 
> 
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work; then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_lock_switch(rq, prev);
+       /*
+        * The membarrier system call requires a full memory barrier
+        * after storing to rq->curr, before going back to user-space.
+        */
+       smp_mb__after_unlock_lock();
        finish_arch_post_lock_switch();
 
        fire_sched_in_preempt_notifiers(current);

This is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing the PPC maintainers.
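For reference, that macro boils down to roughly the following (a
paraphrase of the kernel sources of that era, not verbatim from any
particular tree):

/*
 * Sketch: powerpc, which selects CONFIG_ARCH_WEAK_RELEASE_ACQUIRE,
 * pays for a real barrier because unlock followed by lock of a
 * different lock is not guaranteed to imply a full barrier there.
 * Everyone else gets a no-op, since the unlock/lock pair already
 * provides full ordering.
 */
#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
#define smp_mb__after_unlock_lock()	smp_mb()
#else
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif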

> 
>> +static void membarrier_private_expedited(void)
>> +{
>> +	int cpu, this_cpu;
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +
>> +	if (num_online_cpus() == 1)
>> +		return;
>> +
>> +	/*
>> +	 * Matches memory barriers around rq->curr modification in
>> +	 * scheduler.
>> +	 */
>> +	smp_mb();	/* system call entry is not a mb. */
>> +
> 
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation
too:

+       /*
+        * Expedited membarrier commands guarantee that they won't
+        * block, hence the GFP_NOWAIT allocation flag and fallback
+        * implementation.
+        */



> 
>> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> 
> You really want: zalloc_cpumask_var().

ok
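
Folding the comment and the zalloc change together, the allocation
path would then read roughly as follows (a sketch of the planned
update, not the final patch):

	/*
	 * Expedited membarrier commands guarantee that they won't
	 * block, hence the GFP_NOWAIT allocation flag and fallback
	 * implementation.
	 */
	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
		/* Fallback for OOM. */
		fallback = true;
	}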

> 
>> +		/* Fallback for OOM. */
>> +		fallback = true;
>> +	}
>> +
>> +	/*
>> +	 * Skipping the current CPU is OK even though we can be
>> +	 * migrated at any point. The current CPU, at the point where we
>> +	 * read raw_smp_processor_id(), is ensured to be in program
>> +	 * order with respect to the caller thread. Therefore, we can
>> +	 * skip this CPU from the iteration.
>> +	 */
>> +	this_cpu = raw_smp_processor_id();
> 
> So if instead you do the below, that is still true, but you have the
> opportunity to skip more CPUs; then again, if you migrate the wrong way
> you'll end up not skipping yourself... ah well.

Chances are better to skip more CPUs in the face of migration if we do
it in the loop as you suggest. Will do.

> 
>> +	cpus_read_lock();
>> +	for_each_online_cpu(cpu) {
>> +		struct task_struct *p;
>> +
>		if (cpu == raw_smp_processor_id())
>			continue;
> 
>> +		rcu_read_lock();
>> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +		if (p && p->mm == current->mm) {
>> +			if (!fallback)
>> +				__cpumask_set_cpu(cpu, tmpmask);
>> +			else
>> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +		}
>> +		rcu_read_unlock();
>> +	}
>> +	cpus_read_unlock();
> 
> This ^, wants to go after that v
> 
>> +	if (!fallback) {
>> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +		free_cpumask_var(tmpmask);
>> +	}
> 
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!
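
The tail of the function would then become something like this (a
sketch combining the loop-local skip and the reordering above, not
the final patch):

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		if (cpu == raw_smp_processor_id())
			continue;
		rcu_read_lock();
		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
		if (p && p->mm == current->mm) {
			if (!fallback)
				__cpumask_set_cpu(cpu, tmpmask);
			else
				smp_call_function_single(cpu, ipi_mb, NULL, 1);
		}
		rcu_read_unlock();
	}
	if (!fallback) {
		/* Send the IPIs while the online state is still stable. */
		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
		free_cpumask_var(tmpmask);
	}
	cpus_read_unlock();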

Mathieu

> 
>> +
>> +	/*
>> +	 * Memory barrier on the caller thread _after_ we finished
>> +	 * waiting for the last IPI. Matches memory barriers around
>> +	 * rq->curr modification in scheduler.
>> +	 */
>> +	smp_mb();	/* exit from system call is not a mb */
>> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
