Date:	Mon, 11 Jan 2010 15:52:50 -0500
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Oleg Nesterov <oleg@...hat.com>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...e.hu>, akpm@...ux-foundation.org,
	josh@...htriplett.org, tglx@...utronix.de, Valdis.Kletnieks@...edu,
	dhowells@...hat.com, laijs@...fujitsu.com, dipankar@...ibm.com,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory
	barrier (v3a)

* Peter Zijlstra (peterz@...radead.org) wrote:
> On Sun, 2010-01-10 at 23:29 -0500, Mathieu Desnoyers wrote:
> > Here is an implementation of a new system call, sys_membarrier(), which
> > executes a memory barrier on all threads of the current process.
> 
> Please start a new thread for new versions, I really didn't find this
> until I started reading in date order instead of thread order.

OK

> 
> 
> > Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:31.000000000 -0500
> > +++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 19:21:37.000000000 -0500
> > @@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
> >  __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
> >  #define __NR_perf_event_open			298
> >  __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
> > +#define __NR_membarrier				299
> > +__SYSCALL(__NR_membarrier, sys_membarrier)
> >  
> >  #ifndef __NO_STUBS
> >  #define __ARCH_WANT_OLD_READDIR
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 19:21:31.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c	2010-01-10 22:22:40.000000000 -0500
> > @@ -2861,12 +2861,26 @@ context_switch(struct rq *rq, struct tas
> >  	 */
> >  	arch_start_context_switch(prev);
> >  
> > +	/*
> > +	 * sys_membarrier IPI-mb scheme requires a memory barrier between
> > +	 * user-space thread execution and update to mm_cpumask.
> > +	 */
> > +	if (likely(oldmm) && likely(oldmm != mm))
> > +		smp_mb__before_clear_bit();
> > +
> >  	if (unlikely(!mm)) {
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		/*
> > +		 * sys_membarrier IPI-mb scheme requires a memory barrier
> > +		 * between update to mm_cpumask and user-space thread execution.
> > +		 */
> > +		if (likely(oldmm != mm))
> > +			smp_mb__after_clear_bit();
> > +	}
> >  
> >  	if (unlikely(!prev->mm)) {
> >  		prev->active_mm = NULL;
> > @@ -10822,6 +10836,49 @@ struct cgroup_subsys cpuacct_subsys = {
> >  };
> >  #endif	/* CONFIG_CGROUP_CPUACCT */
> >  
> > +/*
> > + * Execute a memory barrier on all active threads from the current process
> > + * on SMP systems. Do not rely on implicit barriers in
> > + * smp_call_function_many(), just in case they are ever relaxed in the future.
> > + */
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * sys_membarrier - issue memory barrier on current process running threads
> > + *
> > + * Execute a memory barrier on all running threads of the current process.
> > + * Upon completion, the caller thread is ensured that all process threads
> > + * have passed through a state where memory accesses match program order.
> > + * (non-running threads are de facto in such a state)
> > + */
> > +SYSCALL_DEFINE0(membarrier)
> > +{
> > +#ifdef CONFIG_SMP
> > +	if (unlikely(thread_group_empty(current)))
> > +		return 0;
> > +	/*
> > +	 * Memory barrier on the caller thread _before_ sending first
> > +	 * IPI. Matches memory barriers around mm_cpumask modification in
> > +	 * context_switch().
> > +	 */
> > +	smp_mb();
> > +	preempt_disable();
> > +	smp_call_function_many(mm_cpumask(current->mm), membarrier_ipi,
> > +			       NULL, 1);
> > +	preempt_enable();
> > +	/*
> > +	 * Memory barrier on the caller thread _after_ we finished
> > +	 * waiting for the last IPI. Matches memory barriers around mm_cpumask
> > +	 * modification in context_switch().
> > +	 */
> > +	smp_mb();
> > +#endif	/* #ifdef CONFIG_SMP */
> > +	return 0;
> > +}
> > +
> >  #ifndef CONFIG_SMP
> >  
> >  int rcu_expedited_torture_stats(char *page)
> 
> Right, so here you rely on the arch switch_mm() implementation to keep
> mm_cpumask() current, but then stick a memory barrier in the generic
> code... seems odd.

Agreed. I wanted to give an idea of the required memory barriers
without changing every arch's switch_mm(). If it is generally agreed
that this kind of overhead is OK, then I think the path to follow is to
audit each architecture's switch_mm() and add comments and/or barriers
as needed. We can document that switch_mm() requirement at the top of
the system call, so that as the system call numbers are gradually
assigned per architecture, the matching switch_mm() modifications get
done.
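
Something along the lines of the following sketch is what I have in mind
for that requirement comment (wording is illustrative, not from the patch):

	/*
	 * Architecture requirements for sys_membarrier (sketch):
	 *
	 * - switch_mm() must keep mm_cpumask(mm) containing at least every
	 *   CPU currently running a thread of that mm (a superset, e.g. due
	 *   to lazy TLB shootdown, is fine and only costs extra IPIs).
	 * - A full memory barrier must be executed between the last
	 *   user-space access of the previous task and the mm_cpumask()
	 *   update, and another one between the mm_cpumask() update and the
	 *   first user-space access of the next task.
	 * - Architectures whose switch_mm() already implies such barriers
	 *   (e.g. x86, where the cr3 write is serializing) only need a
	 *   comment saying so.
	 */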

> 
> x86 switch_mm() does indeed keep it current, but writing cr3 is also a
> rather serializing instruction.

Agreed. A write to CR3 is listed as a serializing instruction, whose
effect includes the equivalent of an "mfence" instruction.

> 
> Furthermore, do we really need that smp_mb() in the membarrier_ipi()
> function? Shouldn't we investigate if either:
>  - receiving an IPI implies an mb, or

kernel/smp.c:

void generic_smp_call_function_interrupt(void)
...
        /*
         * Ensure entry is visible on call_function_queue after we have
         * entered the IPI. See comment in smp_call_function_many.
         * If we don't have this, then we may miss an entry on the list
         * and never get another IPI to process it.
         */
        smp_mb();

So, I think that if other concurrent IPI requests are in flight, we have
no guarantee that the IPI handler will issue that memory barrier
_after_ we have added our entry to the list. The list itself is
synchronized with a spinlock taken with irqs off, but not with an
explicit memory barrier. Or am I missing a subtlety here?

>  - enter/leave kernelspace implies an mb
> ?

On x86, iret is a serializing instruction. However, I haven't found any
information saying that int 0x80 or sysenter/sysexit are serializing
instructions, and there seems to be no explicit mfence in x86 entry_*.S
(except for a 64-bit gs swap workaround). So I'm tempted to answer: no,
entering/leaving kernelspace does not seem to imply an mb on x86.
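
(None of this matters to user space, of course: callers simply rely on
the barriers the system call itself provides. As a purely illustrative
sketch, assuming the x86-64 number 299 assigned in the patch above, a
library could wrap it as:)

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef __NR_membarrier
	#define __NR_membarrier 299	/* x86-64 number from the patch; illustrative */
	#endif

	/* Issue a memory barrier on all running threads of the calling process. */
	static inline long membarrier(void)
	{
		return syscall(__NR_membarrier);
	}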

> 
> So while I much like the simplified version, that previous one was
> heavily over engineered, I 
>  1) don't like that memory barrier in the generic code, 

Would that help if we proceed to arch-per-arch modification of
switch_mm() instead ?

>  2) don't think that arch's necessarily keep that mask as tight.
> 
> [ even if for x86 smp_mb__{before,after}_clear_bit are a nop, tying that
>   to switch_mm() semantics just reeks ]
> 
> See for example the sparc64 implementation of switch_mm() that only sets
> cpus in mm_cpumask(), but only tlb flushes clear them. Also, I wouldn't
> know if switch_mm() implies an mb on sparc64.

Yes, I've seen that some architectures delay clearing the mask until
much later (lazy TLB shootdown, I think, is the correct term). As far as
these barriers are concerned, as long as we have one barrier before the
clear bit and one barrier after the set bit, we're fine.

So, depending on the order, we might need:

smp_mb()
clear bit
set bit
smp_mb()

or, simply (and this applies to lazy TLB shootdown on sparc64):

set bit
mb()
clear bit (possibly performed by tlb flush)

So the clear bit can occur far, far in the future; we don't care. We'll
just send a few extra, unneeded IPIs during that time frame.
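
In switch_mm() terms, the lazy-shootdown case would look something like
this (a sketch, not sparc64's actual code):

	static inline void switch_mm(struct mm_struct *old, struct mm_struct *mm,
				     struct task_struct *tsk)
	{
		unsigned int cpu = smp_processor_id();

		/* ... arch-specific MMU context switch ... */

		cpumask_set_cpu(cpu, mm_cpumask(mm));
		/*
		 * Order the mm_cpumask() update before the next task's first
		 * user-space access; pairs with the smp_mb()s in
		 * sys_membarrier(). It also sits before the (much later) lazy
		 * clear of the old mm's bit, so no further barrier is needed
		 * on that path.
		 */
		smp_mb();
	}

	/*
	 * The old mm's bit is only cleared later, e.g. from the TLB flush
	 * path; until then sys_membarrier() merely sends a few unneeded IPIs.
	 */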

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
