linux-kernel - Re: [RFC PATCH v2] membarrier: expedited private command

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 1 Aug 2017 01:33:09 +0000 (UTC)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Nicholas Piggin <npiggin@...il.com>
Cc:     Michael Ellerman <mpe@...erman.id.au>,
        Peter Zijlstra <peterz@...radead.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Andrew Hunter <ahh@...gle.com>,
        maged michael <maged.michael@...il.com>,
        gromer <gromer@...gle.com>, Avi Kivity <avi@...lladb.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Palmer Dabbelt <palmer@...belt.com>,
        Dave Watson <davejwatson@...com>
Subject: Re: [RFC PATCH v2] membarrier: expedited private command

----- On Jul 31, 2017, at 8:35 PM, Nicholas Piggin npiggin@...il.com wrote:

> On Mon, 31 Jul 2017 23:20:59 +1000
> Michael Ellerman <mpe@...erman.id.au> wrote:
> 
>> Peter Zijlstra <peterz@...radead.org> writes:
>> 
>> > On Fri, Jul 28, 2017 at 10:55:32AM +0200, Peter Zijlstra wrote:
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index e9785f7aed75..33f34a201255 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
>> >> *prev)
>> >>  	finish_arch_post_lock_switch();
>> >>  
>> >>  	fire_sched_in_preempt_notifiers(current);
>> >> +
>> >> +	/*
>> >> +	 * For CONFIG_MEMBARRIER we need a full memory barrier after the
>> >> +	 * rq->curr assignment. Not all architectures have one in either
>> >> +	 * switch_to() or switch_mm() so we use (and complement) the one
>> >> +	 * implied by mmdrop()'s atomic_dec_and_test().
>> >> +	 */
>> >>  	if (mm)
>> >>  		mmdrop(mm);
>> >> +	else if (IS_ENABLED(CONFIG_MEMBARRIER))
>> >> +		smp_mb();
>> >> +
>> >>  	if (unlikely(prev_state == TASK_DEAD)) {
>> >>  		if (prev->sched_class->task_dead)
>> >>  			prev->sched_class->task_dead(prev);
>> >> 
>> >>   
>> >  
>> >> a whole bunch of architectures don't in fact need this extra barrier at all.
>> >
>> > In fact, I'm fairly sure its only PPC.
>> >
>> > Because only ARM64 and PPC actually implement ACQUIRE/RELEASE with
>> > anything other than smp_mb() (for now, Risc-V is in this same boat and
>> > MIPS could be if they ever sort out their fancy barriers).
>> >
>> > TSO archs use a regular STORE for RELEASE, but all their atomics imply a
>> > smp_mb() and there are enough around to make one happen (typically
>> > mm_cpumask updates).
>> >
>> > Everybody else, aside from ARM64 and PPC must use smp_mb() for
>> > ACQUIRE/RELEASE.
>> >
>> > ARM64 has a super duper barrier in switch_to().
>> >
>> > Which only leaves PPC stranded.. but the 'good' news is that mpe says
>> > they'll probably need a barrier in switch_mm() in any case.
>> 
>> I may have been sleep deprived. We have a patch, probably soon to be
>> merged, which will add a smp_mb() in switch_mm() but *only* when we add
>> a CPU to mm_cpumask, ie. when we run on a CPU we haven't run on before.
>> 
>> I'm not across membarrier enough to know if that's sufficient, but it
>> seems unlikely?
> 
> Won't be sufficient, they need a barrier after assigning rq->curr.
> It can be avoided when switching between threads with the same mm.
> 
> I would like to see how bad membarrier performance is if we made
> that side heavy enough to avoid the barrier in context switch (e.g.,
> by taking the rq locks, or using synchronize_sched_expedited -- on
> a per-arch basis of course).
> 
> Is there some (realistic-ish) benchmark using membarrier we can
> experiment with?

Hi Nick,

I have benchmark programs using membarrier in liburcu [1]. They
represent very heavy usage of membarrier (perhaps more extreme than
reality).

The master branch is at:

git://git.liburcu.org/userspace-rcu.git

Below is a test branch I quickly hacked to make urcu use membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED:

https://github.com/compudj/userspace-rcu-dev/tree/test-mb-private

An example benchmark is under tests/benchmark

Usage: test_urcu nr_readers nr_writers duration (s) <OPTIONS>

Here is an example on my 16-core Intel machine (with hyperthreading, so don't
expect this benchmark to be very reliable) with 4 reader threads, 1 writer
thread. The writer thread in this benchmark issues a synchronize_rcu() at
each iteration, so typically 2 sys_membarrier calls. I'm compiling with
the configure flag "--disable-sys-membarrier-fallback" to have the full
read-side speed.

Using MEMBARRIER_CMD_SHARED:

$ ./test_urcu 4 1 10
SUMMARY [...] test_urcu testdur   10 nr_readers   4 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads  17522991852 nr_writes          249 nr_ops  17522992101

438'074'796 reads / reader thread / s
         25 updates / s

Using MEMBARRIER_CMD_PRIVATE_EXPEDITED:

$ ./test_urcu 4 1 10
SUMMARY [...] testdur   10 nr_readers   4 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads  12709653290 nr_writes       799668 nr_ops  12710452958

317'741'332 reads / reader thread / s
     79'966 updates / s

As a comparison, refer to the test_urcu_mb benchmark, which uses
a full memory barrier in rcu_read_lock and rcu_read_unlock, and
can therefore do without any sys_membarrier. This is our fallback
when sys_membarrier returns -ENOSYS.

$ ./test_urcu_mb 4 1 10
SUMMARY [...] test_urcu_mb testdur   10 nr_readers   4 rdur      0 wdur      0 nr_writers   1 wdelay      0 nr_reads    137290634 nr_writes      8624184 nr_ops    145914818

  3'432'265 reads / reader thread / s
    862'418 updates / s

You could try adapting my membarrier private-expedited patch to grab the
rq lock for each CPU to see the impact on the benchmark for your specific
architecture. I suspect that the overhead that will matter in that case
is not only the speed of the membarrier operation, but also its disruption
of the scheduler rq lock cachelines, which will have effect on other unrelated
workloads. You may need to benchmark another scheduler-intensive workload
executed in parallel with heavy use of sys_membarrier adapted to grab rq locks
to figure this one out.

Let me know if you have questions,

Perhaps Dave Watson will have benchmarks on hazard pointers.

Thanks,

Mathieu


[1] http://liburcu.org


> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com