Date:   Thu, 27 Jul 2017 10:30:03 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:     linux-kernel@...r.kernel.org, mingo@...nel.org,
        jiangshanlai@...il.com, dipankar@...ibm.com,
        akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
        josh@...htriplett.org, tglx@...utronix.de, rostedt@...dmis.org,
        dhowells@...hat.com, edumazet@...gle.com, fweisbec@...il.com,
        oleg@...hat.com, will.deacon@....com
Subject: Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option

On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:

> > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > again, we can be limited to a subset of CPUs.
> 
> OK, what is its auto-throttle policy?  One round of IPIs per jiffy or
> some such?

No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
insta-run all the time. If there are other tasks already running, we'll
not IPI unless it should preempt.

If it's idle, nobody cares..

> Does this auto-throttling also apply if the user is running a CPU-bound
> SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> immediately sleep upon being awakened?

SCHED_BATCH is even more likely to suffer wakeup latency since it will
never preempt anything.

> OK, and the rq->curr assignment is in common code, correct?  Does this
> allow the IPI-only-requesting-process approach to live entirely within
> common code?

That is the idea.

> The 2010 email thread ended up with sys_membarrier() acquiring the
> runqueue lock for each CPU,

Yes, that's something I'm not happy with. Machine wide banging of that
lock will be a performance no-no.

> because doing otherwise meant adding code to the scheduler fastpath.

And that's obviously another thing I'm not happy with either.

> Don't we still need to do this?
> 
> https://marc.info/?l=linux-kernel&m=126341138408407&w=2
> https://marc.info/?l=linux-kernel&m=126349766324224&w=2

I don't know.. those seem focused on mm_cpumask(), and we can't use that
per Will's email.


So I think we need to think anew on this, start from the ground up.

What is missing for this:

static void ipi_mb(void *info)
{
	smp_mb(); // IPIs should be serializing but paranoid
}


sys_membarrier()
{
	cpumask_var_t cpus; // zalloc_cpumask_var() etc. elided

	smp_mb(); // because sysenter isn't an unconditional mb

	for_each_online_cpu(cpu) {
		struct task_struct *p;

		rcu_read_lock();
		p = task_rcu_dereference(&cpu_curr(cpu));
		if (p && p->mm == current->mm)
			__cpumask_set_cpu(cpu, cpus);
		rcu_read_unlock();
	}

	on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
}

VS

__schedule()
{
	spin_lock(&rq->lock);
	smp_mb__after_spinlock(); // really full mb implied

	/* lots */

	if (likely(prev != next)) {

		rq->curr = next;

		context_switch() {
			switch_mm();
			switch_to();
			// neither need imply a barrier

			spin_unlock(&rq->lock);
		}
	}
}




So I think we need either switch_mm() or switch_to() to imply a full
barrier for this to work, otherwise we get:

  CPU0				CPU1


  lock rq->lock
  mb

  rq->curr = A

  unlock rq->lock

  lock rq->lock
  mb

				sys_membarrier()

				mb

				for_each_online_cpu()
				  p = A
				  // no match no IPI

				mb
  rq->curr = B

  unlock rq->lock


And that's bad, because now CPU0 doesn't have an MB happening _after_
sys_membarrier() if B matches.


So without an audit, I only know of PPC and Alpha not having a barrier in
either switch_*().

x86 obviously has barriers all over the place, arm has a super duper
heavy barrier in switch_to().

One option might be to resurrect spin_unlock_wait(), although to use
that here is really ugly too, but it would avoid thrashing the
rq->lock.

I think it'd end up having to look like:

	rq = cpu_rq(cpu);
again:
	rcu_read_lock();
	p = task_rcu_dereference(&rq->curr);
	if (p) {
		raw_spin_unlock_wait(&rq->lock);
		q = task_rcu_dereference(&rq->curr);
		if (q != p) {
			rcu_read_unlock();
			goto again;
		}
	}
	...
	...

which is just about as horrible as it looks.
