lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190607142332.GF3463@hirez.programming.kicks-ass.net>
Date:   Fri, 7 Jun 2019 16:23:32 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Jens Axboe <axboe@...nel.dk>
Cc:     Qian Cai <cai@....pw>, akpm@...ux-foundation.org, hch@....de,
        oleg@...hat.com, gkohli@...eaurora.org, mingo@...hat.com,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] block: fix a crash in do_task_dead()

On Fri, Jun 07, 2019 at 03:35:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 05, 2019 at 09:04:02AM -0600, Jens Axboe wrote:
> > How about the following plan - if folks are happy with this sched patch,
> > we can queue it up for 5.3. Once that is in, I'll kill the block change
> > that special cases the polled task wakeup. For 5.2, we go with Oleg's
> > patch for the swap case.
> 
> OK, works for me. I'll go write a proper patch.

I now have the below; I'll queue that after the long weekend and let
0-day chew on it for a while and then push it out to tip or something.


---
Subject: sched: Optimize try_to_wake_up() for local wakeups
From: Peter Zijlstra <peterz@...radead.org>
Date: Fri Jun 7 15:39:49 CEST 2019

Jens reported that significant performance can be had on some block
workloads (XXX numbers?) by special casing local wakeups. That is,
wakeups on the current task before it schedules out. Given something
like the normal wait pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (cond)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.

Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:

 - ttwu_stat(); we already had concurrency on rq stats, this now also
   brings it to task stats. -ENOCARE

 - tracepoints; it is now possible to get multiple instances of
   trace_sched_waking() (and possibly trace_sched_wakeup()) for the
   same task. Tracers will have to learn to cope.

Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.

After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a funcation call, which implies such.

Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.

This then yields an extremely simple and fast path for 'p == current'.

Cc: Qian Cai <cai@....pw>
Cc: mingo@...hat.com
Cc: akpm@...ux-foundation.org
Cc: hch@....de
Cc: gkohli@...eaurora.org
Cc: oleg@...hat.com
Reported-by: Jens Axboe <axboe@...nel.dk>
Tested-by: Jens Axboe <axboe@...nel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
---
 kernel/sched/core.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1991,6 +1991,28 @@ try_to_wake_up(struct task_struct *p, un
 	unsigned long flags;
 	int cpu, success = 0;
 
+	if (p == current) {
+		/*
+		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
+		 * == smp_processor_id()'. Together this means we can special
+		 * case the whole 'p->on_rq && ttwu_remote()' case below
+		 * without taking any locks.
+		 *
+		 * In particular:
+		 *  - we rely on Program-Order guarantees for all the ordering,
+		 *  - we're serialized against set_special_state() by virtue of
+		 *    it disabling IRQs (this allows not taking ->pi_lock).
+		 */
+		if (!(p->state & state))
+			return false;
+
+		success = 1;
+		trace_sched_waking(p);
+		p->state = TASK_RUNNING;
+		trace_sched_wakeup(p);
+		goto out;
+	}
+
 	/*
 	 * If we are going to wake up a thread waiting for CONDITION we
 	 * need to ensure that CONDITION=1 done by the caller can not be
@@ -2000,7 +2022,7 @@ try_to_wake_up(struct task_struct *p, un
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	smp_mb__after_spinlock();
 	if (!(p->state & state))
-		goto out;
+		goto unlock;
 
 	trace_sched_waking(p);
 
@@ -2030,7 +2052,7 @@ try_to_wake_up(struct task_struct *p, un
 	 */
 	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
-		goto stat;
+		goto unlock;
 
 #ifdef CONFIG_SMP
 	/*
@@ -2090,10 +2112,11 @@ try_to_wake_up(struct task_struct *p, un
 #endif /* CONFIG_SMP */
 
 	ttwu_queue(p, cpu, wake_flags);
-stat:
-	ttwu_stat(p, cpu, wake_flags);
-out:
+unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+out:
+	if (success)
+		ttwu_stat(p, cpu, wake_flags);
 
 	return success;
 }

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ