Date:	Mon, 15 Jul 2013 18:14:34 +0400
From:	Kirill Tkhai <tkhai@...dex.ru>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

Hi, Peter,

15.07.2013, 10:32, "Peter Zijlstra" <peterz@...radead.org>:
> On Sat, Jul 13, 2013 at 07:45:49PM +0400, Kirill Tkhai wrote:
>
>>  ---
>>   include/linux/sched.h |    1 +
>>   kernel/sched/core.c   |   29 +++++++++++++++++++++++++----
>>   kernel/sched/debug.c  |    7 +++++++
>>   kernel/sched/stats.h  |   16 ++++++++++++++++
>>   4 files changed, 49 insertions(+), 4 deletions(-)
>>  diff --git a/include/linux/sched.h b/include/linux/sched.h
>>  index fc09d21..235a466 100644
>>  --- a/include/linux/sched.h
>>  +++ b/include/linux/sched.h
>>  @@ -964,6 +964,7 @@ struct sched_statistics {
>>           u64 nr_wakeups_affine_attempts;
>>           u64 nr_wakeups_passive;
>>           u64 nr_wakeups_idle;
>>  + atomic_t nr_wakeups_parallel;
>
> bad idea.. atomics are expensive and stats aren't generally _that_
> important.
>
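Ok. If the statistic survives at all, a plain schedstat-style counter
would avoid the atomic; a rough sketch (not the posted patch, and the
increment stays best-effort unless it moves under ->pi_lock):

	/* struct sched_statistics: plain counter, no atomic */
	u64	nr_wakeups_parallel;

	/* bump it the way the existing nr_wakeups_* counters are bumped
	 * in ttwu_stat(), accepting a rarely lost increment on the
	 * unlocked path:
	 */
	schedstat_inc(p, se.statistics.nr_wakeups_parallel);
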
>>   };
>>   #endif
>>
>>  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>  index 9d06ad6..1e1475f 100644
>>  --- a/kernel/sched/core.c
>>  +++ b/kernel/sched/core.c
>>  @@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>
>>           p->state = TASK_RUNNING;
>>   #ifdef CONFIG_SMP
>>  + /*
>>  + * Pairs with the TASK_WAKING check in try_to_wake_up()
>>  + */
>>  + smp_wmb();
>>  +
>
> Like Mike said, you're making the common case more expensive, this is
> not appreciated.
>
>>           if (p->sched_class->task_woken)
>>                   p->sched_class->task_woken(rq, p);
>>
>>  @@ -1487,20 +1492,37 @@ static int
>>   try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>   {
>>           unsigned long flags;
>>  - int cpu, success = 0;
>>  + int cpu, success = 1;
>>
>>  + /*
>>  + * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.
>
> That must be the worst barrier comment ever.. If you want it there, just
> copy/paste the commit log here. It is also completely unrelated to the
> rest of the patch.
>
>>  + */
>>           smp_wmb();
>>  +#ifdef CONFIG_SMP
>>  + if (p->state == TASK_WAKING) {
>>  + /*
>>  + * Pairs with the sets of p->state below and in ttwu_do_wakeup().
>>  + */
>>  + smp_rmb();
>>  + inc_nr_parallel_wakeups(p);
>>  + return success;
>
> This is wrong, you didn't change p->state, therefore returning 1 is not
> an option. If you want to do something like this, treat it like waking
> an already running task.
>
> Also, the barrier comments fail to explain the exact problem they're
> solving.
>
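If I read that right: detect the concurrent wakeup and leave through the
"nothing to change" path, i.e. something like (sketch only, with the
return value semantics of waking an already running task):

	if (p->state == TASK_WAKING) {
		/* Another try_to_wake_up() already owns this wakeup;
		 * we changed nothing, so don't report success.
		 */
		inc_nr_parallel_wakeups(p);
		return 0;
	}
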
>>  + }
>>  +#endif
>>           raw_spin_lock_irqsave(&p->pi_lock, flags);
>>  - if (!(p->state & state))
>>  + if (!(p->state & state)) {
>>  + success = 0;
>>                   goto out;
>>  + }
>>
>>  - success = 1; /* we're going to change ->state */
>>           cpu = task_cpu(p);
>>
>>           if (p->on_rq && ttwu_remote(p, wake_flags))
>>                   goto stat;
>>
>>   #ifdef CONFIG_SMP
>>  + p->state = TASK_WAKING;
>>  + smp_wmb();
>>  +
>
> This too is broken; the loop below needs to be completed first,
> otherwise we change p->state while the task is still on the CPU and it
> might read the wrong p->state.

This place is below the (on_rq && ttwu_remote) check, so the task is
either 'dequeued and on_cpu == 0', or it is in the middle of schedule()
on an arch that wants an unlocked context switch.

Nothing in the scheduler probes p->state between prepare_lock_switch()
and finish_lock_switch(). Archs with unlocked context switch (mips and
ia64) don't change or probe the state of the previous process during
context_switch().

The only exception is the TASK_DEAD case, but that is handled above by
the 'p->state & state' check: nobody passes TASK_DEAD to try_to_wake_up().

p->state can also be changed outside the scheduler, by ptrace, but that
doesn't look like it is affected by the earlier TASK_WAKING store.

So I don't see a problem here. Peter, what exactly do you mean?
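
Just to be sure we mean the same window, the reordering in question is
(sketch of the SMP path only):

    current try_to_wake_up():              patched try_to_wake_up():

      while (p->on_cpu)                      p->state = TASK_WAKING;
              cpu_relax();                   smp_wmb();
      smp_rmb();                             while (p->on_cpu)
      ...                                            cpu_relax();
      p->state = TASK_WAKING;                smp_rmb();

so the previous CPU may observe TASK_WAKING while it is still inside
schedule() with this task as prev; my point above is that nothing on
that path reads p->state in that window.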

The other problems are easy to fix, everything except the common-case
overhead. If that is really significant for the scheduler, I'll take it
into account in a future version.

>>           /*
>>            * If the owning (remote) cpu is still in the middle of schedule() with
>>            * this task as prev, wait until its done referencing the task.
>>  @@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>           smp_rmb();
>>
>>           p->sched_contributes_to_load = !!task_contributes_to_load(p);
>>  - p->state = TASK_WAKING;
>>
>>           if (p->sched_class->task_waking)
>>                   p->sched_class->task_waking(p);
--
