linux-kernel - Re: [RFC] sched/fair: Use wake_q length as a hint for wake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJWu+oqEO6_FteexR6ODq6fszMuRzGshDqaq61i_14H+sVdDjA@mail.gmail.com>
Date:   Mon, 18 Sep 2017 22:15:07 -0700
From:   Joel Fernandes <joelaf@...gle.com>
To:     Brendan Jackman <brendan.jackman@....com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Andres Oportus <andresoportus@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Josef Bacik <josef@...icpanda.com>,
        Mike Galbraith <efault@....de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Dietmar Eggemann <dietmar.eggemann@....com>
Subject: Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

Hi Brendan,

On Fri, Aug 11, 2017 at 2:45 AM, Brendan Jackman
<brendan.jackman@....com> wrote:
> This patch adds a parameter to select_task_rq, sibling_count_hint
> allowing the caller, where it has this information, to inform the
> sched_class the number of tasks that are being woken up as part of
> the same event.
>
> The wake_q mechanism is one case where this information is available.
>
> select_task_rq_fair can then use the information to detect that it
> needs to widen the search space for task placement in order to avoid
> overloading the last-level cache domain's CPUs.
>
>                                * * *
>
> The reason I am investigating this change is the following use case
> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
> all repeatedly do X amount of work then
> pthread_barrier_wait (i.e. sleep until the last task finishes its X
> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
> finish faster, and then those CPUs pull over the tasks that are still
> running:
>
>      v CPU v           ->time->
>
>                     -------------
>    0  (big)         11111  /333
>                     -------------
>    1  (big)         22222   /444|
>                     -------------
>    2  (LITTLE)      333333/
>                     -------------
>    3  (LITTLE)      444444/
>                     -------------

This is because of the newly idle balancing right?

>
> Now when task 4 hits the barrier (at |) and wakes the others up,
> there are 4 tasks with prev_cpu=<big> and 0 tasks with
> prev_cpu=<little>. want_affine therefore means that we'll only look
> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
> on the bigs until the next load balance, something like this:
>
>      v CPU v           ->time->
>
>                     ------------------------
>    0  (big)         11111  /333  31313\33333
>                     ------------------------
>    1  (big)         22222   /444|424\4444444
>                     ------------------------
>    2  (LITTLE)      333333/          \222222
>                     ------------------------
>    3  (LITTLE)      444444/            \1111
>                     ------------------------
>                                  ^^^
>                            underutilization

I wonder if this problem can be better solved by not having the first
newly-idle-balance pull happen based on some condition, only to be
corrected later by another balancing act?

>
> So, I'm trying to get want_affine = 0 for these tasks.
>
> I don't _think_ any incarnation of the wakee_flips mechanism can help
> us here because which task is waker and which tasks are wakees
> generally changes with each iteration.
>
> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
> nice property that we know exactly how many tasks are being woken, so
> we can cheat.
>
> It might be a disadvantage that we "widen" _every_ task that's woken in
> an event, while select_idle_sibling would work fine for the first
> sd_llc_size - 1 tasks.
>
> IIUC, if wake_affine() behaves correctly this trick wouldn't be
> necessary on SMP systems, so it might be best guarded by the presence

Actually wake_affine doesn't check for balance if previous/next cpu
are within the same shared cache domain. The difference is some time
ago it would return true for shared cache but now it returns false as
of 4.14-rc1:
http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466

Since it would return false in the above wake up cases for task 1 and
2, it would then run select_idle_sibling on the previous CPU which is
also within the big cluster, so I don't think it will make a
difference in this case... Infact what it returns probably doesn't
matter.

> of SD_ASYM_CPUCAPACITY?
>
>                                * * *
>
> Final note..
>
> In order to observe "perfect" behaviour for this use case, I also had
> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
> above we are working through the work queue and have placed tasks 3
> and 2, and are about to place task 1:
>
>      v CPU v           ->time->
>
>                     --------------
>    0  (big)         11111  /333  3
>                     --------------
>    1  (big)         22222   /444|4
>                     --------------
>    2  (LITTLE)      333333/      2
>                     --------------
>    3  (LITTLE)      444444/          <- Task 1 should go here
>                     --------------
>
> If TTWU_QUEUE is enabled, we will not yet have enqueued task
> 2 (having instead sent a reschedule IPI) or attached its load to CPU
> 2. So we are likely to also place task 1 on cpu 2. Disabling
> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
> solving this issue. TTWU_QUEUE is there to minimise rq lock
> contention, and I guess that this contention is less of an issue on
> big.LITTLE systems since they have relatively few CPUs, which
> suggests the trade-off makes sense here.

Yeah, OTOH probably once those IPIs are sent, the select path should I
feel be smart enough to factor those pending enqueues, not sure how
hard or meaningful it is to implement something like that..

>
> Signed-off-by: Brendan Jackman <brendan.jackman@....com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Josef Bacik <josef@...icpanda.com>
> Cc: Joel Fernandes <joelaf@...gle.com>
> Cc: Mike Galbraith <efault@....de>
> Cc: Matt Fleming <matt@...eblueprint.co.uk>
> ---
>  include/linux/sched/wake_q.h |  2 ++
>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>  kernel/sched/deadline.c      |  3 ++-
>  kernel/sched/fair.c          | 17 +++++++++++------
>  kernel/sched/idle_task.c     |  3 ++-
>  kernel/sched/rt.c            |  3 ++-
>  kernel/sched/sched.h         |  3 ++-
>  kernel/sched/stop_task.c     |  3 ++-
>  8 files changed, 46 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
> index d03d8a9047dc..607a888eb35b 100644
> --- a/include/linux/sched/wake_q.h
> +++ b/include/linux/sched/wake_q.h
> @@ -33,6 +33,7 @@
>  struct wake_q_head {
>         struct wake_q_node *first;
>         struct wake_q_node **lastp;
> +       int count;
>  };
>
>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>  {
>         head->first = WAKE_Q_TAIL;
>         head->lastp = &head->first;
> +       head->count = 0;
>  }
>
>  extern void wake_q_add(struct wake_q_head *head,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0869b20fba81..ddf9257b1467 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>         if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>                 return;
>
> +       head->count++;
> +
>         get_task_struct(task);
>
>         /*
> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>         head->lastp = &node->next;
>  }
>
> +static int
> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
> +              int sibling_count_hint);
> +
>  void wake_up_q(struct wake_q_head *head)
>  {
>         struct wake_q_node *node = head->first;
> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>                 task->wake_q.next = NULL;
>
>                 /*
> -                * wake_up_process() implies a wmb() to pair with the queueing
> +                * try_to_wake_up() implies a wmb() to pair with the queueing
>                  * in wake_q_add() so as not to miss wakeups.
>                  */
> -               wake_up_process(task);
> +               try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>                 put_task_struct(task);

Shouldn't you reset head->count after all the tasks have been woken up?

>         }
>  }
> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>   */
>  static inline
> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
> +                  int sibling_count_hint)

This variable seems a bit long to me, how about just sibling_count?

>  {
>         lockdep_assert_held(&p->pi_lock);
>
>         if (p->nr_cpus_allowed > 1)
> -               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
> +               cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
> +                                                    sibling_count_hint);

same.

<snip>

>
>  static int
> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
> +                 int sibling_count_hint)
>  {
>         struct task_struct *curr;
>         struct rq *rq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index eeef1a3086d1..56ae525618e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1419,7 +1419,8 @@ struct sched_class {
>         void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>
>  #ifdef CONFIG_SMP
> -       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
> +       int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
> +                              int subling_count_hint);

s/subling/sibling/


thanks,

- Joel