linux-kernel - Re: [RFC] sched/fair: Use wake_q length as a hint for wake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87efqln7xs.fsf@arm.com>
Date:   Mon, 02 Oct 2017 11:46:52 +0100
From:   Brendan Jackman <brendan.jackman@....com>
To:     linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>
Cc:     Joel Fernandes <joelaf@...gle.com>,
        Andres Oportus <andresoportus@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Josef Bacik <josef@...icpanda.com>,
        Mike Galbraith <efault@....de>,
        Matt Fleming <matt@...eblueprint.co.uk>
Subject: Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

Hi PeterZ,

I just got this in my inbox and noticed I didn't adress it to anyone. I
meant to address it to you.

On Fri, Sep 29 2017 at 17:05, Brendan Jackman wrote:
> There has been a bit of discussion on this RFC, but before I do any
> more work I'd really like your input on the basic idea.
>
> Does the approach of feeding outside info like wake_q length into the
> scheduler's decisions seem basically acceptable?
>
> Cheers,
> Brendan
>
>
> On Fri, Aug 11 2017 at 09:45, Brendan Jackman wrote:
>> This patch adds a parameter to select_task_rq, sibling_count_hint
>> allowing the caller, where it has this information, to inform the
>> sched_class the number of tasks that are being woken up as part of
>> the same event.
>>
>> The wake_q mechanism is one case where this information is available.
>>
>> select_task_rq_fair can then use the information to detect that it
>> needs to widen the search space for task placement in order to avoid
>> overloading the last-level cache domain's CPUs.
>>
>>                                * * *
>>
>> The reason I am investigating this change is the following use case
>> on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
>> all repeatedly do X amount of work then
>> pthread_barrier_wait (i.e. sleep until the last task finishes its X
>> and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
>> finish faster, and then those CPUs pull over the tasks that are still
>> running:
>>
>>      v CPU v           ->time->
>>
>>                     -------------
>>    0  (big)         11111  /333
>>                     -------------
>>    1  (big)         22222   /444|
>>                     -------------
>>    2  (LITTLE)      333333/
>>                     -------------
>>    3  (LITTLE)      444444/
>>                     -------------
>>
>> Now when task 4 hits the barrier (at |) and wakes the others up,
>> there are 4 tasks with prev_cpu=<big> and 0 tasks with
>> prev_cpu=<little>. want_affine therefore means that we'll only look
>> in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
>> on the bigs until the next load balance, something like this:
>>
>>      v CPU v           ->time->
>>
>>                     ------------------------
>>    0  (big)         11111  /333  31313\33333
>>                     ------------------------
>>    1  (big)         22222   /444|424\4444444
>>                     ------------------------
>>    2  (LITTLE)      333333/          \222222
>>                     ------------------------
>>    3  (LITTLE)      444444/            \1111
>>                     ------------------------
>>                                  ^^^
>>                            underutilization
>>
>> So, I'm trying to get want_affine = 0 for these tasks.
>>
>> I don't _think_ any incarnation of the wakee_flips mechanism can help
>> us here because which task is waker and which tasks are wakees
>> generally changes with each iteration.
>>
>> However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
>> nice property that we know exactly how many tasks are being woken, so
>> we can cheat.
>>
>> It might be a disadvantage that we "widen" _every_ task that's woken in
>> an event, while select_idle_sibling would work fine for the first
>> sd_llc_size - 1 tasks.
>>
>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>> necessary on SMP systems, so it might be best guarded by the presence
>> of SD_ASYM_CPUCAPACITY?
>>
>>                                * * *
>>
>> Final note..
>>
>> In order to observe "perfect" behaviour for this use case, I also had
>> to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
>> above we are working through the work queue and have placed tasks 3
>> and 2, and are about to place task 1:
>>
>>      v CPU v           ->time->
>>
>>                     --------------
>>    0  (big)         11111  /333  3
>>                     --------------
>>    1  (big)         22222   /444|4
>>                     --------------
>>    2  (LITTLE)      333333/      2
>>                     --------------
>>    3  (LITTLE)      444444/          <- Task 1 should go here
>>                     --------------
>>
>> If TTWU_QUEUE is enabled, we will not yet have enqueued task
>> 2 (having instead sent a reschedule IPI) or attached its load to CPU
>> 2. So we are likely to also place task 1 on cpu 2. Disabling
>> TTWU_QUEUE means that we enqueue task 2 before placing task 1,
>> solving this issue. TTWU_QUEUE is there to minimise rq lock
>> contention, and I guess that this contention is less of an issue on
>> big.LITTLE systems since they have relatively few CPUs, which
>> suggests the trade-off makes sense here.
>>
>> Signed-off-by: Brendan Jackman <brendan.jackman@....com>
>> Cc: Ingo Molnar <mingo@...hat.com>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Josef Bacik <josef@...icpanda.com>
>> Cc: Joel Fernandes <joelaf@...gle.com>
>> Cc: Mike Galbraith <efault@....de>
>> Cc: Matt Fleming <matt@...eblueprint.co.uk>
>> ---
>>  include/linux/sched/wake_q.h |  2 ++
>>  kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
>>  kernel/sched/deadline.c      |  3 ++-
>>  kernel/sched/fair.c          | 17 +++++++++++------
>>  kernel/sched/idle_task.c     |  3 ++-
>>  kernel/sched/rt.c            |  3 ++-
>>  kernel/sched/sched.h         |  3 ++-
>>  kernel/sched/stop_task.c     |  3 ++-
>>  8 files changed, 46 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
>> index d03d8a9047dc..607a888eb35b 100644
>> --- a/include/linux/sched/wake_q.h
>> +++ b/include/linux/sched/wake_q.h
>> @@ -33,6 +33,7 @@
>>  struct wake_q_head {
>>  	struct wake_q_node *first;
>>  	struct wake_q_node **lastp;
>> +	int count;
>>  };
>>
>>  #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
>> @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
>>  {
>>  	head->first = WAKE_Q_TAIL;
>>  	head->lastp = &head->first;
>> +	head->count = 0;
>>  }
>>
>>  extern void wake_q_add(struct wake_q_head *head,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0869b20fba81..ddf9257b1467 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>  	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
>>  		return;
>>
>> +	head->count++;
>> +
>>  	get_task_struct(task);
>>
>>  	/*
>> @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>>  	head->lastp = &node->next;
>>  }
>>
>> +static int
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> +	       int sibling_count_hint);
>> +
>>  void wake_up_q(struct wake_q_head *head)
>>  {
>>  	struct wake_q_node *node = head->first;
>> @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
>>  		task->wake_q.next = NULL;
>>
>>  		/*
>> -		 * wake_up_process() implies a wmb() to pair with the queueing
>> +		 * try_to_wake_up() implies a wmb() to pair with the queueing
>>  		 * in wake_q_add() so as not to miss wakeups.
>>  		 */
>> -		wake_up_process(task);
>> +		try_to_wake_up(task, TASK_NORMAL, 0, head->count);
>>  		put_task_struct(task);
>>  	}
>>  }
>> @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
>>   */
>>  static inline
>> -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
>> +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
>> +		   int sibling_count_hint)
>>  {
>>  	lockdep_assert_held(&p->pi_lock);
>>
>>  	if (p->nr_cpus_allowed > 1)
>> -		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
>> +		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
>> +						     sibling_count_hint);
>>  	else
>>  		cpu = cpumask_any(&p->cpus_allowed);
>>
>> @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>>   * @p: the thread to be awakened
>>   * @state: the mask of task states that can be woken
>>   * @wake_flags: wake modifier flags (WF_*)
>> + * @sibling_count_hint: A hint at the number of threads that are being woken up
>> + *                      in this event.
>>   *
>>   * If (@state & @p->state) @p->state = TASK_RUNNING.
>>   *
>> @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>>   *	   %false otherwise.
>>   */
>>  static int
>> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>> +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
>> +	       int sibling_count_hint)
>>  {
>>  	unsigned long flags;
>>  	int cpu, success = 0;
>> @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  		atomic_dec(&task_rq(p)->nr_iowait);
>>  	}
>>
>> -	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>> +	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
>> +			     sibling_count_hint);
>>  	if (task_cpu(p) != cpu) {
>>  		wake_flags |= WF_MIGRATED;
>>  		set_task_cpu(p, cpu);
>> @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
>>   */
>>  int wake_up_process(struct task_struct *p)
>>  {
>> -	return try_to_wake_up(p, TASK_NORMAL, 0);
>> +	return try_to_wake_up(p, TASK_NORMAL, 0, 1);
>>  }
>>  EXPORT_SYMBOL(wake_up_process);
>>
>>  int wake_up_state(struct task_struct *p, unsigned int state)
>>  {
>> -	return try_to_wake_up(p, state, 0);
>> +	return try_to_wake_up(p, state, 0, 1);
>>  }
>>
>>  /*
>> @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
>>  	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
>>  	 * as we're not fully set-up yet.
>>  	 */
>> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> +	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
>>  #endif
>>  	rq = __task_rq_lock(p, &rf);
>>  	update_rq_clock(rq);
>> @@ -2893,7 +2905,7 @@ void sched_exec(void)
>>  	int dest_cpu;
>>
>>  	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> -	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
>> +	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
>>  	if (dest_cpu == smp_processor_id())
>>  		goto unlock;
>>
>> @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
>>  int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
>>  			  void *key)
>>  {
>> -	return try_to_wake_up(curr->private, mode, wake_flags);
>> +	return try_to_wake_up(curr->private, mode, wake_flags, 1);
>>  }
>>  EXPORT_SYMBOL(default_wake_function);
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 755bd3f1a1a9..69a9dd407267 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
>>  static int find_later_rq(struct task_struct *task);
>>
>>  static int
>> -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		  int sibling_count_hint)
>>  {
>>  	struct task_struct *curr;
>>  	struct rq *rq;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c95880e216f6..0a9d706b62bf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
>>   * whatever is irrelevant, spread criteria is apparent partner count exceeds
>>   * socket size.
>>   */
>> -static int wake_wide(struct task_struct *p)
>> +static int wake_wide(struct task_struct *p, int sibling_count_hint)
>>  {
>>  	unsigned int master = current->wakee_flips;
>>  	unsigned int slave = p->wakee_flips;
>> -	int factor = this_cpu_read(sd_llc_size);
>> +	int llc_size = this_cpu_read(sd_llc_size);
>> +
>> +	if (sibling_count_hint >= llc_size)
>> +		return 1;
>>
>>  	if (master < slave)
>>  		swap(master, slave);
>> -	if (slave < factor || master < slave * factor)
>> +	if (slave < llc_size || master < slave * llc_size)
>>  		return 0;
>>  	return 1;
>>  }
>> @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>>   * preempt must be disabled.
>>   */
>>  static int
>> -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>> +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
>> +		    int sibling_count_hint)
>>  {
>>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>>  	int cpu = smp_processor_id();
>> @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>
>>  	if (sd_flag & SD_BALANCE_WAKE) {
>>  		record_wakee(p);
>> -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
>> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
>> +		want_affine = !wake_wide(p, sibling_count_hint) &&
>> +			      !wake_cap(p, cpu, prev_cpu) &&
>> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
>>  	}
>>
>>  	rcu_read_lock();
>> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
>> index 0c00172db63e..3c343e055110 100644
>> --- a/kernel/sched/idle_task.c
>> +++ b/kernel/sched/idle_task.c
>> @@ -9,7 +9,8 @@
>>
>>  #ifdef CONFIG_SMP
>>  static int
>> -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		    int sibling_count_hint)
>>  {
>>  	return task_cpu(p); /* IDLE tasks as never migrated */
>>  }
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 45caf937ef90..b9937dccb8b3 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
>>  static int find_lowest_rq(struct task_struct *task);
>>
>>  static int
>> -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		  int sibling_count_hint)
>>  {
>>  	struct task_struct *curr;
>>  	struct rq *rq;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index eeef1a3086d1..56ae525618e9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1419,7 +1419,8 @@ struct sched_class {
>>  	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
>>
>>  #ifdef CONFIG_SMP
>> -	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
>> +	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
>> +			       int subling_count_hint);
>>  	void (*migrate_task_rq)(struct task_struct *p);
>>
>>  	void (*task_woken) (struct rq *this_rq, struct task_struct *task);
>> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
>> index 9f69fb630853..d0ce4fbb18ef 100644
>> --- a/kernel/sched/stop_task.c
>> +++ b/kernel/sched/stop_task.c
>> @@ -11,7 +11,8 @@
>>
>>  #ifdef CONFIG_SMP
>>  static int
>> -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
>> +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
>> +		    int sibling_count_hint)
>>  {
>>  	return task_cpu(p); /* stop tasks as never migrate */
>>  }