Message-ID: <c0728f14-7373-1e19-1655-1944411290b2@efficios.com>
Date: Wed, 6 Sep 2023 09:57:04 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Swapnil Sapkal <Swapnil.Sapkal@....com>,
Aaron Lu <aaron.lu@...el.com>,
Julien Desfossez <jdesfossez@...italocean.com>, x86@...nel.org
Subject: Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task
On 9/6/23 04:41, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 01:11:04PM -0400, Mathieu Desnoyers wrote:
>> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
>> kernel with EEVDF scheduler (commit b97d64c722598ffed42ece814a2cb791336c6679),
>
> This is not in any way related to the actual eevdf part, perhaps just
> call it fair.
Good point.
>
>
>> include/linux/sched.h | 2 ++
>> kernel/sched/core.c | 1 +
>> kernel/sched/fair.c | 14 ++++++++++++++
>> kernel/sched/sched.h | 2 ++
>> 4 files changed, 19 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 177b3f3676ef..1111d04255cc 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -564,6 +564,8 @@ struct sched_entity {
>>
>> u64 nr_migrations;
>>
>> + u64 next_migration_time;
>> +
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> int depth;
>> struct sched_entity *parent;
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 479db611f46e..0d294fce261d 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>> p->se.vruntime = 0;
>> p->se.vlag = 0;
>> p->se.slice = sysctl_sched_base_slice;
>> + p->se.next_migration_time = 0;
>> INIT_LIST_HEAD(&p->se.group_node);
>>
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d92da2d78774..24ac69913005 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -960,6 +960,14 @@ int sched_update_scaling(void)
>>
>> static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>>
>> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
>> +{
>> +	/* Rate limit task migration. */
>> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
>> +		return false;
>> +	return true;
>> +}
>> +
>> /*
>> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>> * this is probably good enough.
>> @@ -7897,6 +7905,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
>> }
>>
>> +	if (want_affine && !should_migrate_task(p, prev_cpu))
>> +		return prev_cpu;
>> +
>> rcu_read_lock();
>> for_each_domain(cpu, tmp) {
>> /*
>> @@ -7944,6 +7955,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>> {
>> struct sched_entity *se = &p->se;
>>
>> + /* Rate limit task migration. */
>> + se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
>> +
>> if (!task_on_rq_migrating(p)) {
>> remove_entity_load_avg(se);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index cf54fe338e23..c9b1a5976761 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -104,6 +104,8 @@ struct cpuidle_state;
>> #define TASK_ON_RQ_QUEUED 1
>> #define TASK_ON_RQ_MIGRATING 2
>>
>> +#define SCHED_MIGRATION_RATELIMIT_WINDOW 2000000 /* 2 ms */
>> +
>> extern __read_mostly int scheduler_running;
>>
>> extern unsigned long calc_load_update;
>
> Urgh... so we already have much of this around task_hot() /
> can_migrate_task(). And I would much rather see we extend those things
> to this wakeup migration path, rather than build a whole new parallel
> thing.
Yes, good point.
>
> Also:
>
>> I have noticed that in order to observe the speedup, the workload needs
>> to keep the CPUs sufficiently busy to cause runqueue lock contention,
>> but not so busy that they don't go idle.
>
> This would suggest inhibiting pulling tasks based on rq statistics,
> instead of tasks stats. It doesn't matter when the task migrated last,
> what matter is that this rq doesn't want new tasks at this point.
>
> Them not the same thing.
I suspect we could try something like this then:

When a CPU enters idle state, it could grab a sched_clock() timestamp
and store it into this_rq()->enter_idle_time. Then, when it exits
idle and later re-enters idle, it could save rq->enter_idle_time to
rq->prev_enter_idle_time, and sample enter_idle_time again.

When considering the CPU as a target for task migration, if it is
idle but the delta between sched_clock_cpu(cpu_of(rq)) and that
prev_enter_idle_time is below a threshold (e.g. a few ms), this means
the CPU got out of idle and went back to idle pretty quickly, so it
is not a good target for pulling tasks for a short while.
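
To make the bookkeeping concrete, here is a rough userspace sketch of
that idea. It is not kernel code: struct rq_model, the helper names and
the 2 ms threshold are all made up for illustration, and the "now"
argument stands in for a sched_clock() / sched_clock_cpu() reading.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the real value would need tuning. */
#define IDLE_BOUNCE_THRESHOLD_NS 2000000ULL /* 2 ms */

/* Userspace stand-in for the fields that would be added to struct rq. */
struct rq_model {
	uint64_t enter_idle_time;      /* most recent idle entry */
	uint64_t prev_enter_idle_time; /* idle entry before that one */
	bool idle;                     /* is the CPU currently idle? */
};

/* Idle entry: shift the previous timestamp down and sample a new one. */
static void rq_enter_idle(struct rq_model *rq, uint64_t now)
{
	rq->prev_enter_idle_time = rq->enter_idle_time;
	rq->enter_idle_time = now;
	rq->idle = true;
}

/* Idle exit: only the idle flag changes, timestamps are kept. */
static void rq_exit_idle(struct rq_model *rq)
{
	rq->idle = false;
}

/*
 * True if the CPU is idle now but its previous idle entry is less than
 * the threshold away, i.e. it left idle and came back quickly.
 * A zero prev_enter_idle_time is treated as "no earlier idle entry
 * recorded", which is a simplification of this sketch.
 */
static bool rq_idle_bounced_recently(const struct rq_model *rq, uint64_t now)
{
	if (!rq->idle)
		return false;
	if (!rq->prev_enter_idle_time)
		return false;
	return now - rq->prev_enter_idle_time < IDLE_BOUNCE_THRESHOLD_NS;
}
```

A wakeup or load-balance path would then skip an idle CPU as a
migration target while rq_idle_bounced_recently() returns true for it.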
I'll try something along these lines and see how it goes.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com