Message-ID: <c0728f14-7373-1e19-1655-1944411290b2@efficios.com>
Date:   Wed, 6 Sep 2023 09:57:04 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Aaron Lu <aaron.lu@...el.com>,
        Julien Desfossez <jdesfossez@...italocean.com>, x86@...nel.org
Subject: Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per
 task

On 9/6/23 04:41, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 01:11:04PM -0400, Mathieu Desnoyers wrote:
>> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
>> kernel with EEVDF scheduler (commit b97d64c722598ffed42ece814a2cb791336c6679),
> 
> This is not in any way related to the actual eevdf part, perhaps just
> call it fair.

Good point.

> 
> 
>>   include/linux/sched.h |  2 ++
>>   kernel/sched/core.c   |  1 +
>>   kernel/sched/fair.c   | 14 ++++++++++++++
>>   kernel/sched/sched.h  |  2 ++
>>   4 files changed, 19 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 177b3f3676ef..1111d04255cc 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -564,6 +564,8 @@ struct sched_entity {
>>   
>>   	u64				nr_migrations;
>>   
>> +	u64				next_migration_time;
>> +
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>>   	int				depth;
>>   	struct sched_entity		*parent;
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 479db611f46e..0d294fce261d 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>   	p->se.vruntime			= 0;
>>   	p->se.vlag			= 0;
>>   	p->se.slice			= sysctl_sched_base_slice;
>> +	p->se.next_migration_time	= 0;
>>   	INIT_LIST_HEAD(&p->se.group_node);
>>   
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d92da2d78774..24ac69913005 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -960,6 +960,14 @@ int sched_update_scaling(void)
>>   
>>   static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>>   
>> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
>> +{
>> +	/* Rate limit task migration. */
>> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
>> +	       return false;
>> +	return true;
>> +}
>> +
>>   /*
>>    * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
>>    * this is probably good enough.
>> @@ -7897,6 +7905,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>   		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
>>   	}
>>   
>> +	if (want_affine && !should_migrate_task(p, prev_cpu))
>> +		return prev_cpu;
>> +
>>   	rcu_read_lock();
>>   	for_each_domain(cpu, tmp) {
>>   		/*
>> @@ -7944,6 +7955,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>>   {
>>   	struct sched_entity *se = &p->se;
>>   
>> +	/* Rate limit task migration. */
>> +	se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
>> +
>>   	if (!task_on_rq_migrating(p)) {
>>   		remove_entity_load_avg(se);
>>   
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index cf54fe338e23..c9b1a5976761 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -104,6 +104,8 @@ struct cpuidle_state;
>>   #define TASK_ON_RQ_QUEUED	1
>>   #define TASK_ON_RQ_MIGRATING	2
>>   
>> +#define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000		/* 2 ms */
>> +
>>   extern __read_mostly int scheduler_running;
>>   
>>   extern unsigned long calc_load_update;
> 
> Urgh... so we already have much of this around task_hot() /
> can_migrate_task(). And I would much rather see us extend those things
> to this wakeup migration path, rather than build a whole new parallel
> thing.

Yes, good point.
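
Something like reusing the existing hotness heuristic in the wakeup
path, perhaps (untested sketch; the helper name is made up, and the
clock and locking details are hand-waved):

/*
 * task_hot() compares rq_clock_task() of the source rq against
 * p->se.exec_start under the rq lock; from the waking CPU we only have
 * a best-effort approximation via sched_clock_cpu(), so the clock
 * bases don't match exactly.
 */
static bool wakeup_task_is_hot(struct task_struct *p, int prev_cpu)
{
	s64 delta = sched_clock_cpu(prev_cpu) - (s64)p->se.exec_start;

	return delta < (s64)sysctl_sched_migration_cost;
}

/* In select_task_rq_fair(), before the sched domain walk: */
	if (want_affine && wakeup_task_is_hot(p, prev_cpu))
		return prev_cpu;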

> 
> Also:
> 
>> I have noticed that in order to observe the speedup, the workload needs
>> to keep the CPUs sufficiently busy to cause runqueue lock contention,
>> but not so busy that they don't go idle.
> 
> This would suggest inhibiting pulling tasks based on rq statistics,
> instead of task stats. It doesn't matter when the task migrated last;
> what matters is that this rq doesn't want new tasks at this point.
> 
> They're not the same thing.

I suspect we could try something like this then:

When a CPU enters idle, it could grab a sched_clock() timestamp and
store it into this_rq()->enter_idle_time. When it exits idle and later
re-enters idle, it could save rq->enter_idle_time to
rq->prev_enter_idle_time and sample enter_idle_time again.

When considering the CPU as a target for task migration, if it is idle
but the delta between sched_clock_cpu(cpu_of(rq)) and that
prev_enter_idle_time is below a threshold (e.g. a few ms), then the CPU
went out of idle and back to idle quickly, which makes it a poor target
for pulling tasks for a short while.
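
In rough, untested pseudo-C (field and helper names made up):

/* New fields in struct rq (kernel/sched/sched.h): */
	u64			enter_idle_time;
	u64			prev_enter_idle_time;

/* Called each time the CPU enters idle: */
static inline void rq_note_enter_idle(struct rq *rq)
{
	rq->prev_enter_idle_time = rq->enter_idle_time;
	rq->enter_idle_time = sched_clock_cpu(cpu_of(rq));
}

/*
 * Called when considering an idle CPU as a migration target: if the rq
 * went back to idle only a short time after its previous idle entry,
 * skip it for now. The threshold would need tuning; reusing the 2 ms
 * window from the patch is just a placeholder.
 */
static inline bool rq_idle_bounced_recently(struct rq *rq)
{
	return sched_clock_cpu(cpu_of(rq)) - rq->prev_enter_idle_time <
	       SCHED_MIGRATION_RATELIMIT_WINDOW;
}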

I'll try something along these lines and see how it goes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
