Date:	Thu, 30 Jun 2016 13:57:10 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Clark Williams <williams@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] sched/rt: Avoid sending an IPI to a CPU already doing a
 push


Gentle ping...

-- Steve


On Fri, 24 Jun 2016 11:26:13 -0400
Steven Rostedt <rostedt@...dmis.org> wrote:

> When a CPU lowers its priority (schedules out a high priority task for a
> lower priority one), a check is made to see if any other CPU has overloaded
> RT tasks (more than one runnable RT task). The rto_mask is consulted to
> determine this, and if so, the CPU will request to pull one of those tasks
> to itself, provided the non-running RT task is of higher priority than the
> new priority of the next task to run on the current CPU.
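> 
> (As a rough, illustrative sketch that is not part of this patch: the check
> on a priority drop looks something like the following, using the same rq
> fields that appear in the diff below; note a lower prio value means a
> higher RT priority:)
> 
> 	for_each_cpu(cpu, rq->rd->rto_mask) {
> 		struct rq *src_rq = cpu_rq(cpu);
> 
> 		if (cpu == rq->cpu)
> 			continue;
> 
> 		/* only interesting if their waiting RT task outranks
> 		 * what we are about to run on this CPU */
> 		if (src_rq->rt.highest_prio.next < rq->rt.highest_prio.curr) {
> 			/* pull it here (or, with RT_PUSH_IPI, ask that
> 			 * CPU to push it to us instead) */
> 		}
> 	}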
> 
> When dealing with a large number of CPUs, the original pull logic suffered
> from heavy lock contention on a single CPU's run queue, which caused huge
> latency across all CPUs. This happened when only one CPU had overloaded RT
> tasks while a bunch of other CPUs were lowering their priority. To solve
> this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task
> push migration instead of pulling") changed the way a pull is requested.
> Instead of grabbing the lock of the overloaded CPU's runqueue, it simply
> sends an IPI to that CPU and has that CPU do the work itself.
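> 
> (For reference, the IPI mechanism is built on irq_work, as the diff below
> also shows; roughly:)
> 
> 	/* at init time: bind the push handler to an irq_work item */
> 	init_irq_work(&rt_rq->push_work, push_irq_work_func);
> 
> 	/* on a priority drop: instead of double locking the remote rq,
> 	 * queue the work on the overloaded CPU and let it push from there */
> 	irq_work_queue_on(&rq->rt.push_work, cpu);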
> 
> Although the IPI logic worked very well in removing the large latency
> build-up, it can still suffer a large latency (although not as large as
> without the IPI) due to the work done within the IPI handlers. To
> understand this issue, an explanation of the IPI logic is required.
> 
> When a CPU lowers its priority, it finds the next set bit in the rto_mask
> starting from its own CPU. That is, if bits 2 and 10 are set in the
> rto_mask, and CPU 8 lowers its priority, it will select CPU 10 to send its
> IPI to. Now, let's say that CPU 0 and CPU 1 lower their priority. They
> will both send their IPIs to CPU 2.
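> 
> (Illustrative sketch of that selection, using the generic cpumask helpers;
> the real code tracks its position in rt_rq->push_cpu instead:)
> 
> 	/* find the next RT overloaded CPU after us, wrapping around once */
> 	cpu = cpumask_next(rq->cpu, rq->rd->rto_mask);
> 	if (cpu >= nr_cpu_ids)
> 		cpu = cpumask_next(-1, rq->rd->rto_mask);
> 
> 	/* e.g. rto_mask = {2,10}: CPU 8 picks 10, CPUs 0 and 1 pick 2 */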
> 
> If the IPI of CPU 0 gets to CPU 2 first, it triggers the push logic, and
> if CPU 1 is running at a lower priority than CPU 0, CPU 2 will push its
> overloaded task to CPU 1 (due to cpupri), even though the IPI came from
> CPU 0. Even though a task was pushed, we need to make sure that there are
> no higher priority tasks still waiting. Thus an IPI is then sent on to CPU
> 10 to continue processing CPU 0's request (remember, the pushed task went
> to CPU 1).
> 
> When the IPI of CPU 1 reaches CPU 2, CPU 2 will skip the push logic
> (because it no longer has any tasks to push), but it still needs to pass
> CPU 1's priority-lowering request along to other CPUs. Thus it sends
> another IPI to CPU 10, because that bit is still set in the rto_mask.
> 
> Now CPU 10 has just finished dealing with the IPI from CPU 8, and even
> though it no longer has any RT tasks to push, it has just received two
> more IPIs (forwarded by CPU 2, to deal with CPU 0 and CPU 1). It too must
> do the work of deciding whether to forward an IPI to further rto_mask
> CPUs. If there are no more CPUs to send to, it still needs to "stop" the
> execution of the push request.
> 
> Although these IPIs are fast to process, I've traced a single CPU dealing
> with 89 IPIs in a row, on an 80 CPU machine! This was caused by an
> overloaded RT task that had a limited CPU affinity, so most of the CPUs
> sending IPIs to it couldn't do anything with it. And because the CPUs were
> very active and changed their priorities again, duplicate IPIs were sent
> out. The latency of handling the 89 IPIs was 200us (~2.3us to handle each
> IPI), as each IPI does require taking a spinlock that deals with the IPI
> itself (not a rq lock, and with very little contention).
> 
> To solve this, an ipi_count is added to rt_rq, which gets incremented when
> an IPI is sent to that runqueue. When looking for the next CPU to send an
> IPI to, the ipi_count is checked to see if that CPU is already processing
> push requests; if so, that CPU is skipped and the next CPU in the rto_mask
> is checked.
> 
> The IPI code now needs to call push_rt_tasks() instead of just
> push_rt_task(), as it will no longer receive a separate IPI for each CPU
> that is requesting a pull.
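> 
> (push_rt_tasks() is the existing helper that essentially just loops
> push_rt_task() until there is nothing left to push:)
> 
> 	static void push_rt_tasks(struct rq *rq)
> 	{
> 		/* push_rt_task() returns true as long as it moved a task */
> 		while (push_rt_task(rq))
> 			;
> 	}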
> 
> This change removes the duplication of work in the IPI logic, and greatly
> lowers the latency caused by the IPIs.
> 
> Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> ---
>  kernel/sched/rt.c    | 14 ++++++++++++--
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d5690b722691..165bcfdbd94b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -100,6 +100,7 @@ void init_rt_rq(struct rt_rq *rt_rq)
>  	rt_rq->push_flags = 0;
>  	rt_rq->push_cpu = nr_cpu_ids;
>  	raw_spin_lock_init(&rt_rq->push_lock);
> +	atomic_set(&rt_rq->ipi_count, 0);
>  	init_irq_work(&rt_rq->push_work, push_irq_work_func);
>  #endif
>  #endif /* CONFIG_SMP */
> @@ -1917,6 +1918,10 @@ static int find_next_push_cpu(struct rq *rq)
>  			break;
>  		next_rq = cpu_rq(cpu);
>  
> +		/* If pushing was already started, ignore */
> +		if (atomic_read(&next_rq->rt.ipi_count))
> +			continue;
> +
>  		/* Make sure the next rq can push to this rq */
>  		if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
>  			break;
> @@ -1955,6 +1960,7 @@ static void tell_cpu_to_push(struct rq *rq)
>  		return;
>  
>  	rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
> +	atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
>  
>  	irq_work_queue_on(&rq->rt.push_work, cpu);
>  }
> @@ -1974,11 +1980,12 @@ static void try_to_push_tasks(void *arg)
>  
>  	rq = cpu_rq(this_cpu);
>  	src_rq = rq_of_rt_rq(rt_rq);
> +	WARN_ON_ONCE(!atomic_read(&rq->rt.ipi_count));
>  
>  again:
>  	if (has_pushable_tasks(rq)) {
>  		raw_spin_lock(&rq->lock);
> -		push_rt_task(rq);
> +		push_rt_tasks(rq);
>  		raw_spin_unlock(&rq->lock);
>  	}
>  
> @@ -2000,7 +2007,7 @@ again:
>  	raw_spin_unlock(&rt_rq->push_lock);
>  
>  	if (cpu >= nr_cpu_ids)
> -		return;
> +		goto out;
>  
>  	/*
>  	 * It is possible that a restart caused this CPU to be
> @@ -2011,7 +2018,10 @@ again:
>  		goto again;
>  
>  	/* Try the next RT overloaded CPU */
> +	atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
>  	irq_work_queue_on(&rt_rq->push_work, cpu);
> +out:
> +	atomic_dec(&rq->rt.ipi_count);
>  }
>  
>  static void push_irq_work_func(struct irq_work *work)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de607e4febd9..b47d580dfa84 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -476,6 +476,8 @@ struct rt_rq {
>  	int push_cpu;
>  	struct irq_work push_work;
>  	raw_spinlock_t push_lock;
> +	/* Used to skip CPUs being processed in the rto_mask */
> +	atomic_t ipi_count;
>  #endif
>  #endif /* CONFIG_SMP */
>  	int rt_queued;
