Message-ID: <20220513063729.GF76023@worktop.programming.kicks-ass.net>
Date: Fri, 13 May 2022 08:37:29 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tianchen Ding <dtcccc@...ux.alibaba.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched: Queue task on wakelist in the same llc if the
wakee cpu is idle
On Fri, May 13, 2022 at 02:24:27PM +0800, Tianchen Ding wrote:
> We noticed that commit 518cd6234178 ("sched: Only queue remote wakeups
> when crossing cache boundaries") disabled queueing tasks on the wakelist
> when the CPUs share LLC. This is because, at that time, the scheduler
> had to send IPIs to do ttwu_queue_wakelist.
No; this was because of cache bouncing.
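For reference, the gate that commit added to ttwu_queue() was roughly the
below; this is a from-memory sketch of the idea, not the exact hunk:

	/*
	 * Grabbing a remote rq->lock from outside its cache domain
	 * bounces the lock cacheline around; in that case queue the
	 * task on the remote wake_list and let the wakee-side CPU do
	 * the enqueue itself. Within an LLC the lock is cheap enough
	 * to just take directly.
	 */
	if (sched_feat(TTWU_QUEUE) &&
	    !cpus_share_cache(smp_processor_id(), cpu)) {
		sched_clock_cpu(cpu); /* sync clocks x-cpu */
		ttwu_queue_remote(p, cpu);
		return;
	}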
> Nowadays, ttwu_queue_wakelist() also supports TIF_POLLING, so this is
> no longer a problem when the wakee CPU is idle polling.
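The polling case referred to here: when a wakelist entry is queued for an
idle CPU, the generic code first tries to set TIF_NEED_RESCHED on a
polling idle task and only falls back to a real IPI if that fails.
Roughly, simplified from recent kernels (details vary by version):

	void send_call_function_single_ipi(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/*
		 * An idle task spinning with TIF_POLLING_NRFLAG set will
		 * notice TIF_NEED_RESCHED on its own; no interrupt needed.
		 */
		if (!set_nr_if_polling(rq->idle))
			arch_send_call_function_single_ipi(cpu);
		else
			trace_sched_wake_idle_without_ipi(cpu);
	}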
>
> Benefits:
> Queueing the task on an idle CPU can help improve performance on the
> waker CPU and utilization on the wakee CPU, and further improves
> locality because the wakee CPU handles its own rq. This patch helps
> improve rt on our real Java workloads where wakeups happen frequently.
>
> Does this patch bring IPI flooding?
> For archs with TIF_POLLING_NRFLAG (e.g., x86), there is no difference
> if the wakee CPU is idle polling. If the wakee CPU is idle but not
> polling, the later check_preempt_curr() will send an IPI too.
>
> For archs without TIF_POLLING_NRFLAG (e.g., arm64), the IPI is
> unavoidable, since the later check_preempt_curr() will send an IPI
> when the wakee CPU is idle anyway.
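As a sketch of the proposed condition (the hunk itself isn't quoted
here, so idle_cpu() stands in for whatever predicate the patch actually
uses):

	static inline bool ttwu_queue_cond(int cpu, int wake_flags)
	{
		/* As before: always queue when crossing cache boundaries. */
		if (!cpus_share_cache(smp_processor_id(), cpu))
			return true;

		/* Proposed: also queue when the wakee CPU is idle. */
		if (idle_cpu(cpu))
			return true;

		return false;
	}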
>
> Benchmark:
> Running schbench -m 2 -t 8 on an Intel Xeon Platinum 8269CY:
>
> without patch:
> Latency percentiles (usec)
> 50.0000th: 10
> 75.0000th: 14
> 90.0000th: 16
> 95.0000th: 16
> *99.0000th: 17
> 99.5000th: 20
> 99.9000th: 23
> min=0, max=28
>
> with patch:
> Latency percentiles (usec)
> 50.0000th: 6
> 75.0000th: 8
> 90.0000th: 9
> 95.0000th: 9
> *99.0000th: 10
> 99.5000th: 10
> 99.9000th: 14
> min=0, max=16
>
> We've also tested unixbench and see about a 10% improvement on
> Pipe-based Context Switching, with no performance regression on the
> other test cases.
>
> For arm64, we've tested schbench and unixbench on Kunpeng 920; the
> results show that
What is a Kunpeng and how does its topology look?
> the improvement is not as obvious as on x86, and
> there's no performance regression.
x86 is wide and varied; what x86 did you test?