Message-ID: <b470a016-3b5d-4edf-2a54-9e70f9849bc2@linux.alibaba.com>
Date: Fri, 13 May 2022 15:05:24 +0800
From: Tianchen Ding <dtcccc@...ux.alibaba.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched: Queue task on wakelist in the same llc if the
wakee cpu is idle
On 2022/5/13 14:37, Peter Zijlstra wrote:
> On Fri, May 13, 2022 at 02:24:27PM +0800, Tianchen Ding wrote:
>> We noticed that commit 518cd6234178 ("sched: Only queue remote wakeups
>> when crossing cache boundaries") disabled queuing tasks on the wakelist
>> when the cpus share llc. This is because, at that time, the scheduler
>> had to send IPIs to do ttwu_queue_wakelist.
>
> No; this was because of cache bouncing.
As I understand it, avoiding cache bouncing is the reason to queue on the
wakelist across llc boundaries. The same reasoning applies to doing
queue_wakelist within the same llc now: it should be better for the wakee
cpu to handle its own rq. Will there be other side effects?
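To make that concrete, the idea in ttwu_queue_cond() is roughly the sketch
below (not the exact diff; I'm using available_idle_cpu() as shorthand for
"the wakee cpu is idle" and leaving out the existing WF_ON_CPU path):

	static inline bool ttwu_queue_cond(int cpu, int wake_flags)
	{
		/* Don't use the async wakelist on a cpu that is going down. */
		if (!cpu_active(cpu))
			return false;

		/* Different llc: queue on the remote wakelist, as before. */
		if (!cpus_share_cache(smp_processor_id(), cpu))
			return true;

		/*
		 * Same llc, but the wakee cpu is idle: queue on its wakelist
		 * too, so the wakee activates the task on its own rq.
		 */
		if (available_idle_cpu(cpu))
			return true;

		return false;
	}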
>
>> Nowadays, ttwu_queue_wakelist also
>> supports TIF_POLLING, so this is no longer a problem when the wakee cpu
>> is idle polling.
>>
>> Benefits:
>> Queuing the task on an idle cpu helps improve performance on the waker
>> cpu and utilization on the wakee cpu, and further improves locality
>> because the wakee cpu can handle its own rq. This patch helps improve rt
>> on our real Java workloads where wakeups happen frequently.
>>
>> Does this patch bring IPI flooding?
>> For archs with TIF_POLLING_NRFLAG (e.g., x86), there will be no
>> difference if the wakee cpu is idle polling. If the wakee cpu is idle
>> but not polling, the later check_preempt_curr() will send an IPI anyway.
>>
>> For archs without TIF_POLLING_NRFLAG (e.g., arm64), the IPI is
>> unavoidable, since the later check_preempt_curr() will send an IPI when
>> the wakee cpu is idle.
>>
>> Benchmark:
>> running schbench -m 2 -t 8 on 8269CY:
>>
>> without patch:
>> Latency percentiles (usec)
>> 50.0000th: 10
>> 75.0000th: 14
>> 90.0000th: 16
>> 95.0000th: 16
>> *99.0000th: 17
>> 99.5000th: 20
>> 99.9000th: 23
>> min=0, max=28
>>
>> with patch:
>> Latency percentiles (usec)
>> 50.0000th: 6
>> 75.0000th: 8
>> 90.0000th: 9
>> 95.0000th: 9
>> *99.0000th: 10
>> 99.5000th: 10
>> 99.9000th: 14
>> min=0, max=16
>>
>> We've also tested unixbench and see about 10% improvement on Pipe-based
>> Context Switching, and no performance regression on other test cases.
>>
>> For arm64, we've tested schbench and unixbench on Kunpeng920; the
>> results show that
>
> What is a kunpeng and how does its topology look?
It's an arm64 processor produced by Huawei. Its topology has NUMA and
clusters; see the commit log of c5e22feffdd7 ("topology: Represent
clusters of CPUs within a die") for details.
In fact, I also tried to test on Ampere, but there may be something wrong
with my machine: the kernel only gets cache info up to L2. (Which means
each cpu has a different sd_llc_id, so the patch will take no effect.) :-(
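For reference, "share llc" here is just cpus_share_cache(), which compares
sd_llc_id (roughly, from kernel/sched/core.c):

	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		if (this_cpu == that_cpu)
			return true;

		return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
	}

So when the llc sched domain is missing on that machine, every cpu ends up
with its own sd_llc_id and every wakeup looks cross-llc, which is why the
new same-llc condition never kicks in there.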
>
>> the improvement is not as obvious as on x86, and
>> there's no performance regression.
>
> x86 is wide and varied; what x86 did you test?
I've tested on an Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz. Do you
need more info from other machines?