Message-ID: <b470a016-3b5d-4edf-2a54-9e70f9849bc2@linux.alibaba.com>
Date:   Fri, 13 May 2022 15:05:24 +0800
From:   Tianchen Ding <dtcccc@...ux.alibaba.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched: Queue task on wakelist in the same llc if the
 wakee cpu is idle

On 2022/5/13 14:37, Peter Zijlstra wrote:
> On Fri, May 13, 2022 at 02:24:27PM +0800, Tianchen Ding wrote:
>> We notice the commit 518cd6234178 ("sched: Only queue remote wakeups
>> when crossing cache boundaries") disabled queuing tasks on wakelist when
>> the cpus share llc. This is because, at that time, the scheduler must
>> send IPIs to do ttwu_queue_wakelist.
> 
> No; this was because of cache bouncing.

As I understand it, avoiding cache bouncing is the reason for using
queue_wakelist across LLCs. The same reasoning applies to using
queue_wakelist within the same LLC now: it should be better for the
wakee cpu to handle its own rq. Are there other side effects?
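
To make the condition under discussion concrete, here is a minimal userspace sketch. It is only a model, not the kernel code: `struct cpu_model`, `share_llc()` and `queue_on_wakelist()` are hypothetical names made up for illustration, and the real check in the scheduler looks at more state than this.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for illustration; not a kernel structure. */
struct cpu_model {
	int llc_id;	/* id of the last-level cache domain of this CPU */
	bool idle;	/* true if the CPU is currently idle */
};

/* Model of "do these two CPUs share a last-level cache?" */
bool share_llc(const struct cpu_model *a, const struct cpu_model *b)
{
	return a->llc_id == b->llc_id;
}

/*
 * Decide whether the waker hands the wakeup to the wakee CPU's wakelist.
 *
 * allow_idle_same_llc == false models queuing only across cache
 * boundaries, as described for commit 518cd6234178.
 * allow_idle_same_llc == true additionally models what this RFC
 * proposes: also queue within the same LLC when the wakee CPU is idle.
 */
bool queue_on_wakelist(const struct cpu_model *waker,
		       const struct cpu_model *wakee,
		       bool allow_idle_same_llc)
{
	if (!share_llc(waker, wakee))
		return true;	/* crossing a cache boundary */
	if (allow_idle_same_llc && wakee->idle)
		return true;	/* same LLC, but the wakee CPU is idle */
	return false;		/* waker activates the task itself */
}
```

In this model the only case that changes with the proposal is a waker and an idle wakee that share an LLC; every other combination gives the same answer as before.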

> 
>> Nowadays, ttwu_queue_wakelist also
>> supports TIF_POLLING, so this is not a problem now when the wakee cpu is
>> in idle polling.
>>
>> Benefits:
>>    Queuing the task on idle cpu can help improving performance on waker cpu
>>    and utilization on wakee cpu, and further improve locality because
>>    the wakee cpu can handle its own rq. This patch helps improving rt on
>>    our real java workloads where wakeup happens frequently.
>>
>> Does this patch bring IPI flooding?
>>    For archs with TIF_POLLING_NRFLAG (e.g., x86), there will be no
>>    difference if the wakee cpu is idle polling. If the wakee cpu is idle
>>    but not polling, the later check_preempt_curr() will send IPI too.
>>
>>    For archs without TIF_POLLING_NRFLAG (e.g., arm64), the IPI is
>>    unavoidable, since the later check_preempt_curr() will send IPI when
>>    wakee cpu is idle.
>>
>> Benchmark:
>> running schbench -m 2 -t 8 on 8269CY:
>>
>> without patch:
>> Latency percentiles (usec)
>>          50.0000th: 10
>>          75.0000th: 14
>>          90.0000th: 16
>>          95.0000th: 16
>>          *99.0000th: 17
>>          99.5000th: 20
>>          99.9000th: 23
>>          min=0, max=28
>>
>> with patch:
>> Latency percentiles (usec)
>>          50.0000th: 6
>>          75.0000th: 8
>>          90.0000th: 9
>>          95.0000th: 9
>>          *99.0000th: 10
>>          99.5000th: 10
>>          99.9000th: 14
>>          min=0, max=16
>>
>> We've also tested unixbench and see about 10% improvement on Pipe-based
>> Context Switching, and no performance regression on other test cases.
>>
>> For arm64, we've tested schbench and unixbench on Kunpeng920, the
>> results show that,
> 
> What is a kunpeng and how does it's topology look?

It's an arm64 processor produced by Huawei. Its topology has NUMA nodes
and clusters. See the commit log of c5e22feffdd7 ("topology: Represent
clusters of CPUs within a die") for details.
In fact, I also tried to test on Ampere, but there may be something
wrong with my machine: the kernel only gets cache info up to L2. (This
means each cpu has a different sd_llc_id, so the patch has no effect.) :-(
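
As a rough illustration of why per-cpu sd_llc_id values make the patch a no-op, the sketch below counts how many cpu pairs report the same LLC id. The function name `count_llc_sharing_pairs()` and the plain `int` array of ids are assumptions made for this example; they are not kernel interfaces.

```c
#include <stddef.h>

/*
 * Count the CPU pairs (i < j) that report the same LLC id.
 * If every CPU has its own id, the result is 0, so a check of the
 * form "same LLC and wakee idle" can never be true for any pair.
 */
size_t count_llc_sharing_pairs(const int *llc_id, size_t ncpus)
{
	size_t pairs = 0;

	for (size_t i = 0; i < ncpus; i++)
		for (size_t j = i + 1; j < ncpus; j++)
			if (llc_id[i] == llc_id[j])
				pairs++;

	return pairs;
}
```

With ids such as {0, 1, 2, 3} no pair shares an LLC, which matches what I see on the Ampere machine; with {0, 0, 1, 1} two pairs do, and only those could take the new path.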

> 
>> the improvement is not as obvious as on x86, and
>> there's no performance regression.
> 
> x86 is wide and varied; what x86 did you test?

I've tested on Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz. Do you 
need more info on other machines?
