lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Mon, 21 Feb 2022 16:39:17 -0000
From:   "irqchip-bot for Barry Song" <tip-bot2@...utronix.de>
To:     linux-kernel@...r.kernel.org
Cc:     Barry Song <song.bao.hua@...ilicon.com>,
        Marc Zyngier <maz@...nel.org>, tglx@...utronix.de
Subject: [irqchip: irq/irqchip-next] irqchip/gic-v3: Use dsb(ishst) to order
 writes with ICC_SGI1R_EL1 accesses

The following commit has been merged into the irq/irqchip-next branch of irqchip:

Commit-ID:     80e4e1f472889f31a4dcaea3a4eb7a565296f1f3
Gitweb:        https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms/80e4e1f472889f31a4dcaea3a4eb7a565296f1f3
Author:        Barry Song <song.bao.hua@...ilicon.com>
AuthorDate:    Sun, 20 Feb 2022 19:19:10 +13:00
Committer:     Marc Zyngier <maz@...nel.org>
CommitterDate: Mon, 21 Feb 2022 16:17:02 

irqchip/gic-v3: Use dsb(ishst) to order writes with ICC_SGI1R_EL1 accesses

A dsb(ishst) barrier should be enough to order previous writes with
the system register generating the SGI, as we only need to guarantee
the visibility of data to other CPUs in the inner shareable domain
before we send the SGI.

A micro-benchmark is written to verify the performance impact on
kunpeng920 machine with 2 sockets, each socket has 2 dies, and
each die has 24 CPUs, so totally the system has 2 * 2 * 24 = 96
CPUs. ~2% performance improvement can be seen by this benchmark.

The code of benchmark module:

 #include <linux/module.h>
 #include <linux/timekeeping.h>

 volatile int data0 ____cacheline_aligned;
 volatile int data1 ____cacheline_aligned;
 volatile int data2 ____cacheline_aligned;
 volatile int data3 ____cacheline_aligned;
 volatile int data4 ____cacheline_aligned;
 volatile int data5 ____cacheline_aligned;
 volatile int data6 ____cacheline_aligned;

 static void ipi_latency_func(void *val)
 {
 }

 static int __init ipi_latency_init(void)
 {
 	ktime_t stime, etime, delta;
 	int cpu, i;
 	int start = smp_processor_id();

 	stime = ktime_get();
 	for ( i = 0; i < 1000; i++)
 		for (cpu = 0; cpu < 96; cpu++) {
 			data0 = data1 = data2 = data3 = data4 = data5 = data6 = cpu;
 			smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
 		}
 	etime = ktime_get();

 	delta = ktime_sub(etime, stime);

 	printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
 			__func__, start, delta);

 	return 0;
 }
 module_init(ipi_latency_init);

 static void ipi_latency_exit(void)
 {
 }
 module_exit(ipi_latency_exit);

 MODULE_DESCRIPTION("IPI benchmark");
 MODULE_LICENSE("GPL");

run the below commands 10 times on both Vanilla and the kernel with this
patch:
 # taskset -c 0 insmod test.ko
 # rmmod test

The result on vanilla:
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126757449
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126784249
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126177703
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127022281
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126184883
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127374585
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:125778089
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126974441
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127357625
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126228184

The result on the kernel with this patch:
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124467401
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123474209
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123558497
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122993951
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122984223
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123323609
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124507583
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123386963
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123340664
 ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123285324

Signed-off-by: Barry Song <song.bao.hua@...ilicon.com>
[maz: tidied up commit message]
Signed-off-by: Marc Zyngier <maz@...nel.org>
Link: https://lore.kernel.org/r/20220220061910.6155-1-21cnbao@gmail.com
---
 drivers/irqchip/irq-gic-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 5e935d9..0efe1a9 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
 	 * Ensure that stores to Normal memory are visible to the
 	 * other CPUs before issuing the IPI.
 	 */
-	wmb();
+	dsb(ishst);
 
 	for_each_cpu(cpu, mask) {
 		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ