Message-ID: <20240930072355.2892-1-yongli-oc@zhaoxin.com>
Date: Mon, 30 Sep 2024 15:23:52 +0800
From: yongli-oc <yongli-oc@...oxin.com>
To: <peterz@...radead.org>, <mingo@...hat.com>, <will@...nel.org>,
<longman@...hat.com>, <boqun.feng@...il.com>
CC: <linux-kernel@...r.kernel.org>, <yongli@...oxin.com>,
<louisqi@...oxin.com>, <cobechen@...oxin.com>, <jiangbowang@...oxin.com>
Subject: [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic
This patch series changes the osq lock tail to 2 bytes when
CONFIG_LOCK_SPIN_ON_OWNER=y, fixes some coding problems, and
adds more comments.
Since the 2-byte and the 4-byte osq lock both access the same
cacheline with LOCK# asserted, the speed should be essentially
the same.
To compare the performance of the two kinds of osq lock,
I used locktorture and set the CPU affinity for each mutex_lock
write kthread. Each result is an average of 9 test runs.
locktorture, SET CPU AFFINITY, AMD EPYC 7551 32-core, 2 sockets
Writers 6.6.28 6.6.28-osq-dynamic disable 6.6.28-osq-dynamic enable
stress 4-byte 2-byte tail 2-byte tail
Average CV Average CV Improve Average CV Improve
1 21047265 3.48% 21331993 12.16% 1.35% 21359519 6.76% 1.48%
2 39186677 5.66% 40348197 6.44% 2.96% 39387961 4.18% 0.51%
4 43467264 3.63% 44133849 4.95% 1.53% 38961218 7.01% -10.37%
8 43780445 6.67% 48887433 3.31% 11.66% 41725007 5.29% -4.69%
16 41407176 4.19% 51042178 3.45% 23.27% 71381112 2.75% 72.39%
32 46000746 6.63% 50060246 14.19% 8.82% 79361487 2.29% 72.52%
48 44235011 5.20% 44988160 7.22% 1.70% 79779501 4.88% 80.35%
64 59054128 4.02% 62233006 2.00% 5.38% 112695286 7.42% 90.83%
With 1, 2, or 4 writer threads, the 2-byte osq lock performs nearly
the same as the 4-byte lock. With 8, 16, or 32 threads it performs
better than the 4-byte lock, and with more threads the performance
tends to converge again. With dynamic switching turned on, the
2-byte lock shows a small degradation at 4 and 8 threads, possibly
because lock contention accumulates but never satisfies the
switching conditions. With 16 threads or more, the performance
improvement approaches 80 percent.
v1:
The dynamic numa-aware osq_lock adds numa-architecture support on
top of the kernel's in-box osq_lock.
After enabling it with echo 1 > /proc/zx_numa_lock/dynamic_enable,
the patch keeps checking how many processes are waiting for an osq
lock. If the count exceeds a threshold, the patch stops these
processes, switches to the numa-aware osq lock, then restarts them.
Through a cleanup work queue, the numa-aware osq lock turns back
into an osq_lock once all nodes are unlocked, and all the numa-aware
lock memory is returned to a pre-allocated Linux kernel memory cache.
The struct optimistic_spin_queue of the dynamic numa-aware lock is
also 4 bytes, the same as the in-box osq lock. When dynamic switching
is enabled, it is accessed as three members through a union. The tail
becomes tail16, 2 bytes, supporting up to 65534 CPU cores; the other
two members, 1 byte each, hold the lock switch state and the numa
memory index.
A serial is added to struct optimistic_spin_node to track how many
processes are waiting for an osq lock. Each time a process applies
for an osq lock, the serial is incremented by 1.
We have done some performance evaluation of the dynamic numa-aware
osq lock with perf, locktorture, unixbench, fxmark, etc.
fxmark: Filesystem Multicore Scalability Benchmark
https://github.com/sslab-gatech/fxmark
The following results were obtained on a Zhaoxin KH40000 32-core
processor or a 32+32-core, two-socket processor, and an AMD EPYC
7551 32-core, two-socket processor. Since I am not very familiar
with AMD CPUs, the code supporting AMD CPUs is a sample only.
The number under Average represents an average of five test runs.
CV is the Coefficient of Variation.
The kernel source code is 6.6.28 stable, compiled with the default
configuration.
6.6.28-osq-dynamic is kernel 6.6.28 with the patch applied and
dynamic switching enabled.
The OS is Ubuntu 22.04.02 LTS, gcc version 9.5.0.
perf bench Zhaoxin KH40000 32 cores
kernel 6.6.28 6.6.28-osq-dynamic
epoll Average CV Average CV Improve
ADD 25620 0.78% 64609 2.55% 152.18%
WAIT 7134 1.77% 11098 0.52% 55.56%
locktorture Zhaoxin KH40000 32 cores
kernel 6.6.28 6.6.28-osq-dynamic
lock torture Average CV Average CV Improve
mutex_lock Writes 7433503 1.59% 17979058 1.90% 141.87%
unixbench Zhaoxin KH40000 32+32 cores, run on ssd
64 copys 6.6.28 6.6.28-osq-dynamic
System Benchmarks Partial Average CV Average CV Improve
Execl Throughput 1460.18 1.18% 1865.22 0.25% 27.74%
File Copy 1024 bufsize 200 549.94 0.62% 1221.32 6.71% 122.08%
File Copy 256 bufsize 500 339.62 2.20% 896.58 6.57% 164.00%
File Copy 4096 bufsize 800 1173.68 1.88% 2089.7 5.20% 78.05%
Pipe Throughput 52122.26 0.18% 53842.72 0.15% 3.30%
Pipe-based Context Switchi 18340.38 0.92% 19874.66 0.80% 8.37%
Process Creation 2325.12 0.18% 2178.16 0.21% -6.32%
Shell Scripts (1 concurren 7414.32 0.29% 8458.5 0.10% 14.08%
Shell Scripts (16 concurre
Shell Scripts (8 concurren 7156.48 0.10% 8132.42 0.14% 13.64%
System Call Overhead 1476.9 0.14% 1574.32 0.09% 6.60%
System Benchmarks Index Sc 2982.64 0.33% 4008.66 0.94% 34.40%
fxmark Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
parallel cores 32 24
6.6.28 vs 6.6.28-osq-dynamic 6.6.28 vs 6.6.28-osq-dynamic
item Improve Average,CV:Average,CV Improve CV:CV
DWAL -0.17% ( 455895, 0.14%: 455115, 0.37%) -0.07% ( 0.42%: 0.44%)
DWOL 1.10% (32166648, 2.64%:32521877, 2.06%) -0.68% ( 2.54%: 3.04%)
DWOM 51.63% ( 496955, 4.34%: 753509, 8.32%) 45.93% ( 3.14%: 2.57%)
DWSL 1.67% ( 20229, 2.34%: 20566, 3.18%) -1.74% ( 1.96%: 2.66%)
MWRL 71.00% ( 348097, 0.92%: 595241, 1.26%) 65.95% ( 0.65%: 2.27%)
MWRM 63.06% ( 6750, 3.33%: 11007, 4.31%) 60.18% ( 5.67%: 4.81%)
MWCL 16.99% ( 149628, 1.66%: 175054, 0.82%) 16.96% ( 2.57%: 0.51%)
MWCM 80.97% ( 9448, 4.66%: 17098, 0.96%) 73.79% ( 5.37%: 1.79%)
MWUM 37.73% ( 16858, 3.13%: 23220, 3.42%) 31.16% ( 3.59%: 1.62%)
MWUL 12.83% ( 45275, 3.90%: 51083, 3.25%) 19.94% ( 4.19%: 1.98%)
DWTL 41.44% ( 85255, 5.01%: 120583, 9.83%) 45.07% ( 6.42%: 6.11%)
MRPL -2.63% (11448731, 1.91%:11147179, 4.18%) -0.56% ( 1.65%: 3.33%)
MRPM 0.29% ( 5423233, 1.77%: 5438929, 2.59%) -10.54% (15.85%:16.42%)
MRPH -0.49% ( 688629, 2.84%: 685266, 2.88%) -18.99% (15.00%:24.98%)
MRDM 8.42% ( 3662627, 0.76%: 3971133, 0.45%) 4.53% ( 1.77%: 1.72%)
MRDL 6.25% ( 530518, 2.75%: 563671, 5.33%) 12.43% (25.88%:26.91%)
DRBH 7.16% ( 388144, 7.88%: 415933,17.87%) -20.61% (29.12%:21.91%)
DRBM -4.34% ( 381710, 5.51%: 365159, 3.15%) -16.93% (27.15%:29.85%)
DRBL -0.17% (46227341, 2.50%:46147935, 2.89%) -4.03% ( 4.01%: 5.30%)
fxmark Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
parallel cores 2 1
6.6.28 vs 6.6.28-osq-dynamic 6.6.28 vs 6.6.28-osq-dynamic
item Improve CV:CV Improve CV: CV
DWAL 1.78% (0.31%:0.20%) 6.36% (2.52%: 0.67%)
DWOL 2.46% (2.26%:2.53%) 1.83% (2.69%: 3.07%)
DWOM 2.70% (2.58%:3.12%) 2.22% (2.67%: 3.79%)
DWSL 3.28% (2.90%:3.38%) 4.41% (1.32%: 1.36%)
MWRL -0.76% (1.46%:1.94%) -0.82% (2.04%: 2.32%)
MWRM 1.94% (4.38%:0.89%) -2.05% (4.07%: 5.16%)
MWCL -0.07% (1.36%:3.84%) -2.17% (1.58%: 3.04%)
MWCM 1.85% (2.95%:4.68%) 0.28% (0.45%: 2.48%)
MWUM -2.85% (1.48%:2.01%) -3.06% (1.47%: 1.97%)
MWUL -1.46% (0.58%:2.27%) -2.98% (0.71%: 2.11%)
DWTL 0.40% (3.89%:4.35%) -2.68% (4.04%: 3.15%)
MRPL 3.11% (1.38%:0.35%) -4.81% (0.32%:16.52%)
MRPM 2.99% (0.29%:1.19%) 3.50% (0.56%: 0.78%)
MRPH 3.01% (1.10%:1.42%) 5.06% (1.18%: 1.73%)
MRDM -1.67% (4.59%:5.58%) -3.30% (0.23%: 8.01%)
MRDL 1.94% (1.56%:4.39%) -0.55% (0.88%: 9.57%)
DRBH 7.24% (7.07%:7.10%) 3.36% (3.30%: 2.95%)
DRBM 4.40% (5.11%:0.74%) -2.55% (0.46%: 3.28%)
DRBL 5.50% (5.58%:0.30%) -1.00% (0.71%: 5.21%)
(Some tests show more than a 10% loss, but their CV is also more
than 10%, so those results are not stable.)
perf bench AMD EPYC 7551 32-core, 2 sockets
kernel 6.6.28 6.6.28-osq-dynamic
epoll Average CV Average CV Improve
ADD 15258 2.30% 62160 2.40% 307.38%
WAIT 3861 4.20% 6990 16.77% 81.03%
locktorture AMD EPYC 7551 32-core, 2 sockets
kernel 6.6.28 6.6.28-osq-dynamic
lock torture Average CV Average CV Improve
mutex_lock Writes 10435064 3.14% 22627890 4.92% 116.84%
unixbench AMD EPYC 7551 32-core, 2 sockets. run on ramdisk
64 copys 6.6.28 6.6.28-osq-dynamic
System Benchmarks Partial Average CV Average CV Improve
Execl Throughput 2677.18 0.90% 3451.76 0.22% 28.93%
File Copy 1024 bufsize 200 815.2 0.59% 1999.54 0.36% 145.28%
File Copy 256 bufsize 500 504.6 0.69% 1359.6 0.49% 169.44%
File Copy 4096 bufsize 800 1842.76 1.24% 3236.48 1.40% 75.63%
Pipe Throughput 57748.74 0.01% 57539.6 0.03% -0.36%
Pipe-based Context Switchi 20882.18 0.57% 20525.38 0.57% -1.71%
Process Creation 4523.98 0.20% 4784.98 0.10% 5.77%
Shell Scripts (1 concurren 13136.54 0.06% 15883.6 0.35% 20.91%
Shell Scripts (16 concurre
Shell Scripts (8 concurren 12883.82 0.14% 15640.32 0.20% 21.40%
System Call Overhead 3533.74 0.04% 3544.16 0.02% 0.29%
System Benchmarks Index Sc 4809.38 0.23% 6575.44 0.14% 36.72%
fxmark AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
parallel cores 64 32
6.6.28 vs 6.6.28-osq-dynamic 6.6.28 vs 6.6.28-osq-dynamic
item Improve Average, CV : Average, CV Improve CV : CV
DWAL -0.22% ( 24091112, 0.31%: 24038426, 0.52%) -0.26% (0.10%: 0.12%)
DWOL 2.21% ( 86569869, 0.36%: 88479947, 0.27%) 1.99% (0.41%: 0.29%)
DWOM 210.41% ( 425986, 0.77%: 1322320, 0.28%) 128.86% (0.59%: 0.46%)
DWSL 1.27% ( 70260252, 0.39%: 71149334, 0.37%) 1.19% (0.31%: 0.22%)
MWRL 0.85% ( 489865, 0.22%: 494045, 0.25%) 2.29% (0.12%: 0.33%)
MWRM 96.28% ( 149042, 0.45%: 292540, 3.55%) 60.10% (2.49%: 0.38%)
MWCL -5.44% ( 772582, 2.92%: 730585, 0.80%) 0.32% (2.41%: 2.56%)
MWCM 53.89% ( 153857, 1.92%: 236774, 0.46%) 23.84% (0.72%: 0.50%)
MWUM 88.20% ( 214551, 3.90%: 403790, 0.41%) 62.81% (0.80%: 1.12%)
MWUL -8.26% ( 970810, 1.63%: 890615, 1.63%) -6.73% (3.01%: 1.61%)
DWTL 5.90% ( 5522297, 0.49%: 5847951, 0.18%) 5.03% (0.44%: 0.08%)
MRPL -1.10% ( 39707577, 0.07%: 39268812, 0.03%) -1.30% (0.18%: 0.07%)
MRPM -0.63% ( 16446350, 0.47%: 16341936, 0.40%) 0.45% (0.15%: 0.45%)
MRPH -0.03% ( 3805484, 0.50%: 3804248, 0.12%) 3.02% (1.54%: 0.36%)
MRDM 49.41% ( 20178742, 1.89%: 30148449, 1.01%) 17.58% (1.19%: 0.85%)
MRDL -1.95% (227253170, 0.48%:222825409, 1.34%) -1.80% (0.32%: 0.54%)
DRBH 6.01% ( 1045587, 1.91%: 1108467, 0.64%) 0.12% (0.13%: 0.30%)
DRBM 0.65% (117702744, 0.31%:118473408, 0.87%) 1.12% (0.25%: 1.18%)
DRBL 0.93% (121770444, 0.42%:122905957, 0.25%) 1.59% (0.31%: 0.40%)
fxmark AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
parallel cores 2 1
6.6.28 vs 6.6.28-osq-dynamic 6.6.28 vs 6.6.28-osq-dynamic
item Improve CV : CV Improve CV : CV
DWAL -0.74% (0.33%: 0.19%) -1.02% (0.19%: 0.34%)
DWOL 1.50% (0.36%: 0.44%) 1.89% (0.30%: 0.36%)
DWOM -2.00% (0.73%: 0.38%) 2.43% (0.35%: 0.29%)
DWSL 1.03% (0.34%: 0.54%) 1.18% (0.46%: 0.61%)
MWRL 0.93% (0.39%: 0.18%) 2.25% (1.28%: 1.78%)
MWRM -0.30% (0.60%: 0.47%) 0.17% (0.58%: 0.47%)
MWCL -1.28% (0.41%: 0.66%) -0.38% (0.19%: 0.44%)
MWCM -1.23% (0.36%: 0.23%) -1.42% (0.41%: 0.54%)
MWUM -2.28% (0.57%: 0.75%) -1.11% (0.82%: 0.21%)
MWUL -1.87% (0.64%: 0.50%) -1.75% (0.58%: 0.65%)
DWTL 0.36% (0.09%: 0.12%) 0.19% (0.09%: 0.09%)
MRPL -1.45% (0.37%: 0.31%) -1.35% (0.12%: 0.54%)
MRPM -0.58% (0.30%: 0.11%) -1.04% (0.18%: 0.31%)
MRPH 0.79% (3.92%: 0.48%) -0.53% (0.68%: 0.33%)
MRDM -0.55% (0.93%: 0.44%) -0.13% (0.43%: 0.67%)
MRDL -0.11% (0.56%: 0.19%) 0.68% (0.71%: 0.49%)
DRBH 0.09% (1.31%: 0.87%) 2.75% (0.68%: 0.45%)
DRBM 1.09% (0.19%: 1.05%) 1.60% (0.15%: 0.72%)
DRBL 3.26% (1.00%: 0.56%) 2.34% (0.36%: 0.23%)
From the test results, under heavy contention the dynamic numa-aware
lock performs better than the in-box osq_lock. When not too many
processes apply for a lock, performance is nearly the same as the
in-box osq_lock.
---
Changes since v1 (based on Longman's reviews):
#1 Fix the patch bisection from v1.
#2 Modify some code, such as definitions, special values in macros,
and cpu_relax().
#3 Add some comments.
yongli-oc (3):
locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
locking/osq_lock: Define osq by union to support dynamic numa-aware
lock.
locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa
lock/unlock.
include/linux/osq_lock.h | 33 ++-
kernel/Kconfig.numalocks | 17 ++
kernel/locking/Makefile | 3 +
kernel/locking/numa.h | 90 ++++++
kernel/locking/numa_osq.h | 29 ++
kernel/locking/x_osq_lock.c | 371 ++++++++++++++++++++++++
kernel/locking/zx_numa.c | 540 +++++++++++++++++++++++++++++++++++
kernel/locking/zx_numa_osq.c | 497 ++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 1 +
9 files changed, 1580 insertions(+), 1 deletion(-)
create mode 100644 kernel/Kconfig.numalocks
create mode 100644 kernel/locking/numa.h
create mode 100644 kernel/locking/numa_osq.h
create mode 100644 kernel/locking/x_osq_lock.c
create mode 100644 kernel/locking/zx_numa.c
create mode 100644 kernel/locking/zx_numa_osq.c
--
2.34.1