Message-Id: <20240925134732.24431-1-ahuang12@lenovo.com>
Date: Wed, 25 Sep 2024 21:47:32 +0800
From: Adrian Huang <adrianhuang0701@...il.com>
To: Andrey Ryabinin <ryabinin.a.a@...il.com>,
Alexander Potapenko <glider@...gle.com>,
Andrey Konovalov <andreyknvl@...il.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Vincenzo Frascino <vincenzo.frascino@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Uladzislau Rezki <urezki@...il.com>
Cc: kasan-dev@...glegroups.com,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
Adrian Huang <ahuang12@...ovo.com>
Subject: [PATCH 1/1] kasan, vmalloc: avoid lock contention when depopulating vmalloc
From: Adrian Huang <ahuang12@...ovo.com>
While running the test_vmalloc stress test on a 448-core server, the
soft/hard lockups below were observed, and the OS eventually panicked.
1) Kernel config
CONFIG_KASAN=y
CONFIG_KASAN_VMALLOC=y
2) Command to reproduce
# modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8
3) OS log (full log in [1]):
watchdog: BUG: soft lockup - CPU#258 stuck for 26s!
RIP: 0010:native_queued_spin_lock_slowpath+0x504/0x940
Call Trace:
do_raw_spin_lock+0x1e7/0x270
_raw_spin_lock+0x63/0x80
kasan_depopulate_vmalloc_pte+0x3c/0x70
apply_to_pte_range+0x127/0x4e0
apply_to_pmd_range+0x19e/0x5c0
apply_to_pud_range+0x167/0x510
__apply_to_page_range+0x2b4/0x7c0
kasan_release_vmalloc+0xc8/0xd0
purge_vmap_node+0x190/0x980
__purge_vmap_area_lazy+0x640/0xa60
drain_vmap_area_work+0x23/0x30
process_one_work+0x84a/0x1760
worker_thread+0x54d/0xc60
kthread+0x2a8/0x380
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
...
watchdog: Watchdog detected hard LOCKUP on cpu 8
watchdog: Watchdog detected hard LOCKUP on cpu 42
watchdog: Watchdog detected hard LOCKUP on cpu 10
...
Shutting down cpus with NMI
Kernel Offset: disabled
pstore: backend (erst) writing error (-28)
---[ end Kernel panic - not syncing: Hard LOCKUP ]---
The issue can also be reproduced on a 192-core server and a 256-core
server.
[Root Cause]
The tight loop in purge_vmap_node() iteratively calls
kasan_release_vmalloc() to clear the corresponding shadow PTEs, and
kasan_depopulate_vmalloc_pte() acquires/releases
"init_mm.page_table_lock" for every single PTE.
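For reference, the current per-PTE locking in kasan_depopulate_vmalloc_pte()
looks like this (condensed from the diff at the end of this mail):

  static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
                                          void *unused)
  {
          unsigned long page;

          page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);

          /* One lock/unlock round trip per depopulated shadow PTE. */
          spin_lock(&init_mm.page_table_lock);
          if (likely(!pte_none(ptep_get(ptep)))) {
                  pte_clear(&init_mm, addr, ptep);
                  free_page(page);
          }
          spin_unlock(&init_mm.page_table_lock);

          return 0;
  }

Every shadow PTE released this way takes the global lock once, so hundreds
of CPUs purging vmap areas concurrently end up serializing on it.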
lock_stat shows that "init_mm.page_table_lock" tops the contention list.
The lock_stat data below was collected with the following command (thread
count reduced so that the OS does not panic); the maximum wait time is
about 600ms:
# modprobe test_vmalloc nr_threads=150 run_test_mask=0x1 nr_pages=8
<snip>
------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock: 87859653 93020601 0.27 600304.90 ...
-----------------------
init_mm.page_table_lock 54332301 [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock 6680902 [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock 31991077 [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
init_mm.page_table_lock 16321 [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
-----------------------
init_mm.page_table_lock 50278552 [<000000008ce229be>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock 5725380 [<000000009c0800ad>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock 36992410 [<00000000180bc35d>] kasan_depopulate_vmalloc_pte+0x3c/0x70
init_mm.page_table_lock 24259 [<000000003ef0e79b>] __pmd_alloc+0x1d5/0x720
...
<snip>
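For reference, the lock statistics above come from /proc/lock_stat. A
typical collection sequence (assuming CONFIG_LOCK_STAT=y) is roughly:

  # echo 0 > /proc/lock_stat              # clear old statistics
  # echo 1 > /proc/sys/kernel/lock_stat   # enable collection
  # modprobe test_vmalloc nr_threads=150 run_test_mask=0x1 nr_pages=8
  # echo 0 > /proc/sys/kernel/lock_stat   # disable collection
  # cat /proc/lock_stat                   # dump per-lock contention data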
[Solution]
After revisiting the code paths that set a kasan shadow ptep (pte
pointer), it is unlikely that the same kasan ptep is set and cleared
simultaneously by different CPUs. So, use ptep_get_and_clear() to read
and clear the PTE atomically and get rid of the spinlock.
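ptep_get_and_clear() reads and clears the PTE as a single atomic
operation. On x86 with SMP, for example, it boils down to an atomic
exchange (a rough sketch of arch/x86/include/asm/pgtable_64.h, quoted
from memory):

  static inline pte_t native_ptep_get_and_clear(pte_t *xp)
  {
          /* Atomically fetch the old PTE value and zero the entry. */
          return native_make_pte(xchg(&xp->pte, 0));
  }

No other CPU can observe a half-cleared entry, so the page_table_lock
round trip in kasan_depopulate_vmalloc_pte() becomes unnecessary.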
With this change, the maximum wait time drops to about 13ms with the
following command (all 448 cores fully stressed):
# modprobe test_vmalloc nr_threads=448 run_test_mask=0x1 nr_pages=8
<snip>
------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max ...
------------------------------------------------------------------
init_mm.page_table_lock: 109999304 110008477 0.27 13534.76
-----------------------
init_mm.page_table_lock 109369156 [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock 637661 [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock 1660 [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
-----------------------
init_mm.page_table_lock 109410237 [<000000001a135943>] kasan_populate_vmalloc_pte.part.0.isra.0+0x99/0x120
init_mm.page_table_lock 595016 [<0000000051481d84>] __pte_alloc_kernel+0x9b/0x370
init_mm.page_table_lock 3224 [<00000000a492cdc5>] __pmd_alloc+0x1d5/0x720
[More verifications on a 448-core server: Passed]
1) test_vmalloc module
* Each test is run sequentially.
2) stress-ng
* fork() and exit()
# stress-ng --fork 448 --timeout 180
* pthread
# stress-ng --pthread 448 --timeout 180
* fork()/exit() and pthread
# stress-ng --pthread 448 --fork 448 --timeout 180
The above verifications were run repeatedly for more than 24 hours.
[1] https://gist.github.com/AdrianHuang/99d12986a465cc33a38c7a7ceeb6f507
Signed-off-by: Adrian Huang <ahuang12@...ovo.com>
---
mm/kasan/shadow.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..985356811aee 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -397,17 +397,13 @@ int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
void *unused)
{
+ pte_t orig_pte = ptep_get_and_clear(&init_mm, addr, ptep);
unsigned long page;
- page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
-
- spin_lock(&init_mm.page_table_lock);
-
- if (likely(!pte_none(ptep_get(ptep)))) {
- pte_clear(&init_mm, addr, ptep);
+ if (likely(!pte_none(orig_pte))) {
+ page = (unsigned long)__va(pte_pfn(orig_pte) << PAGE_SHIFT);
free_page(page);
}
- spin_unlock(&init_mm.page_table_lock);
return 0;
}
--
2.34.1