[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com>
Date: Wed, 3 Jul 2024 20:41:03 +0530
From: Bharata B Rao <bharata@....com>
To: linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, nikunj@....com,
"Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
yuzhao@...gle.com, kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>
Subject: Hard and soft lockups with FIO and LTP runs on a large system
Many soft and hard lockups are seen with upstream kernel when running a
bunch of tests that include FIO and LTP filesystem test on 10 NVME
disks. The lockups can appear anywhere between 2 to 48 hours. Originally
this was reported on a large customer VM instance with passthrough NVME
disks on older kernels(v5.4 based). However, similar problems were
reproduced when running the tests on bare metal with latest upstream
kernel (v6.10-rc3). Other lockups with different signatures are seen but
in this report, only those related to MM area are being discussed.
Also note that the subsequent description is related to the lockups in
bare metal upstream (and not VM).
The general observation is that the problem usually surfaces when the
system free memory goes very low and page cache/buffer consumption hits
the ceiling. Most of the times the two contended locks are lruvec and
inode->i_lock spinlocks.
- Could this be a scalability issue in LRU list handling and/or page
cache invalidation typical to a large system configuration?
- Are there any MM/FS tunables that could help here?
Hardware configuration
======================
Dual socket AMD EPYC 128 Core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB
Workload details
================
Workload includes concurrent runs of FIO and a few FS tests from LTP.
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout.
nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part
Though many different runs exist in the workload, the combination that
results in the problem is buffered-IO run with sync engine.
fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
Watchdog threshold was reduced to 5s to reproduce the problem early and
all CPU backtrace enabled.
Problem details and analysis
============================
One of the hard lockups which was observed and analyzed in detail is this:
kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <NMI>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_hardlockup_check+0x19e/0x360
<SNIP>
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: </NMI>
kernel: <TASK>
kernel: ? __pfx_lru_add_fn+0x10/0x10
kernel: _raw_spin_lock_irqsave+0x42/0x50
kernel: folio_lruvec_lock_irqsave+0x62/0xb0
kernel: folio_batch_move_lru+0x79/0x2a0
kernel: folio_add_lru+0x6d/0xf0
kernel: filemap_add_folio+0xba/0xe0
kernel: __filemap_get_folio+0x137/0x2e0
kernel: ext4_da_write_begin+0x12c/0x270
kernel: generic_perform_write+0xbf/0x200
kernel: ext4_buffered_write_iter+0x67/0xf0
kernel: ext4_file_write_iter+0x70/0x780
kernel: vfs_write+0x301/0x420
kernel: ksys_write+0x67/0xf0
kernel: __x64_sys_write+0x19/0x20
kernel: x64_sys_call+0x1689/0x20d0
kernel: do_syscall_64+0x6b/0x110
kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e kernel: RIP:
0033:0x7fe21c314887
With all CPU backtraces enabled, many CPUs are waiting for lruvec_lock
acquisition. We measured the lruvec spinlock start, end and hold
time(htime) using sched_clock(), along with a BUG() if the hold time was
more than 10s. The below case shows that lruvec spin lock was held for ~25s.
kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime
27963324369895, htime 25889317166
kernel: ------------[ cut here ]------------
kernel: kernel BUG at include/linux/memcontrol.h:1677!
kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G W
6.10.0-rc3-qspindbg #10
kernel: RIP: 0010:shrink_active_list+0x40a/0x520
And the corresponding trace point for the above:
kswapd0-3211 [021] dN.1. 27963.324332: mm_vmscan_lru_isolate:
classzone=0 order=0 nr_requested=1 nr_scanned=156946361
nr_skipped=156946360 nr_taken=1 lru=active_file
This shows that isolate_lru_folios() is scanning through a huge number
(~150million) of folios (order=0) with lruvec spinlock held. This is
happening because a large number of folios are being skipped to isolate
a few ZONE_DMA folios. Though the number of folios to be scanned is
bounded (32), there exists a genuine case where this can become
unbounded, i.e. in case where folios are skipped.
Meminfo output shows that the free memory is around ~2% and page/buffer
cache grows very high when the lockup happens.
MemTotal: 1584835956 kB
MemFree: 27805664 kB
MemAvailable: 1568099004 kB
Buffers: 1386120792 kB
Cached: 151894528 kB
SwapCached: 30620 kB
Active: 1043678892 kB
Inactive: 494456452 kB
Often times, the perf output at the time of the problem shows heavy
contention on lruvec spin lock. Similar contention is also observed with
inode i_lock (in clear_shadow_entry path)
98.98% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
|
--98.96%--native_queued_spin_lock_slowpath
|
--98.96%--_raw_spin_lock_irqsave
folio_lruvec_lock_irqsave
|
--98.78%--folio_batch_move_lru
|
--98.63%--deactivate_file_folio
mapping_try_invalidate
invalidate_mapping_pages
invalidate_bdev
blkdev_common_ioctl
blkdev_ioctl
__x64_sys_ioctl
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
Some experiments tried
======================
1) When MGLRU was enabled many soft lockups were observed, no hard
lockups were seen for 48 hours run. Below is once such soft lockup.
kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel: <IRQ>
kernel: ? show_regs+0x69/0x80
kernel: ? watchdog_timer_fn+0x223/0x2b0
kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel: </IRQ>
kernel: <TASK>
kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: _raw_spin_lock+0x38/0x50
kernel: clear_shadow_entry+0x3d/0x100
kernel: ? __pfx_workingset_update_node+0x10/0x10
kernel: mapping_try_invalidate+0x117/0x1d0
kernel: invalidate_mapping_pages+0x10/0x20
kernel: invalidate_bdev+0x3c/0x50
kernel: blkdev_common_ioctl+0x5f7/0xa90
kernel: blkdev_ioctl+0x109/0x270
kernel: x64_sys_call+0x1215/0x20d0
kernel: do_syscall_64+0x7e/0x130
This happens to be contending on inode i_lock spinlock.
Below preemptirqsoff trace points to preemption being disabled for more
than 10s and the lock in picture is lruvec spinlock.
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
# --------------------------------------------------------------------
# latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0
HP:0 #P:512)
# -----------------
# | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: deactivate_file_folio
# => ended at: deactivate_file_folio
#
#
# _------=> CPU#
# / _-----=> irqs-off/BH-disabled
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / _-=> migrate-disable
# ||||| / delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
fio-2701523 128...1. 0us$: deactivate_file_folio
<-deactivate_file_folio
fio-2701523 128.N.1. 10382681us : deactivate_file_folio
<-deactivate_file_folio
fio-2701523 128.N.1. 10382683us : tracer_preempt_on
<-deactivate_file_folio
fio-2701523 128.N.1. 10382691us : <stack trace>
=> deactivate_file_folio
=> mapping_try_invalidate
=> invalidate_mapping_pages
=> invalidate_bdev
=> blkdev_common_ioctl
=> blkdev_ioctl
=> __x64_sys_ioctl
=> x64_sys_call
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
2) Increased low_watermark_threshold to 10% to prevent system from
entering into extremely low memory situation. Although hard lockups
weren't seen, but soft lockups (clear_shadow_entry()) were still seen.
3) AMD has a BIOS setting called NPS (Nodes per socket), using which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there will be four NUMA nodes in one socket, and hence 8 NUMA nodes in
the system. This was done to check if having more number of kswapd
threads working on lesser number of folios per node would make a
difference. However here too, multiple soft lockups were seen (in
clear_shadow_entry() as seen in MGLRU case). No hard lockups were observed.
Any insights/suggestion into these lockups and suggestions are welcome!
Regards,
Bharata.
Powered by blists - more mailing lists