Message-ID: <d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com>
Date: Wed, 3 Jul 2024 20:41:03 +0530
From: Bharata B Rao <bharata@....com>
To: linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, nikunj@....com,
 "Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
 yuzhao@...gle.com, kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>
Subject: Hard and soft lockups with FIO and LTP runs on a large system

Many soft and hard lockups are seen with the upstream kernel when running
a bunch of tests that include FIO and LTP filesystem tests on 10 NVME
disks. The lockups can appear anywhere between 2 and 48 hours into the
run. Originally this was reported on a large customer VM instance with
passthrough NVME disks on older kernels (v5.4 based). However, similar
problems were reproduced when running the tests on bare metal with the
latest upstream kernel (v6.10-rc3). Other lockups with different
signatures are seen as well, but this report discusses only those related
to the MM area. Also note that the description below refers to the
lockups seen on bare metal with the upstream kernel (and not in the VM).

The general observation is that the problem usually surfaces when the
system's free memory goes very low and page cache/buffer consumption hits
the ceiling. Most of the time the two contended locks are the lruvec and
inode->i_lock spinlocks.

- Could this be a scalability issue in LRU list handling and/or page
cache invalidation that is typical of a large system configuration?
- Are there any MM/FS tunables that could help here?

Hardware configuration
======================
Dual socket AMD EPYC 128-core processor (256 cores, 512 threads)
Memory: 1.5 TB
10 NVME - 3.5TB each
available: 2 nodes (0-1)
node 0 cpus: 0-127,256-383
node 0 size: 773727 MB
node 1 cpus: 128-255,384-511
node 1 size: 773966 MB

Workload details
================
Workload includes concurrent runs of FIO and a few FS tests from LTP.

FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all the NVME
disks (an example invocation follows the layout below). This is the
typical NVME partition layout:

nvme2n1      259:4   0   3.5T  0 disk
├─nvme2n1p1  259:6   0   256G  0 part /data_nvme2n1p1
└─nvme2n1p2  259:7   0   3.2T  0 part
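
The LTP filesystem tests were driven along these lines (an illustrative
invocation; the exact scenario files and options are an assumption here,
not taken from the actual runs):

cd /opt/ltp
./runltp -f fs -d /data_nvme2n1p1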

Though many different runs exist in the workload, the combination that
triggers the problem is a buffered-IO run with the sync ioengine:

fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30  --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest

The watchdog threshold was reduced to 5s to reproduce the problem early,
and all-CPU backtraces were enabled.
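
The 5s threshold and the all-CPU backtraces were configured via the
standard watchdog sysctls, roughly as follows (illustrative; shown here
for reproducibility rather than as the exact commands used):

sysctl -w kernel.watchdog_thresh=5
sysctl -w kernel.softlockup_all_cpu_backtrace=1
sysctl -w kernel.hardlockup_all_cpu_backtrace=1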

Problem details and analysis
============================
One of the hard lockups which was observed and analyzed in detail is this:

kernel: watchdog: Watchdog detected hard LOCKUP on cpu 284
kernel: CPU: 284 PID: 924096 Comm: cat Not tainted 6.10.0-rc3-lruvec #9
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel:  <NMI>
kernel:  ? show_regs+0x69/0x80
kernel:  ? watchdog_hardlockup_check+0x19e/0x360
<SNIP>
kernel:  ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel:  </NMI>
kernel:  <TASK>
kernel:  ? __pfx_lru_add_fn+0x10/0x10
kernel: _raw_spin_lock_irqsave+0x42/0x50
kernel: folio_lruvec_lock_irqsave+0x62/0xb0
kernel: folio_batch_move_lru+0x79/0x2a0
kernel: folio_add_lru+0x6d/0xf0
kernel: filemap_add_folio+0xba/0xe0
kernel: __filemap_get_folio+0x137/0x2e0
kernel: ext4_da_write_begin+0x12c/0x270
kernel: generic_perform_write+0xbf/0x200
kernel: ext4_buffered_write_iter+0x67/0xf0
kernel: ext4_file_write_iter+0x70/0x780
kernel: vfs_write+0x301/0x420
kernel: ksys_write+0x67/0xf0
kernel: __x64_sys_write+0x19/0x20
kernel: x64_sys_call+0x1689/0x20d0
kernel: do_syscall_64+0x6b/0x110
kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: RIP: 0033:0x7fe21c314887

With all-CPU backtraces enabled, many CPUs are seen waiting for
lruvec_lock acquisition. We measured the lruvec spinlock start, end and
hold time (htime) using sched_clock(), along with a BUG() if the hold
time was more than 10s. The case below shows that the lruvec spinlock
was held for ~25s.

kernel: vmscan: unlock_page_lruvec_irq: stime 27963327514341, etime 
27963324369895, htime 25889317166
kernel: ------------[ cut here ]------------
kernel: kernel BUG at include/linux/memcontrol.h:1677!
kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 21 PID: 3211 Comm: kswapd0 Tainted: G        W 
6.10.0-rc3-qspindbg #10
kernel: RIP: 0010:shrink_active_list+0x40a/0x520

And the corresponding trace point for the above:
kswapd0-3211    [021] dN.1. 27963.324332: mm_vmscan_lru_isolate: 
classzone=0 order=0 nr_requested=1 nr_scanned=156946361 
nr_skipped=156946360 nr_taken=1 lru=active_file

This shows that isolate_lru_folios() is scanning through a huge number
(~150 million) of folios (order=0) with the lruvec spinlock held. This
happens because a large number of folios are being skipped in order to
isolate a few ZONE_DMA folios. Though the number of folios to be scanned
is nominally bounded (32), skipped folios are not counted against that
bound, so when folios are skipped the scan can effectively become
unbounded.

Meminfo output shows that free memory is down to ~2% and the page/buffer
cache has grown very large when the lockup happens:

MemTotal:       1584835956 kB
MemFree:        27805664 kB
MemAvailable:   1568099004 kB
Buffers:        1386120792 kB
Cached:         151894528 kB
SwapCached:        30620 kB
Active:         1043678892 kB
Inactive:       494456452 kB

Often, the perf output at the time of the problem shows heavy contention
on the lruvec spinlock. Similar contention is also observed on the inode
i_lock (in the clear_shadow_entry path):

98.98%  fio    [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
    |
     --98.96%--native_queued_spin_lock_slowpath
        |
         --98.96%--_raw_spin_lock_irqsave
                   folio_lruvec_lock_irqsave
                   |
                    --98.78%--folio_batch_move_lru
                        |
                         --98.63%--deactivate_file_folio
                                   mapping_try_invalidate
                                   invalidate_mapping_pages
                                   invalidate_bdev
                                   blkdev_common_ioctl
                                   blkdev_ioctl
                                   __x64_sys_ioctl
                                   x64_sys_call
                                   do_syscall_64
                                   entry_SYSCALL_64_after_hwframe
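
A profile like the above can be captured with system-wide perf sampling,
e.g. (illustrative commands, not necessarily the exact ones used):

perf record -a -g -- sleep 30
perf report --stdio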

Some experiments tried
======================
1) When MGLRU was enabled, many soft lockups were observed, but no hard
lockups were seen during a 48-hour run. Below is one such soft lockup.

kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G             L 
6.10.0-rc3-mglru-irqstrc #24
kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
kernel: Call Trace:
kernel:  <IRQ>
kernel:  ? show_regs+0x69/0x80
kernel:  ? watchdog_timer_fn+0x223/0x2b0
kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
<SNIP>
kernel:  </IRQ>
kernel:  <TASK>
kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
kernel:  ? native_queued_spin_lock_slowpath+0x2b4/0x300
kernel:  _raw_spin_lock+0x38/0x50
kernel:  clear_shadow_entry+0x3d/0x100
kernel:  ? __pfx_workingset_update_node+0x10/0x10
kernel:  mapping_try_invalidate+0x117/0x1d0
kernel:  invalidate_mapping_pages+0x10/0x20
kernel:  invalidate_bdev+0x3c/0x50
kernel:  blkdev_common_ioctl+0x5f7/0xa90
kernel:  blkdev_ioctl+0x109/0x270
kernel:  x64_sys_call+0x1215/0x20d0
kernel:  do_syscall_64+0x7e/0x130

This happens to be contending on inode i_lock spinlock.
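
For reference, MGLRU is toggled through its sysfs interface; enabling it
looks roughly like this (an assumption about the setup, not the exact
command used):

echo y > /sys/kernel/mm/lru_gen/enabled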

The preemptirqsoff trace below points to preemption being disabled for
more than 10s, and the lock involved is the lruvec spinlock.

     # tracer: preemptirqsoff
     #
     # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
     # --------------------------------------------------------------------
     # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 
HP:0 #P:512)
     #    -----------------
     #    | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
     #    -----------------
     #  => started at: deactivate_file_folio
     #  => ended at:   deactivate_file_folio
     #
     #
     #                    _------=> CPU#
     #                   / _-----=> irqs-off/BH-disabled
     #                  | / _----=> need-resched
     #                  || / _---=> hardirq/softirq
     #                  ||| / _--=> preempt-depth
     #                  |||| / _-=> migrate-disable
     #                  ||||| /     delay
     #  cmd     pid     |||||| time  |   caller
     #     \   /        ||||||  \    |    /
          fio-2701523 128...1.    0us$: deactivate_file_folio 
<-deactivate_file_folio
          fio-2701523 128.N.1. 10382681us : deactivate_file_folio 
<-deactivate_file_folio
          fio-2701523 128.N.1. 10382683us : tracer_preempt_on 
<-deactivate_file_folio
          fio-2701523 128.N.1. 10382691us : <stack trace>
      => deactivate_file_folio
      => mapping_try_invalidate
      => invalidate_mapping_pages
      => invalidate_bdev
      => blkdev_common_ioctl
      => blkdev_ioctl
      => __x64_sys_ioctl
      => x64_sys_call
      => do_syscall_64
      => entry_SYSCALL_64_after_hwframe
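
The above was collected with the preemptirqsoff tracer; a setup along
these lines (illustrative, with the reporting threshold set to 10s)
produces such a trace:

cd /sys/kernel/tracing
echo preemptirqsoff > current_tracer
echo 10000000 > tracing_thresh   # only record latencies above 10s (usecs)
cat trace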

2) Increased low_watermark_threshold to 10% to prevent the system from
entering an extremely low memory situation. Although hard lockups weren't
seen, soft lockups (clear_shadow_entry()) were still observed.

3) AMD has a BIOS setting called NPS (Nodes Per Socket), with which a
socket can be further partitioned into smaller NUMA nodes. With NPS=4,
there are four NUMA nodes per socket, and hence 8 NUMA nodes in the
system. This was done to check if having more kswapd threads working on
fewer folios per node would make a difference. However, here too multiple
soft lockups were seen (in clear_shadow_entry(), as in the MGLRU case).
No hard lockups were observed.
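
The resulting 8-node topology can be confirmed from userspace before
rerunning the workload, e.g.:

numactl --hardware
cat /sys/devices/system/node/online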

Any insights into these lockups and suggestions are welcome!

Regards,
Bharata.
