Message-ID: <1998d479-eb1a-4bc8-a11e-59f8dd71aadb@amd.com>
Date: Mon, 8 Jul 2024 20:04:22 +0530
From: Bharata B Rao <bharata@....com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, nikunj@....com,
"Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system
Hi Yu Zhao,
Thanks for your patches. See below...
On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> Hi Bharata,
>
> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@....com> wrote:
>>
<snip>
>>
>> Some experiments tried
>> ======================
>> 1) When MGLRU was enabled, many soft lockups were observed but no hard
>> lockups were seen during a 48-hour run. Below is one such soft lockup.
>
> This is not really an MGLRU issue -- can you please try one of the
> attached patches? It (truncate.patch) should help with or without
> MGLRU.
With truncate.patch applied and the default LRU scheme (MGLRU disabled), a few
hard lockups are seen. The first one is this:
watchdog: Watchdog detected hard LOCKUP on cpu 487
CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300
Call Trace:
<NMI>
? show_regs+0x69/0x80
? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
? native_queued_spin_lock_slowpath+0x81/0x300
</NMI>
<TASK>
? __pfx_folio_activate_fn+0x10/0x10
_raw_spin_lock_irqsave+0x5b/0x70
folio_lruvec_lock_irqsave+0x62/0x90
folio_batch_move_lru+0x9d/0x160
folio_activate+0x95/0xe0
folio_mark_accessed+0x11f/0x160
filemap_read+0x343/0x3d0
<SNIP>
blkdev_read_iter+0x6f/0x140
vfs_read+0x25b/0x340
ksys_read+0x67/0xf0
__x64_sys_read+0x19/0x20
x64_sys_call+0x1771/0x20d0
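For context on where this trace is spinning: folio_mark_accessed() batches
activations on a per-CPU folio_batch, and draining that batch goes through
folio_batch_move_lru(), which takes the per-(memcg, node) lruvec spinlock with
IRQs disabled. Roughly (a simplified paraphrase of mm/swap.c, with the
skip/refcount handling omitted, not the exact upstream code):

static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		/* takes the per-(memcg, node) lruvec->lru_lock, IRQs off */
		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
		move_fn(lruvec, folio);	/* folio_activate_fn in this trace */
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
}

With several hundred CPUs doing buffered reads of the same block device, all of
those per-CPU drains funnel into the same lruvec lock, which matches the
folio_lruvec_lock_irqsave()/_raw_spin_lock_irqsave frames above.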
This is the next one:
watchdog: Watchdog detected hard LOCKUP on cpu 219
CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
<NMI>
? show_regs+0x69/0x80
? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
? native_queued_spin_lock_slowpath+0x2b4/0x300
</NMI>
<TASK>
_raw_spin_lock_irqsave+0x5b/0x70
folio_lruvec_lock_irqsave+0x62/0x90
__page_cache_release+0x89/0x2f0
folios_put_refs+0x92/0x230
__folio_batch_release+0x74/0x90
truncate_inode_pages_range+0x16f/0x520
truncate_pagecache+0x49/0x70
ext4_setattr+0x326/0xaa0
notify_change+0x353/0x500
do_truncate+0x83/0xe0
path_openat+0xd9e/0x1090
do_filp_open+0xaa/0x150
do_sys_openat2+0x9b/0xd0
__x64_sys_openat+0x55/0x90
x64_sys_call+0xe55/0x20d0
do_syscall_64+0x7e/0x130
entry_SYSCALL_64_after_hwframe+0x76/0x7e
When this happens, the all-CPU backtrace shows one CPU in
isolate_lru_folios().
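That is the reclaim side of the same lock: shrink_inactive_list() holds
lruvec->lru_lock with IRQs disabled while isolate_lru_folios() scans the LRU
for folios to reclaim. A heavily simplified paraphrase of that region in
mm/vmscan.c (vmstat accounting omitted):

	spin_lock_irq(&lruvec->lru_lock);
	nr_taken = isolate_lru_folios(nr_to_scan, lruvec, &folio_list,
				      &nr_scanned, sc, lru);
	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
	spin_unlock_irq(&lruvec->lru_lock);

If that scan runs long (for instance because it has to skip many folios it
cannot isolate), every CPU stuck in the folio_lruvec_lock_irqsave() paths above
spins with interrupts disabled for the full duration, which would be consistent
with the hard lockup watchdog firing.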
>
>> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
>> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G L
>> 6.10.0-rc3-mglru-irqstrc #24
>> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel: Call Trace:
>> kernel: <IRQ>
>> kernel: ? show_regs+0x69/0x80
>> kernel: ? watchdog_timer_fn+0x223/0x2b0
>> kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
>> <SNIP>
>> kernel: </IRQ>
>> kernel: <TASK>
>> kernel: ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
>> kernel: ? native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel: _raw_spin_lock+0x38/0x50
>> kernel: clear_shadow_entry+0x3d/0x100
>> kernel: ? __pfx_workingset_update_node+0x10/0x10
>> kernel: mapping_try_invalidate+0x117/0x1d0
>> kernel: invalidate_mapping_pages+0x10/0x20
>> kernel: invalidate_bdev+0x3c/0x50
>> kernel: blkdev_common_ioctl+0x5f7/0xa90
>> kernel: blkdev_ioctl+0x109/0x270
>> kernel: x64_sys_call+0x1215/0x20d0
>> kernel: do_syscall_64+0x7e/0x130
>>
>> This happens to be contending on the inode i_lock spinlock.
>>
>> The preemptirqsoff trace below points to preemption being disabled for more
>> than 10s, and the lock in question is the lruvec spinlock.
>
> Also if you could try the other patch (mglru.patch) please. It should
> help reduce unnecessary rotations from deactivate_file_folio(), which
> in turn should reduce the contention on the LRU lock for MGLRU.
Testing is currently in progress with mglru.patch and MGLRU enabled.
I will get back with the results.
Regards,
Bharata.