Message-ID: <1998d479-eb1a-4bc8-a11e-59f8dd71aadb@amd.com>
Date: Mon, 8 Jul 2024 20:04:22 +0530
From: Bharata B Rao <bharata@....com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, nikunj@....com,
 "Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
 kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system

Hi Yu Zhao,

Thanks for your patches. See below...

On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> Hi Bharata,
> 
> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@....com> wrote:
>>
<snip>
>> 
>> Some experiments tried
>> ======================
>> 1) When MGLRU was enabled, many soft lockups were observed; no hard
>> lockups were seen during the 48-hour run. Below is one such soft lockup.
> 
> This is not really an MGLRU issue -- can you please try one of the
> attached patches? It (truncate.patch) should help with or without
> MGLRU.

With truncate.patch and the default LRU scheme, a few hard lockups were seen.

The first one is this:

watchdog: Watchdog detected hard LOCKUP on cpu 487
CPU: 487 PID: 11525 Comm: fio Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x81/0x300
Call Trace:
   <NMI>
   ? show_regs+0x69/0x80
   ? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
   ? native_queued_spin_lock_slowpath+0x81/0x300
   </NMI>
   <TASK>
   ? __pfx_folio_activate_fn+0x10/0x10
   _raw_spin_lock_irqsave+0x5b/0x70
   folio_lruvec_lock_irqsave+0x62/0x90
   folio_batch_move_lru+0x9d/0x160
   folio_activate+0x95/0xe0
   folio_mark_accessed+0x11f/0x160
   filemap_read+0x343/0x3d0
<SNIP>
   blkdev_read_iter+0x6f/0x140
   vfs_read+0x25b/0x340
   ksys_read+0x67/0xf0
   __x64_sys_read+0x19/0x20
   x64_sys_call+0x1771/0x20d0

This is the next one:

watchdog: Watchdog detected hard LOCKUP on cpu 219
CPU: 219 PID: 2584763 Comm: fs_racer_dir_cr Not tainted 6.10.0-rc3 #27
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
   <NMI>
   ? show_regs+0x69/0x80
   ? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
   ? native_queued_spin_lock_slowpath+0x2b4/0x300
   </NMI>
   <TASK>
   _raw_spin_lock_irqsave+0x5b/0x70
   folio_lruvec_lock_irqsave+0x62/0x90
   __page_cache_release+0x89/0x2f0
   folios_put_refs+0x92/0x230
   __folio_batch_release+0x74/0x90
   truncate_inode_pages_range+0x16f/0x520
   truncate_pagecache+0x49/0x70
   ext4_setattr+0x326/0xaa0
   notify_change+0x353/0x500
   do_truncate+0x83/0xe0
   path_openat+0xd9e/0x1090
   do_filp_open+0xaa/0x150
   do_sys_openat2+0x9b/0xd0
   __x64_sys_openat+0x55/0x90
   x64_sys_call+0xe55/0x20d0
   do_syscall_64+0x7e/0x130
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

When this happens, the all-CPU backtrace shows a CPU in 
isolate_lru_folios().

> 
>> kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
>> kernel: CPU: 29 PID: 2701649 Comm: fio Tainted: G             L
>> 6.10.0-rc3-mglru-irqstrc #24
>> kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel: Call Trace:
>> kernel:  <IRQ>
>> kernel:  ? show_regs+0x69/0x80
>> kernel:  ? watchdog_timer_fn+0x223/0x2b0
>> kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
>> <SNIP>
>> kernel:  </IRQ>
>> kernel:  <TASK>
>> kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
>> kernel:  ? native_queued_spin_lock_slowpath+0x2b4/0x300
>> kernel:  _raw_spin_lock+0x38/0x50
>> kernel:  clear_shadow_entry+0x3d/0x100
>> kernel:  ? __pfx_workingset_update_node+0x10/0x10
>> kernel:  mapping_try_invalidate+0x117/0x1d0
>> kernel:  invalidate_mapping_pages+0x10/0x20
>> kernel:  invalidate_bdev+0x3c/0x50
>> kernel:  blkdev_common_ioctl+0x5f7/0xa90
>> kernel:  blkdev_ioctl+0x109/0x270
>> kernel:  x64_sys_call+0x1215/0x20d0
>> kernel:  do_syscall_64+0x7e/0x130
>>
>> This happens to be contending on inode i_lock spinlock.
>>
>> Below preemptirqsoff trace points to preemption being disabled for more
>> than 10s and the lock in picture is lruvec spinlock.
> 
> Also if you could try the other patch (mglru.patch) please. It should
> help reduce unnecessary rotations from deactivate_file_folio(), which
> in turn should reduce the contention on the LRU lock for MGLRU.

Testing is currently in progress with mglru.patch and MGLRU enabled. 
I will get back with the results.

Regards,
Bharata.
