Message-ID: <CAGudoHH4N0eEQCpJqFioRCJx75WAO5n+kCA0XcRZ-914xFR0gw@mail.gmail.com>
Date: Wed, 10 Jul 2024 14:24:16 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Bharata B Rao <bharata@....com>
Cc: Yu Zhao <yuzhao@...gle.com>, david@...morbit.com, kent.overstreet@...ux.dev,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, nikunj@....com,
"Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>, Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz, kinseyho@...gle.com,
Mel Gorman <mgorman@...e.de>, linux-fsdevel@...r.kernel.org
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system
On Wed, Jul 10, 2024 at 2:04 PM Bharata B Rao <bharata@....com> wrote:
>
> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
> >> Some experiments tried
> >> ======================
> >> 1) When MGLRU was enabled, many soft lockups were observed; no hard
> >> lockups were seen during a 48-hour run. Below is one such soft lockup.
> <snip>
> >> The below preemptirqsoff trace points to preemption being disabled for
> >> more than 10s; the lock in the picture is the lruvec spinlock.
> >
> > Also if you could try the other patch (mglru.patch) please. It should
> > help reduce unnecessary rotations from deactivate_file_folio(), which
> > in turn should reduce the contention on the LRU lock for MGLRU.
>
> Thanks. With mglru.patch on an MGLRU-enabled system, the below latency
> trace is no longer seen over a 30-hour workload run.
>
> >
> >> # tracer: preemptirqsoff
> >> #
> >> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-mglru-irqstrc
> >> # --------------------------------------------------------------------
> >> # latency: 10382682 us, #4/4, CPU#128 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:512)
> >> # -----------------
> >> # | task: fio-2701523 (uid:0 nice:0 policy:0 rt_prio:0)
> >> # -----------------
> >> # => started at: deactivate_file_folio
> >> # => ended at: deactivate_file_folio
> >> #
> >> #
> >> # _------=> CPU#
> >> # / _-----=> irqs-off/BH-disabled
> >> # | / _----=> need-resched
> >> # || / _---=> hardirq/softirq
> >> # ||| / _--=> preempt-depth
> >> # |||| / _-=> migrate-disable
> >> # ||||| / delay
> >> # cmd pid |||||| time | caller
> >> # \ / |||||| \ | /
> >> fio-2701523 128...1. 0us$: deactivate_file_folio <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382681us : deactivate_file_folio <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382683us : tracer_preempt_on <-deactivate_file_folio
> >> fio-2701523 128.N.1. 10382691us : <stack trace>
> >> => deactivate_file_folio
> >> => mapping_try_invalidate
> >> => invalidate_mapping_pages
> >> => invalidate_bdev
> >> => blkdev_common_ioctl
> >> => blkdev_ioctl
> >> => __x64_sys_ioctl
> >> => x64_sys_call
> >> => do_syscall_64
> >> => entry_SYSCALL_64_after_hwframe
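
Side note for those not staring at mm/: the preempt-off window in the
trace above is a per-CPU folio batch being drained into the LRU under
the lruvec spinlock with IRQs disabled. Roughly, simplified from
mainline mm/swap.c (this is a sketch, not the actual mglru.patch, and
details may differ):

/*
 * Simplified sketch of the drain behind deactivate_file_folio();
 * the real folio_batch_move_lru() has more bookkeeping. The point:
 * the whole batch is walked while holding lruvec->lru_lock with
 * IRQs off, so a contended lock becomes a long preempt-off window.
 */
static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < folio_batch_count(fbatch); i++) {
		struct folio *folio = fbatch->folios[i];

		/* retakes lruvec->lru_lock (irq-off) when the lruvec changes */
		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
		move_fn(lruvec, folio);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
	/* drop the references taken when the folios were queued */
	folios_put(fbatch);
}

So anything that avoids queueing folios for rotation in the first place
directly shrinks that window, which is what the suggested patch is after.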
>
> However, the contention has now shifted to inode_hash_lock. Around 55
> soft lockups in ilookup() were observed:
>
> # tracer: preemptirqsoff
> #
> # preemptirqsoff latency trace v1.1.5 on 6.10.0-rc3-trnmglru
> # --------------------------------------------------------------------
> # latency: 10620430 us, #4/4, CPU#260 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:512)
> # -----------------
> # | task: fio-3244715 (uid:0 nice:0 policy:0 rt_prio:0)
> # -----------------
> # => started at: ilookup
> # => ended at: ilookup
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off/BH-disabled
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / _-=> migrate-disable
> # ||||| / delay
> # cmd pid |||||| time | caller
> # \ / |||||| \ | /
> fio-3244715 260...1. 0us$: _raw_spin_lock <-ilookup
> fio-3244715 260.N.1. 10620429us : _raw_spin_unlock <-ilookup
> fio-3244715 260.N.1. 10620430us : tracer_preempt_on <-ilookup
> fio-3244715 260.N.1. 10620440us : <stack trace>
> => _raw_spin_unlock
> => ilookup
> => blkdev_get_no_open
> => blkdev_open
> => do_dentry_open
> => vfs_open
> => path_openat
> => do_filp_open
> => do_sys_openat2
> => __x64_sys_openat
> => x64_sys_call
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> It appears that scalability issues with inode_hash_lock have been brought
> up multiple times in the past, and there were patches to address them:
>
> https://lore.kernel.org/all/20231206060629.2827226-9-david@fromorbit.com/
> https://lore.kernel.org/lkml/20240611173824.535995-2-mjguzik@gmail.com/
>
> CC'ing FS folks/list for awareness/comments.
Note my patch does not enable RCU usage in ilookup, but this can be
trivially added.

I can't even compile-test at the moment, but the diff below should do
it. Also note the patches are present here:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.inode.rcu
They are not yet integrated anywhere.

That said, if fio is operating on the same target inode every time,
then this is merely going to shift contention to the inode spinlock
usage in find_inode_fast.
diff --git a/fs/inode.c b/fs/inode.c
index ad7844ca92f9..70b0e6383341 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1524,10 +1524,14 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
+
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino, true);
-	spin_unlock(&inode_hash_lock);
+	inode = find_inode_fast(sb, head, ino, false);
+	if (IS_ERR_OR_NULL(inode)) {
+		spin_lock(&inode_hash_lock);
+		inode = find_inode_fast(sb, head, ino, true);
+		spin_unlock(&inode_hash_lock);
+	}
 
 	if (inode) {
 		if (IS_ERR(inode))
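
To illustrate that last point: even with the above diff applied, every
successful lookup still funnels through the per-inode spinlock. A rough
sketch of the lockless walk in find_inode_fast() as done in the branch
above -- written from memory, the real code differs (for example it
waits on inodes being freed instead of bailing out):

/*
 * locked == false: RCU walk of the hash chain, no inode_hash_lock.
 * locked == true would be the classic walk under the hash lock.
 */
static struct inode *find_inode_fast(struct super_block *sb,
				     struct hlist_head *head,
				     unsigned long ino, bool locked)
{
	struct inode *inode;

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		/*
		 * The global hash lock is gone from this path, but every
		 * hit still takes the per-inode lock. With fio hammering
		 * a single bdev inode, all 512 CPUs pile up right here.
		 */
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			rcu_read_unlock();
			/* have the caller retry under inode_hash_lock */
			return ERR_PTR(-EAGAIN);
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;
}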
--
Mateusz Guzik <mjguzik gmail.com>