linux-kernel - Re: Hard and soft lockups with FIO and LTP runs on a large system

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOUHufbWNSEb6Mg4TOcnycD=aO0om0devY70NBqNNhwnwRzKvw@mail.gmail.com>
Date: Fri, 19 Jul 2024 01:06:16 -0600
From: Yu Zhao <yuzhao@...gle.com>
To: Bharata B Rao <bharata@....com>
Cc: Mateusz Guzik <mjguzik@...il.com>, Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, nikunj@....com, 
	"Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>, Andrew Morton <akpm@...ux-foundation.org>, 
	David Hildenbrand <david@...hat.com>, willy@...radead.org, kinseyho@...gle.com, 
	Mel Gorman <mgorman@...e.de>
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system

On Fri, Jul 19, 2024 at 12:16 AM Bharata B Rao <bharata@....com> wrote:
>
> On 18-Jul-24 5:41 PM, Mateusz Guzik wrote:
> > On Thu, Jul 18, 2024 at 11:00 AM Bharata B Rao <bharata@....com> wrote:
> >>
> >> On 17-Jul-24 4:59 PM, Mateusz Guzik wrote:
> >>> As for clear_shadow_entry mentioned in the opening mail, the content is:
> >>>           spin_lock(&mapping->host->i_lock);
> >>>           xa_lock_irq(&mapping->i_pages);
> >>>           __clear_shadow_entry(mapping, index, entry);
> >>>           xa_unlock_irq(&mapping->i_pages);
> >>>           if (mapping_shrinkable(mapping))
> >>>                   inode_add_lru(mapping->host);
> >>>           spin_unlock(&mapping->host->i_lock);
> >>>
> >>> so for all I know it's all about the xarray thing, not the i_lock per se.
> >>
> >> The soft lockup signature has _raw_spin_lock and not _raw_spin_lock_irq
> >> and hence concluded it to be i_lock.
> >
> > I'm not disputing it was i_lock. I am claiming that the i_pages is
> > taken immediately after and it may be that in your workload this is
> > the thing with the actual contention problem, making i_lock a red
> > herring.
> >
> > I tried to match up offsets to my own kernel binary, but things went haywire.
> >
> > Can you please resolve a bunch of symbols, like this:
> > ./scripts/faddr2line vmlinux clear_shadow_entry+92
> >
> > and then paste the source code from reported lines? (I presume you are
> > running with some local patches, so opening relevant files in my repo
> > may still give bogus resutls)
> >
> > Addresses are: clear_shadow_entry+92 __remove_mapping+98 __filemap_add_folio+332
>
> clear_shadow_entry+92
>
> $ ./scripts/faddr2line vmlinux clear_shadow_entry+92
> clear_shadow_entry+92/0x180:
> spin_lock_irq at include/linux/spinlock.h:376
> (inlined by) clear_shadow_entry at mm/truncate.c:51
>
> 42 static void clear_shadow_entry(struct address_space *mapping,
> 43                                struct folio_batch *fbatch, pgoff_t
> *indices)
> 44 {
> 45         int i;
> 46
> 47         if (shmem_mapping(mapping) || dax_mapping(mapping))
> 48                 return;
> 49
> 50         spin_lock(&mapping->host->i_lock);
> 51         xa_lock_irq(&mapping->i_pages);
>
>
> __remove_mapping+98
>
> $ ./scripts/faddr2line vmlinux __remove_mapping+98
> __remove_mapping+98/0x230:
> spin_lock_irq at include/linux/spinlock.h:376
> (inlined by) __remove_mapping at mm/vmscan.c:695
>
> 684 static int __remove_mapping(struct address_space *mapping, struct
> folio *folio,
> 685                             bool reclaimed, struct mem_cgroup
> *target_memcg)
> 686 {
> 687         int refcount;
> 688         void *shadow = NULL;
> 689
> 690         BUG_ON(!folio_test_locked(folio));
> 691         BUG_ON(mapping != folio_mapping(folio));
> 692
> 693         if (!folio_test_swapcache(folio))
> 694                 spin_lock(&mapping->host->i_lock);
> 695         xa_lock_irq(&mapping->i_pages);
>
>
> __filemap_add_folio+332
>
> $ ./scripts/faddr2line vmlinux __filemap_add_folio+332
> __filemap_add_folio+332/0x480:
> spin_lock_irq at include/linux/spinlock.h:377
> (inlined by) __filemap_add_folio at mm/filemap.c:878
>
> 851 noinline int __filemap_add_folio(struct address_space *mapping,
> 852                 struct folio *folio, pgoff_t index, gfp_t gfp, void
> **shadowp)
> 853 {
> 854         XA_STATE(xas, &mapping->i_pages, index);
>              ...
> 874         for (;;) {
> 875                 int order = -1, split_order = 0;
> 876                 void *entry, *old = NULL;
> 877
> 878                 xas_lock_irq(&xas);
> 879                 xas_for_each_conflict(&xas, entry) {
>
> >
> > Most notably in __remove_mapping i_lock is conditional:
> >          if (!folio_test_swapcache(folio))
> >                  spin_lock(&mapping->host->i_lock);
> >          xa_lock_irq(&mapping->i_pages);
> >
> > and the disasm of the offset in my case does not match either acquire.
> > For all I know i_lock in this routine is *not* taken and all the
> > queued up __remove_mapping callers increase i_lock -> i_pages wait
> > times in clear_shadow_entry.
>
> So the first two are on i_pages lock and the last one is xa_lock.

Isn't xa_lock also i_pages->xa_lock, i.e., the same lock?

> > To my cursory reading i_lock in clear_shadow_entry can be hacked away
> > with some effort, but should this happen the contention is going to
> > shift to i_pages presumably with more soft lockups (except on that
> > lock). I am not convinced messing with it is justified. From looking
> > at other places the i_lock is not a problem in other spots fwiw.
> >
> > All that said even if it is i_lock in both cases *and* someone whacks
> > it, the mm folk should look into what happens when (maybe i_lock ->)
> > i_pages lock is held. To that end perhaps you could provide a
> > flamegraph or output of perf record -a -g, I don't know what's
> > preferred.
>
> I have attached the flamegraph but this is for the kernel that has been
> running with all the accumulated fixes so far. The original one (w/o
> fixes) did show considerable time spent on
> native_queued_spin_lock_slowpath but unfortunately unable to locate it now.
>
> Regards,
> Bharata.