linux-kernel - Re: Hard and soft lockups with FIO and LTP runs on a large system

Message-ID: <893a263a-0038-4b4b-9031-72567b966f73@amd.com>
Date: Mon, 15 Jul 2024 10:49:53 +0530
From: Bharata B Rao <bharata@....com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, nikunj@....com,
 "Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
 kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>, mjguzik@...il.com
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system

On 11-Jul-24 11:13 AM, Bharata B Rao wrote:
> On 09-Jul-24 11:28 AM, Yu Zhao wrote:
>> On Mon, Jul 8, 2024 at 10:31 PM Bharata B Rao <bharata@....com> wrote:
>>>
>>> On 08-Jul-24 9:47 PM, Yu Zhao wrote:
>>>> On Mon, Jul 8, 2024 at 8:34 AM Bharata B Rao <bharata@....com> wrote:
>>>>>
>>>>> Hi Yu Zhao,
>>>>>
>>>>> Thanks for your patches. See below...
>>>>>
>>>>> On 07-Jul-24 4:12 AM, Yu Zhao wrote:
>>>>>> Hi Bharata,
>>>>>>
>>>>>> On Wed, Jul 3, 2024 at 9:11 AM Bharata B Rao <bharata@....com> wrote:
>>>>>>>
>>>>> <snip>
>>>>>>>
>>>>>>> Some experiments tried
>>>>>>> ======================
>>>>>>> 1) When MGLRU was enabled, many soft lockups were observed, but no
>>>>>>> hard lockups were seen during a 48-hour run. Below is one such soft
>>>>>>> lockup.
>>>>>>
>>>>>> This is not really an MGLRU issue -- can you please try one of the
>>>>>> attached patches? It (truncate.patch) should help with or without
>>>>>> MGLRU.
>>>>>
>>>>> With truncate.patch and default LRU scheme, a few hard lockups are 
>>>>> seen.
>>>>
>>>> Thanks.
>>>>
>>>> In your original report, you said:
>>>>
>>>>     Most of the times the two contended locks are lruvec and
>>>>     inode->i_lock spinlocks.
>>>>     ...
>>>>     Often times, the perf output at the time of the problem shows
>>>>     heavy contention on lruvec spin lock. Similar contention is
>>>>     also observed with inode i_lock (in clear_shadow_entry path)
>>>>
>>>> Based on this new report, does it mean the i_lock is not as contended,
>>>> for the same path (truncation) you tested? If so, I'll post
>>>> truncate.patch and add reported-by and tested-by you, unless you have
>>>> objections.
>>>
>>> truncate.patch has been tested on two systems with default LRU scheme
>>> and the lockup due to inode->i_lock hasn't been seen yet after 24 
>>> hours run.
>>
>> Thanks.
>>
>>>>
>>>> The two paths below were contended on the LRU lock, but they already
>>>> batch their operations. So I don't know what else we can do surgically
>>>> to improve them.
>>>
>>> What has been seen with this workload is that the lruvec spinlock is
>>> held for a long time from shrink_[active/inactive]_list path. In this
>>> path, there is a case in isolate_lru_folios() where scanning of LRU
>>> lists can become unbounded. To isolate a page from ZONE_DMA, sometimes
>>> scanning/skipping of more than 150 million folios were seen. There is
>>> already a comment in there which explains why nr_skipped shouldn't be
>>> counted, but is there any possibility of re-looking at this condition?
>>
>> For this specific case, probably this can help:
>>
>> @@ -1659,8 +1659,15 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>                  if (folio_zonenum(folio) > sc->reclaim_idx ||
>>                                  skip_cma(folio, sc)) {
>>                          nr_skipped[folio_zonenum(folio)] += nr_pages;
>> -                       move_to = &folios_skipped;
>> -                       goto move;
>> +                       list_move(&folio->lru, &folios_skipped);
>> +                       if (spin_is_contended(&lruvec->lru_lock)) {
>> +                               if (!list_empty(dst))
>> +                                       break;
>> +                               spin_unlock_irq(&lruvec->lru_lock);
>> +                               cond_resched();
>> +                               spin_lock_irq(&lruvec->lru_lock);
>> +                       }
>> +                       continue;
>>                  }
> 
> Thanks, this helped. With this fix, the test ran for 24hrs without any 
> lockups attributable to lruvec spinlock. As noted in this thread, 
> earlier isolate_lru_folios() used to scan millions of folios and spend a 
> lot of time with spinlock held but after this fix, such a scenario is no 
> longer seen.
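
The locking pattern in the quoted isolate_lru_folios() change can be
illustrated outside the kernel. What follows is a minimal user-space
sketch, not the kernel code: struct fake_folio, struct scan_result,
isolate_sketch() and the contended callback are hypothetical names
introduced here, and the unlock/cond_resched()/lock sequence is only
counted, not performed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only its zone index matters here. */
struct fake_folio {
	int zone;
};

/* Outcome of one simulated isolation pass. */
struct scan_result {
	size_t isolated;   /* folios taken for reclaim */
	size_t skipped;    /* folios from zones above reclaim_idx */
	size_t lock_drops; /* times the simulated lock was released */
	size_t scanned;    /* entries visited in total */
};

/*
 * Walk the list, skipping folios whose zone is above reclaim_idx.
 * On every skip, ask whether the (simulated) lock is contended:
 *  - if something has already been isolated, stop early so the
 *    caller can release the lock and work on what it has;
 *  - otherwise count a lock drop (standing in for unlock,
 *    cond_resched(), lock) and keep scanning.
 */
static struct scan_result
isolate_sketch(const struct fake_folio *lru, size_t n, int reclaim_idx,
	       size_t nr_to_isolate, int (*contended)(void))
{
	struct scan_result r = {0, 0, 0, 0};

	assert(contended != NULL);

	for (size_t i = 0; i < n && r.isolated < nr_to_isolate; i++) {
		r.scanned++;
		if (lru[i].zone > reclaim_idx) {
			r.skipped++;
			if (contended()) {
				if (r.isolated > 0)
					break;
				r.lock_drops++;
			}
			continue;
		}
		r.isolated++;
	}
	return r;
}

/* Two trivial contention models for experimenting with the sketch. */
static int always_contended(void) { return 1; }
static int never_contended(void) { return 0; }
```

With never_contended() the walk behaves like the unpatched loop and
visits every skipped entry in a single lock hold; with
always_contended() each skip either ends the pass or counts as a lock
drop, which is the property that bounds the hold time.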

However, during the weekend MGLRU-enabled run (with the above fix to
isolate_lru_folios(), the previous two patches, truncate.patch and
mglru.patch, and the inode fix provided by Mateusz), another hard
lockup related to the lruvec spinlock was observed.

Here is the hard lockup:

watchdog: Watchdog detected hard LOCKUP on cpu 466
CPU: 466 PID: 3103929 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:native_queued_spin_lock_slowpath+0x2b4/0x300
Call Trace:
   <NMI>
   ? show_regs+0x69/0x80
   ? watchdog_hardlockup_check+0x1b4/0x3a0
<SNIP>
   ? native_queued_spin_lock_slowpath+0x2b4/0x300
   </NMI>
   <IRQ>
   _raw_spin_lock_irqsave+0x5b/0x70
   folio_lruvec_lock_irqsave+0x62/0x90
   folio_batch_move_lru+0x9d/0x160
   folio_rotate_reclaimable+0xab/0xf0
   folio_end_writeback+0x60/0x90
   end_buffer_async_write+0xaa/0xe0
   end_bio_bh_io_sync+0x2c/0x50
   bio_endio+0x108/0x180
   blk_mq_end_request_batch+0x11f/0x5e0
   nvme_pci_complete_batch+0xb5/0xd0 [nvme]
   nvme_irq+0x92/0xe0 [nvme]
   __handle_irq_event_percpu+0x6e/0x1e0
   handle_irq_event+0x39/0x80
   handle_edge_irq+0x8c/0x240
   __common_interrupt+0x4e/0xf0
   common_interrupt+0x49/0xc0
   asm_common_interrupt+0x27/0x40

Here are the lock holder details captured by the all-CPU backtrace:

NMI backtrace for cpu 75
CPU: 75 PID: 3095650 Comm: fio Not tainted 6.10.0-rc3-trnct_nvme_lruvecresched_sirq_inode_mglru #32
RIP: 0010:folio_inc_gen+0x142/0x430
Call Trace:
   <NMI>
   ? show_regs+0x69/0x80
   ? nmi_cpu_backtrace+0xc5/0x130
   ? nmi_cpu_backtrace_handler+0x11/0x20
   ? nmi_handle+0x64/0x180
   ? default_do_nmi+0x45/0x130
   ? exc_nmi+0x128/0x1a0
   ? end_repeat_nmi+0xf/0x53
   ? folio_inc_gen+0x142/0x430
   ? folio_inc_gen+0x142/0x430
   ? folio_inc_gen+0x142/0x430
   </NMI>
   <TASK>
   isolate_folios+0x954/0x1630
   evict_folios+0xa5/0x8c0
   try_to_shrink_lruvec+0x1be/0x320
   shrink_one+0x10f/0x1d0
   shrink_node+0xa4c/0xc90
   do_try_to_free_pages+0xc0/0x590
   try_to_free_pages+0xde/0x210
   __alloc_pages_noprof+0x6ae/0x12c0
   alloc_pages_mpol_noprof+0xd9/0x220
   folio_alloc_noprof+0x63/0xe0
   filemap_alloc_folio_noprof+0xf4/0x100
   page_cache_ra_unbounded+0xb9/0x1a0
   page_cache_ra_order+0x26e/0x310
   ondemand_readahead+0x1a3/0x360
   page_cache_sync_ra+0x83/0x90
   filemap_get_pages+0xf0/0x6a0
   filemap_read+0xe7/0x3d0
   blkdev_read_iter+0x6f/0x140
   vfs_read+0x25b/0x340
   ksys_read+0x67/0xf0
   __x64_sys_read+0x19/0x20
   x64_sys_call+0x1771/0x20d0
   do_syscall_64+0x7e/0x130

Regards,
Bharata.