Message-ID: <7104d206-7ea4-4471-bbc8-0513350ff8b3@amd.com>
Date: Mon, 22 Jul 2024 09:42:17 +0530
From: Bharata B Rao <bharata@....com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, nikunj@....com,
"Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>, mjguzik@...il.com
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system

On 20-Jul-24 1:51 AM, Yu Zhao wrote:
>> However during the weekend mglru-enabled run (with above fix to
>> isolate_lru_folios() and also the previous two patches: truncate.patch
>> and mglru.patch and the inode fix provided by Mateusz), another hard
>> lockup related to lruvec spinlock was observed.
>
> Thanks again for the stress tests.
>
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?

This is what a typical dstat report looks like when we start to see the
problem with the lruvec spinlock:
------memory-usage----- ----swap---
used free buff cach| used free|
14.3G 20.7G 1467G 185M| 938M 15G|
14.3G 20.0G 1468G 174M| 938M 15G|
14.3G 20.3G 1468G 184M| 938M 15G|
14.3G 19.8G 1468G 183M| 938M 15G|
14.3G 19.9G 1468G 183M| 938M 15G|
14.3G 19.5G 1468G 183M| 938M 15G|
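
(The report above was most likely collected with something along the
lines of "dstat --mem --swap 1"; the exact dstat invocation isn't part
of the original report, so treat that command line as an assumption.)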

As you can see, most of the usage is in the buffer cache and swap is
hardly used. Just to recap from the original post...
====
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout:

nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part

Though the workload includes many different runs, the combination that
triggers the problem is the buffered-IO run with the sync ioengine
(-direct=0 together with -ioengine=sync in the command below):

fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
====

Regards,
Bharata.