Message-ID: <7104d206-7ea4-4471-bbc8-0513350ff8b3@amd.com>
Date: Mon, 22 Jul 2024 09:42:17 +0530
From: Bharata B Rao <bharata@....com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, nikunj@....com,
"Upadhyay, Neeraj" <Neeraj.Upadhyay@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, willy@...radead.org, vbabka@...e.cz,
kinseyho@...gle.com, Mel Gorman <mgorman@...e.de>, mjguzik@...il.com
Subject: Re: Hard and soft lockups with FIO and LTP runs on a large system

On 20-Jul-24 1:51 AM, Yu Zhao wrote:
>> However during the weekend mglru-enabled run (with above fix to
>> isolate_lru_folios() and also the previous two patches: truncate.patch
>> and mglru.patch and the inode fix provided by Mateusz), another hard
>> lockup related to lruvec spinlock was observed.
>
> Thanks again for the stress tests.
>
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?

This is what a typical dstat report looks like when we start to see the
problem with the lruvec spinlock:
------memory-usage----- ----swap---
used free buff cach| used free|
14.3G 20.7G 1467G 185M| 938M 15G|
14.3G 20.0G 1468G 174M| 938M 15G|
14.3G 20.3G 1468G 184M| 938M 15G|
14.3G 19.8G 1468G 183M| 938M 15G|
14.3G 19.9G 1468G 183M| 938M 15G|
14.3G 19.5G 1468G 183M| 938M 15G|
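
(The report above was most likely collected with something along the
lines of "dstat --mem --swap 1"; the exact dstat invocation isn't part
of the original report, so treat that command line as an assumption.)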

As you can see, most of the usage is in the buffer cache and swap is
hardly used. Just to recap from the original post...
====
FIO is run with a size of 1TB on each NVME partition with different
combinations of ioengine/blocksize/mode parameters and buffered-IO.
Selected FS tests from LTP are run on 256GB partitions of all NVME
disks. This is the typical NVME partition layout:

nvme2n1 259:4 0 3.5T 0 disk
├─nvme2n1p1 259:6 0 256G 0 part /data_nvme2n1p1
└─nvme2n1p2 259:7 0 3.2T 0 part

Though the workload includes many different runs, the combination that
triggers the problem is the buffered-IO run with the sync ioengine
(-direct=0 together with -ioengine=sync in the command below):

fio -filename=/dev/nvme1n1p2 -direct=0 -thread -size=1024G \
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=4k \
-numjobs=400 -runtime=25000 --time_based -group_reporting -name=mytest
====

Regards,
Bharata.