Message-ID: <94d9e935-c8a4-896a-13ac-263831a78dd5@suse.de>
Date: Thu, 22 Jun 2023 08:50:06 +0200
From: Hannes Reinecke <hare@...e.de>
To: Dave Chinner <david@...morbit.com>
Cc: Pankaj Raghav <p.raghav@...sung.com>, willy@...radead.org,
gost.dev@...sung.com, mcgrof@...nel.org, hch@....de,
jwong@...nel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC 0/4] minimum folio order support in filemap
On 6/22/23 07:51, Hannes Reinecke wrote:
> On 6/22/23 00:07, Dave Chinner wrote:
>> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>>> There has been a lot of discussion recently about supporting devices
>>>> and filesystems with bs > ps. One of the main pieces of plumbing
>>>> needed to support buffered IO is a minimum order when allocating
>>>> folios in the page cache.
>>>>
>>>> Hannes recently sent a series[1] in which he deduces the minimum folio
>>>> order from i_blkbits in struct inode. Based on the discussion in that
>>>> thread, this series takes a different approach, where the minimum and
>>>> maximum folio order can be set individually per inode.
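
For illustration only, a minimal sketch of that idea: derive the minimum
order from the inode's block size and record it on the mapping. The helper
mapping_set_folio_min_order() is assumed here purely for illustration; the
interface the series actually adds may differ.

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Sketch only, not the actual patch: derive the minimum folio order from
 * the inode's block size and record it on the mapping, so that the page
 * cache never allocates a folio smaller than one filesystem block.
 */
static void example_set_min_folio_order(struct inode *inode)
{
	unsigned int min_order = 0;

	if (inode->i_blkbits > PAGE_SHIFT)
		min_order = inode->i_blkbits - PAGE_SHIFT;

	/* e.g. 16k block size with 4k pages yields order-2 folios at minimum */
	/* hypothetical helper recording the minimum order on the mapping */
	mapping_set_folio_min_order(inode->i_mapping, min_order);
}
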
>>>>
>>>> This series is based on top of Christoph's patches to have iomap aops
>>>> for the block cache[2]. I rebased his remaining patches to
>>>> next-20230621. The whole tree can be found here[3].
>>>>
>>>> Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do buffered
>>>> IO on an NVMe drive with bs>ps in QEMU without any issues:
>>>>
>>>> [root@...hlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
>>>> 16384
>>>> [root@...hlinux ~]# fio -bs=16k -iodepth=8 -rw=write
>>>> -ioengine=io_uring -size=500M
>>>> -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
>>>> io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W)
>>>> 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
>>>> fio-3.34
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
>>>> io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21
>>>> 07:58:29 2023
>>>> read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
>>>> <snip>
>>>> Run status group 0 (all jobs):
>>>> READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s
>>>> (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
>>>> WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s
>>>> (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
>>>>
>>>> Disk stats (read/write):
>>>> nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441,
>>>> in_queue=12725, util=96.27%
>>>>
>>>> One of the main dependencies for working on a block device with bs>ps
>>>> is Christoph's work on converting the block device aops to use iomap.
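
Roughly, that conversion boils down to giving the block device an iomap
->iomap_begin() that maps file offsets 1:1 onto the device. The sketch
below is simplified and only illustrative; the names and details in
Christoph's actual patches [2] may differ.

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/iomap.h>

/*
 * Simplified sketch of a block-device ->iomap_begin(); write paths and
 * most error handling are omitted.
 */
static int example_bdev_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned int flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	struct block_device *bdev = I_BDEV(inode);
	loff_t isize = i_size_read(inode);

	/* the block device maps file offsets 1:1 onto the device itself */
	iomap->bdev = bdev;
	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
	if (iomap->offset >= isize)
		return -EIO;
	iomap->type = IOMAP_MAPPED;
	iomap->addr = iomap->offset;
	iomap->length = isize - iomap->offset;
	return 0;
}
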
>>>>
>>>> [1] https://lwn.net/Articles/934651/
>>>> [2] https://lwn.net/ml/linux-kernel/20230424054926.26927-1-hch@lst.de/
>>>> [3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
>>>>
>>>> Luis Chamberlain (1):
>>>> block: set mapping order for the block cache in set_init_blocksize
>>>>
>>>> Matthew Wilcox (Oracle) (1):
>>>> fs: Allow fine-grained control of folio sizes
>>>>
>>>> Pankaj Raghav (2):
>>>> filemap: use minimum order while allocating folios
>>>> nvme: enable logical block size > PAGE_SIZE
>>>>
>>>>   block/bdev.c             |  9 ++++++++
>>>>   drivers/nvme/host/core.c |  2 +-
>>>>   include/linux/pagemap.h  | 46 ++++++++++++++++++++++++++++++++++++----
>>>>   mm/filemap.c             |  9 +++++---
>>>>   mm/readahead.c           | 34 ++++++++++++++++++++---------
>>>>   5 files changed, 82 insertions(+), 18 deletions(-)
>>>>
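
Purely as a sketch of the filemap change listed above (not the patch
itself), the allocation path would clamp the requested order so it never
drops below the mapping's minimum. mapping_min_folio_order() is an assumed
accessor used only for illustration.

#include <linux/pagemap.h>

static struct folio *example_filemap_alloc(struct address_space *mapping,
		gfp_t gfp, unsigned int order)
{
	/* hypothetical accessor for the minimum order recorded on the mapping */
	unsigned int min_order = mapping_min_folio_order(mapping);

	/* never hand out a folio smaller than one filesystem block */
	if (order < min_order)
		order = min_order;

	return filemap_alloc_folio(gfp, order);
}
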
>>>
>>> Hmm. Most unfortunate; I've just finished my own patchset (duplicating
>>> much of this work) to get 'brd' running with large folios.
>>> And it even works this time; 'fsx' from the xfstests suite runs happily
>>> on that.
>>
>> So you've converted a filesystem to use bs > ps, too? Or is the
>> filesystem that fsx is running on just using normal 4kB block size?
>> If the latter, then fsx is not actually testing the large folio page
>> cache support, it's mostly just doing 4kB aligned IO to brd....
>>
> I have been running fsx on an XFS filesystem with bs=16k, and it worked
> like a charm.
> I'll try to run the xfstests suite once I'm finished with merging
> Pankaj's patches into my patchset.
Well, would've been too easy.
'fsx' bails out at test 27 (collapse), with:
XFS (ram0): Corruption detected. Unmount and run xfs_repair
XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787 of file fs/xfs/libxfs/xfs_bmap.c. Caller xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
Guess some more work needs to be done here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@...e.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers,
Andrew McDonald, Martje Boudien Moerman