linux-kernel - Re: [RFC 0/4] minimum folio order support in filemap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4270b5c7-04b4-28e0-6181-ef98d1f5130c@suse.de>
Date:   Thu, 22 Jun 2023 07:51:08 +0200
From:   Hannes Reinecke <hare@...e.de>
To:     Dave Chinner <david@...morbit.com>
Cc:     Pankaj Raghav <p.raghav@...sung.com>, willy@...radead.org,
        gost.dev@...sung.com, mcgrof@...nel.org, hch@....de,
        jwong@...nel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC 0/4] minimum folio order support in filemap

On 6/22/23 00:07, Dave Chinner wrote:
> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>> There has been a lot of discussion recently to support devices and fs for
>>> bs > ps. One of the main plumbing to support buffered IO is to have a minimum
>>> order while allocating folios in the page cache.
>>>
>>> Hannes sent recently a series[1] where he deduces the minimum folio
>>> order based on the i_blkbits in struct inode. This takes a different
>>> approach based on the discussion in that thread where the minimum and
>>> maximum folio order can be set individually per inode.
>>>
>>> This series is based on top of Christoph's patches to have iomap aops
>>> for the block cache[2]. I rebased his remaining patches to
>>> next-20230621. The whole tree can be found here[3].
>>>
>>> Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do a buffered
>>> IO on a nvme drive with bs>ps in QEMU without any issues:
>>>
>>> [root@...hlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
>>> 16384
>>> [root@...hlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M
>>> 		    -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
>>> io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
>>> fio-3.34
>>> Starting 1 process
>>> Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
>>> io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
>>>     read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
>>>     <snip>
>>> Run status group 0 (all jobs):
>>>      READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
>>>     WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
>>>
>>> Disk stats (read/write):
>>>     nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%
>>>
>>> One of the main dependency to work on a block device with bs>ps is
>>> Christoph's work on converting block device aops to use iomap.
>>>
>>> [1] https://lwn.net/Articles/934651/
>>> [2] https://lwn.net/ml/linux-kernel/20230424054926.26927-1-hch@lst.de/
>>> [3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
>>>
>>> Luis Chamberlain (1):
>>>     block: set mapping order for the block cache in set_init_blocksize
>>>
>>> Matthew Wilcox (Oracle) (1):
>>>     fs: Allow fine-grained control of folio sizes
>>>
>>> Pankaj Raghav (2):
>>>     filemap: use minimum order while allocating folios
>>>     nvme: enable logical block size > PAGE_SIZE
>>>
>>>    block/bdev.c             |  9 ++++++++
>>>    drivers/nvme/host/core.c |  2 +-
>>>    include/linux/pagemap.h  | 46 ++++++++++++++++++++++++++++++++++++----
>>>    mm/filemap.c             |  9 +++++---
>>>    mm/readahead.c           | 34 ++++++++++++++++++++---------
>>>    5 files changed, 82 insertions(+), 18 deletions(-)
>>>
>>
>> Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
>> of this work) to get 'brd' running with large folios.
>> And it even works this time, 'fsx' from the xfstest suite runs happily on
>> that.
> 
> So you've converted a filesystem to use bs > ps, too? Or is the
> filesystem that fsx is running on just using normal 4kB block size?
> If the latter, then fsx is not actually testing the large folio page
> cache support, it's mostly just doing 4kB aligned IO to brd....
> 
I have been running fsx on an xfs with bs=16k, and it worked like a charm.
I'll try to run the xfstest suite once I'm finished with merging
Pankajs patches into my patchset.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@...e.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman